{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.3 回归分析\n",
    "\n",
    "**回归分析**：分析不同变量之间存在关系的研究。   \n",
    "**回归模型**：刻画不同变量之间关系的模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.1 回归分析的基本概念\n",
    "\n",
    "**数据**：下表给出了莫纳罗亚山从 1970 年到 2005 年间每 5 年的二氧化碳浓度，单位是百万分比浓度（parts per million，简称ppm）\n",
    "\n",
    "<table>\n",
    "    <h4 align=\"center\">莫纳罗亚山从 1970 年到 2005 年间每 5 年的二氧化碳浓度</h4>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">**年份 $x$ ** </th>\n",
    "        <td align=\"center\">1970</td>\n",
    "        <td align=\"center\">1975</td>\n",
    "        <td align=\"center\">1980</td> \n",
    "        <td align=\"center\">1985</td>\n",
    "        <td align=\"center\">1990</td>\n",
    "        <td align=\"center\">1995</td>\n",
    "        <td align=\"center\">2000</td>\n",
    "        <td align=\"center\">2005</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">**$CO_2$(ppm) $y$**</th>\n",
    "        <td align=\"center\">325.68</td>\n",
    "        <td align=\"center\">331.15</td>\n",
    "        <td align=\"center\">338.69</td> \n",
    "        <td align=\"center\">345.90</td>\n",
    "        <td align=\"center\">354.19</td>\n",
    "        <td align=\"center\">360.88</td>\n",
    "        <td align=\"center\">369.48</td>\n",
    "        <td align=\"center\">379.67</td>\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>\n",
    "\n",
    "\n",
    "**目标**：分析时间年份和二氧化碳浓度之间的关联关系，由此预测2010年二氧化碳浓度。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "x = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005])\n",
    "y = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67])\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"Co2\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该地区二氧化碳浓度在逐年缓慢增加，因此我们使用简单的**线性模型**来刻画时间年份和二氧化碳浓度两者之间的关系，即 $二氧化碳浓度 = a × 时间 + b$。 \n",
    "\n",
    "设时间年份为 $x$，二氧化碳浓度为 $y$，即 $y = ax + b$ 。\n",
    "\n",
    "通过上述数据来确定模型中 $a$ 和 $b$ 的值，一旦求解出 $a$ 和 $b$ 的值，输入任意的时间年份即可估算出该年份对应的二氧化碳浓度值。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.2 回归分析中参数计算\n",
    "\n",
    "最简单的线性回归是**一元线性回归模型**，只包含一个自变量 $x$ 和一个因变量 $y$，并且假定自变量和因变量之间存在 $y=ax+b$ 的线性关系。求解参数 $a$ 和 $b$，需要给定若干组 $(x,y)$ 数据，然后从这些数据出发来计算参数 $a$ 和 $b$。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在一元线性回归模型中，最关键的问题是如何计算参数 $a$ 和参数 $b$ 使误差最小化。\n",
    "\n",
    "最拟合直线  $y=ax+b$ 应该与这 8 组样本数据点距离都很近，最好的情况是这些样本数据点都在该直线上（不现实），让所有样本数据点离直线尽可能的近（被定义为预测数值和实际数值之间的差）。\n",
    "\n",
    "**预测值**：通过给定参数 $a$ 和 $b$ 计算 $ax+b$ 得到的值记为 $\\widetilde{y}=ax+b$\n",
    "\n",
    "**真实值**：每组数据中 $(x,y)$ 中对应的 $y$ 值\n",
    "\n",
    "**残差**：作为 $x$ 所对应的真实值 $y$ 和模型预测值 $\\widetilde{y}$ 之间误差的绝对值；在实际中一般使用$(y-\\widetilde{y})^2$作为残差。\n",
    "\n",
    "回归分析中，对于不同的参数，最佳回归模型是最小化残差平方和的均值，即要求 N 组 $(x,y)$ 数据得到的残差平均值 $\\frac{1}{N}\\sum{（y-\\widetilde{y}）^2}$ 最小。\n",
    "\n",
    "因此，给定的  8组 $(x,y)$数据，可通过最小二乘法来求解使得残差最小的 $a$ 和 $b$。\n",
    "\n",
    "8组 $(x,y)$ 样本数据点记为 $(x_1,y_1)$， $(x_2,y_2)$， ...， $(x_8,y_8)$， 时间年份变量 $x$ 的平均值 $\\overline{x}=\\frac{x_1+x_2+...+x_8}{8}$， 因变量 $y$ 的平均值为$\\overline{y}=\\frac{y_1+y_2+...+y_8}{8}$， 则：\n",
    "\n",
    "$a=\\frac{x_{1}y_{1}+x_2y_2+...+x_8y_8-8\\overline{x}\\overline{y}}{x_{1}^{2}+x_{2}^{2}+...+x_{8}^{2}-8\\overline{x}^2}=1.5344$\n",
    "\n",
    "$b = \\overline{y}-a\\overline{x}=-2698.9$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们根据上面的公式编写如下的方法来求解 $a$ 和 $b$。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cal_a_b(x, y):\n",
    "    \"\"\"\n",
    "    计算 x 和 y 的线性系数\n",
    "    :param x: np array 格式的自变量\n",
    "    :param y: np array 格式的因变量\n",
    "    :return: 系数 a 和 b\n",
    "    \"\"\"\n",
    "    # 计算 x 和 y 的平均数\n",
    "    x_avarage = np.sum(x) / len(x)\n",
    "    y_avarage = np.sum(y) / len(y)\n",
    "\n",
    "    # 两个临时变量用于后续计算参数 a 和 b\n",
    "    # m 为 x1*y1 + x2*y2 + ... \n",
    "    # n 为 x1*x1 + x2*x2 + ... \n",
    "    m = np.sum(x * y)\n",
    "    n = np.sum(x ** 2)\n",
    "\n",
    "    # 计算参数 a 和 b\n",
    "    a = (m - len(x) * x_avarage * y_avarage) / (\n",
    "                n - len(x) * x_avarage * x_avarage)\n",
    "    b = y_avarage - a * x_avarage\n",
    "    return a, b\n",
    "a, b = cal_a_b(x, y)\n",
    "print(a, b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "综上：预测莫纳罗亚山地区二氧化碳浓度的一元线性回归模型为：$y=1.5344x-2698.9$。  \n",
    "我们可以据此绘制出拟合直线。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 构造 y = ax + b 直线\n",
    "x_predict = np.linspace(1965, 2010, 1000)\n",
    "y_predict = a * x_predict + b\n",
    "\n",
    "# 绘图\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"Co2\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.plot(x_predict, y_predict, c='b')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们可以对该地区1970年之前和2005年之后的二氧化碳浓度进行估算。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例如，预测 2015 年的二氧化碳浓度\n",
    "a * 2015 + b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最终的预测结果汇总如下：  \n",
    "\n",
    "<table>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">**年份 $x$ ** </th>\n",
    "        <td align=\"center\">1960</td>\n",
    "        <td align=\"center\">1965</td>\n",
    "        <td align=\"center\">1970-2005</td> \n",
    "        <td align=\"center\">2010</td>\n",
    "        <td align=\"center\">2015</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">**$CO_2$(ppm) $y$**</th>\n",
    "        <td align=\"center\">308.51</td>\n",
    "        <td align=\"center\">316.18</td>\n",
    "        <td align=\"center\">已有数据</td> \n",
    "        <td align=\"center\">385.23</td>\n",
    "        <td align=\"center\">392.90</td>\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 探究莫纳罗亚山地区二氧化碳与温度之间的关系\n",
    "\n",
    "该地区 1970 年到 2005 年间每 5 年的二氧化碳浓度以及全球温度（相对于 1961 - 1990 年经过平滑处理的平均温度增长量）\n",
    "\n",
    "<table>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">$CO_2$(ppm) $x$</th>\n",
    "        <td align=\"center\">325.68</td>\n",
    "        <td align=\"center\">331.15</td>\n",
    "        <td align=\"center\">338.69</td> \n",
    "        <td align=\"center\">345.90</td>\n",
    "        <td align=\"center\">354.19</td>\n",
    "        <td align=\"center\">360.88</td>\n",
    "        <td align=\"center\">369.48</td>\n",
    "        <td align=\"center\">379.67</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">温度 $y$ </th>\n",
    "        <td align=\"center\">-0.108</td>\n",
    "        <td align=\"center\">-0.082</td>\n",
    "        <td align=\"center\">0.015</td>\n",
    "        <td align=\"center\">0.080</td>\n",
    "        <td align=\"center\">0.149</td>\n",
    "        <td align=\"center\">0.240</td>\n",
    "        <td align=\"center\">0.370</td>\n",
    "        <td align=\"center\">0.420</td>\n",
    "\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以使用上面同样的方法来求解得到参数 $a$ 和 $b$。并绘制出拟合直线。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 数据\n",
    "x = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67])\n",
    "y = np.array([-0.108, -0.082, 0.015, 0.080, 0.149, 0.24, 0.370, 0.420])\n",
    "\n",
    "# 计算参数 a 和 b\n",
    "a, b = cal_a_b(x, y)\n",
    "\n",
    "# 构造 y = ax + b 直线\n",
    "x_predict = np.linspace(325, 380, 1000)\n",
    "y_predict = a * x_predict + b\n",
    "\n",
    "# 绘图\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Co2\")\n",
    "plt.ylabel(\"Temperature\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.plot(x_predict, y_predict, c='b')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 思考与练习"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 摄氏温度（℃）和华氏温度（℉）是两种计量温度的标准，下表给出了两种温度之间的若干关系，如摄氏温度 0℃ 等于华氏温度 32℉。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "    <h4 align=\"center\">不同温度下测得摄氏/华氏温度表</h4>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">摄氏温度（℃） </th>\n",
    "        <td align=\"center\">0</td>\n",
    "        <td align=\"center\">10</td>\n",
    "        <td align=\"center\">15</td> \n",
    "        <td align=\"center\">20</td>\n",
    "        <td align=\"center\">25</td>\n",
    "        <td align=\"center\">30</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">华氏温度（℉）</th>\n",
    "        <td align=\"center\">32</td>\n",
    "        <td align=\"center\">50</td>\n",
    "        <td align=\"center\">59</td> \n",
    "        <td align=\"center\">68</td>\n",
    "        <td align=\"center\">77</td>\n",
    "        <td align=\"center\">86</td>\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "试判断摄氏温度和华氏温度之间是否符合线性关系。如符合，请通过线性回归分析计算出摄氏温度和华氏温度之间的线性回归方程。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先：我们观察一下摄氏华氏温度的散点图"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 数据\n",
    "x = np.array([0, 10, 15, 20, 25, 30])\n",
    "y = np.array([32, 50, 59, 68, 77, 86])\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"摄氏温度\")\n",
    "plt.ylabel(\"华氏温度\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过散点图，我们观察到摄氏温度和华氏温度是符合线性关系的。使用我们上面写好求解参数的方法，即可快速求解。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a, b = cal_a_b(x, y)\n",
    "print('参数 a 的值为：{:g}，参数 b 的值为：{:g}'.format(a, b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 构造 y = ax + b 直线\n",
    "x_predict = np.linspace(0, 30, 1000)\n",
    "y_predict = a * x_predict + b\n",
    "\n",
    "# 绘图\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"摄氏温度\")\n",
    "plt.ylabel(\"华氏温度\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.plot(x_predict, y_predict, c='b')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 摩尔定律是由英特尔创始人之一的戈登·摩尔提出，其基本内容为：当价格不变时，集成电路上可容纳的元器件的数目，大约每隔 18-24 个月变会增加一倍，性能也将提升一倍。下表记录了 1971-2004 年英特尔微处理器晶体管数量的增长。需要注意的是，随着单位面积上晶体管体积越来越小，摩尔定律所描述的晶体管增长在不久的将来会面临发展的极限。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|微处理器|推出年份($x$)|晶体管数量($y$)|$z$=log<sub>2</sub>$y$|\n",
    "|--|--|--|--|\n",
    "|4004|1971|2300|11.17|\n",
    "|8008|1972|2500|11.29|\n",
    "|8080|1974|4500|12.14|\n",
    "|8086|1978|29000|14.82|\n",
    "|Intel266|1982|134000|17.03|\n",
    "|Intel386~processor|1985|275000|18.07|\n",
    "|Intel486~processor|1989|1200000|20.19|\n",
    "|Intel Pentium processor|1993|3100000|21.56|\n",
    "|Intel Pentium Ⅱ processor|1997|7500000|22.84|\n",
    "|Intel Pentium Ⅲ processor|1999|9500000|23.18|\n",
    "|Intel Pentium 4 processor|2000|42000000|25.32|\n",
    "|Intel Itanium processor|2001|25000000|24.58|\n",
    "|Intel Itanium 2 processor|2003|220000000|27.72|\n",
    "|Intel Itanium 2 processor(9MB cache)|2004|592000000|29.14|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "摩尔定律刻画了晶体管数量与时间之间存在指数关系，可用非线性回归拟合来表示这种关系，非线性回归拟合超出了本教程的内容范围。不过我们可以对晶体管数量取以 2 为底的对数（记为 $z$ ），通过判断 $z$ 与时间 $x$ 之间是否存在线性关系，来验证摩尔定律。如果上述线性关系存在，使用线性回归方法计算之间的最佳拟合直线。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 年份\n",
    "x = np.array(\n",
    "    [1971, 1972, 1974, 1978, 1982, 1985, 1989, 1993, 1997, 1999, 2000, 2001,\n",
    "     2003, 2004])\n",
    "# 晶体管取以 2 为底的对数\n",
    "z = np.array(\n",
    "    [11.17, 11.29, 12.14, 14.82, 17.03, 18.07, 20.19, 21.56, 22.84, 23.18,\n",
    "     25.32, 24.58, 27.72, 29.14])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们绘图观察 $x$ 和 $z$ 之间的关系"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure()\n",
    "plt.xlabel(\"年份\")\n",
    "plt.ylabel(\"晶体管取以 2 为底的对数\")\n",
    "plt.scatter(x, z, c='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们看到 𝑧 与时间 𝑥 之间是否存在线性关系，我们用上面写好的方法来求解系数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a, b = cal_a_b(x, z)\n",
    "print('参数 a 的值为：{:g}，参数 b 的值为：{:g}'.format(a, b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 构造 y = ax + b 直线\n",
    "x_predict = np.linspace(1970, 2005, 1000)\n",
    "z_predict = a * x_predict + b\n",
    "\n",
    "# 绘图\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"年份\")\n",
    "plt.ylabel(\"晶体管取以 2 为底的对数\")\n",
    "plt.scatter(x, z, c='r')\n",
    "plt.plot(x_predict, z_predict, c='b')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
