{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1.3 Common Packages for Machine Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3.1 `NumPy`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://imgbed.momodel.cn/1200px_NumPy_logo.svg.png\" width=300>\n"
   ]
  },
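  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`NumPy` (Numerical Python) is an open-source scientific computing library for **Python**, designed for fast processing of arrays of any dimension.\n",
    "\n",
    "`NumPy` supports the common array and matrix operations.\n",
    "\n",
    "For the same numerical task, code written with `NumPy` is much more concise than plain **Python**.\n",
    "\n",
    "`NumPy` handles multidimensional arrays through the `ndarray` object, a fast and flexible container for large datasets.\n"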
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introducing `ndarray`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`NumPy` provides an N-dimensional array type, `ndarray`, which describes a collection of items of the **same type**. Take the following table of scores as an example:\n",
    "\n",
    "|Chinese|Math|English|Politics|PE|\n",
    "|--|--|--|--|--|\n",
    "|80|89|86|67|79|\n",
    "|78|97|89|76|81|\n",
    "\n",
    "Stored in an `ndarray`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create an ndarray holding the scores from the table above\n",
    "score = np.array([[80, 89, 86, 67, 79], [78, 97, 89, 76, 81]])\n",
    "\n",
    "# Display the result\n",
    "score\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Attributes of `ndarray`\n",
    "\n",
    "Array attributes describe information intrinsic to the array itself.\n",
    "\n",
    "|Attribute|Description|\n",
    "|--|--|\n",
    "|ndarray.shape|Tuple of array dimensions|\n",
    "|ndarray.ndim|Number of dimensions|\n",
    "|ndarray.size|Number of elements in the array|\n",
    "|ndarray.itemsize|Size of one array element, in bytes|\n",
    "|ndarray.dtype|Type of the array elements|\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ `shape`: the shape of the array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create arrays of different shapes\n",
    "a = np.array([[1,2,3],[4,5,6]])\n",
    "b = np.array([1,2,3,4])\n",
    "c = np.array([\n",
    "    [\n",
    "        [1,2,3],[4,5,6]\n",
    "    ],\n",
    "    [\n",
    "        [1,2,3],[4,5,6]\n",
    "    ]\n",
    "])\n",
    "\n",
    "# Print the shape of each array\n",
    "print(a.shape)\n",
    "print(b.shape)\n",
    "print(c.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ `ndim`: the number of dimensions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create arrays of different shapes\n",
    "a = np.array([[1,2,3],[4,5,6]])\n",
    "b = np.array([1,2,3,4])\n",
    "c = np.array([[[1,2,3],[4,5,6]], [[1,2,3],[4,5,6]]])\n",
    "\n",
    "# Print the number of dimensions of each array\n",
    "print(a.ndim)\n",
    "print(b.ndim)\n",
    "print(c.ndim)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ `size`: the number of elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create arrays of different shapes\n",
    "a = np.array([[1,2,3],[4,5,6]])\n",
    "b = np.array([1,2,3,4])\n",
    "c = np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])\n",
    "\n",
    "# Print the number of elements in each array\n",
    "print(a.size)\n",
    "print(b.size)\n",
    "print(c.size)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ `itemsize`: the size of one element, in bytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create arrays of different shapes\n",
    "a = np.array([[1,2,3],[4,5,6]])\n",
    "b = np.array([1,2,3,4])\n",
    "c = np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,60]]])\n",
    "\n",
    "# Print the size in bytes of one element of each array\n",
    "print(a.itemsize)\n",
    "print(b.itemsize)\n",
    "print(c.itemsize)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ `dtype`: the type of the elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create arrays of different shapes; c contains a float (6.0)\n",
    "a = np.array([[1,2,3],[4,5,6]])\n",
    "b = np.array([1,2,3,4])\n",
    "c = np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6.0]]])\n",
    "\n",
    "# Print the element type of each array\n",
    "print(a.dtype)\n",
    "print(b.dtype)\n",
    "print(c.dtype)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data types of `ndarray`\n",
    "\n",
    "|Name|Description|Shorthand|\n",
    "|--|--|--|\n",
    "|np.bool_|Boolean (True or False), stored in one byte|'?'|\n",
    "|np.int8|Integer, one byte, -128 to 127|'i1'|\n",
    "|np.int16|Integer, -32768 to 32767|'i2'|\n",
    "|np.int32|Integer, $-2^{31}$ to $2^{31} - 1$|'i4'|\n",
    "|np.int64|Integer, $-2^{63}$ to $2^{63} - 1$|'i8'|\n",
    "|np.uint8|Unsigned integer, 0 to 255|'u1'|\n",
    "|np.uint16|Unsigned integer, 0 to 65535|'u2'|\n",
    "|np.uint32|Unsigned integer, 0 to $2^{32} - 1$|'u4'|\n",
    "|np.uint64|Unsigned integer, 0 to $2^{64} - 1$|'u8'|\n",
    "|np.float16|Half-precision float: 16 bits (1 sign bit, 5 exponent bits, 10 mantissa bits)|'f2'|\n",
    "|np.float32|Single-precision float: 32 bits (1 sign bit, 8 exponent bits, 23 mantissa bits)|'f4'|\n",
    "|np.float64|Double-precision float: 64 bits (1 sign bit, 11 exponent bits, 52 mantissa bits)|'f8'|\n",
    "|np.complex64|Complex number whose real and imaginary parts are two 32-bit floats|'c8'|\n",
    "|np.complex128|Complex number whose real and imaginary parts are two 64-bit floats|'c16'|\n",
    "|np.object_|Python object|'O'|\n",
    "|np.bytes_|Byte string (named `np.string_` before NumPy 2.0)|'S'|\n",
    "|np.str_|Unicode string (named `np.unicode_` before NumPy 2.0)|'U'|\n",
    "\n",
    "**Note: the type can be specified when the array is created.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Specify the type np.float32 when creating the array\n",
    "a = np.array([[1, 2, 3],[4, 5, 6]], dtype=np.float32)\n",
    "\n",
    "# Create an array without specifying the type\n",
    "b = np.array([[1, 2, 3],[4, 5, 6]])\n",
    "\n",
    "# Print the results\n",
    "print(\"Array a:\\n%s,\\ndtype: %s\" % (a, a.dtype))\n",
    "print(\"Array b:\\n%s,\\ndtype: %s\" % (b, b.dtype))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Common ways to create arrays of `0` and `1`\n",
    "\n",
    "+ Create an array of zeros"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "zero = np.zeros([3, 4])\n",
    "zero\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ Create an array of ones"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "one = np.ones([3,4])\n",
    "one\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ Create a diagonal array (`1` on the diagonal, `0` elsewhere)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eyes = np.eye(10,5)\n",
    "eyes\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ Create a square identity matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# With a single argument, np.eye(n) returns an n x n square matrix\n",
    "eyes1 = np.eye(5)\n",
    "eyes1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating an array from existing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = [[1,2,3],[4,5,6]]\n",
    "\n",
    "# Create an ndarray from an existing nested list\n",
    "a1 = np.array(a)\n",
    "a\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generating arrays over a fixed range"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evenly spaced values: 10 samples from 0 to 90, endpoint included\n",
    "a = np.linspace(0, 90, 10)\n",
    "a\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evenly spaced values: from 0 up to (but not including) 90, in steps of 10\n",
    "b = np.arange(0, 90, 10)\n",
    "b\n"
   ]
  },
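  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two calls above look similar but take different third arguments. The following minimal sketch contrasts them: for `linspace` the third argument is the number of samples and the endpoint is included by default, while for `arange` it is the step and the stop value is excluded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# linspace: 10 evenly spaced samples from 0 to 90, endpoint included\n",
    "a = np.linspace(0, 90, 10)\n",
    "# arange: values from 0 up to (but excluding) 90 in steps of 10\n",
    "b = np.arange(0, 90, 10)\n",
    "\n",
    "print(a.size, a[-1])  # 10 90.0\n",
    "print(b.size, b[-1])  # 9 80\n"
   ]
  },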
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Changing the shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "a = array([[ 0, 1, 2, 3, 4, 5],\n",
    "           [10,11,12,13,14,15],\n",
    "           [20,21,22,23,24,25],\n",
    "           [30,31,32,33,34,35]])\n",
    "a.shape\n"
   ]
  },
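  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# When reshaping, the total number of elements must stay the same\n",
    "# reshape only changes the shape; it does not swap rows and columns (it is not a transpose)\n",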
    "b = a.reshape([3,8])\n",
    "b\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The shape becomes (2, 12); -1 means that dimension is inferred from the others\n",
    "c = a.reshape([-1,12])\n",
    "c\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = a.T\n",
    "d.shape\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Changing the type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "arr = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])\n",
    "arr.dtype\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "arr.astype(np.float32)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Removing duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "arr = np.array([[1, 2, 3, 4],[3, 4, 5, 6]])\n",
    "np.unique(arr)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Array arithmetic\n",
    "\n",
    "Arithmetic on arrays is element-wise: a new array is created and filled with the result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operation|Function\n",
    "--- | --- \n",
    "`a + b` | `add(a,b)`\n",
    "`a - b` | `subtract(a,b)`\n",
    "`a * b` | `multiply(a,b)`\n",
    "`a / b` | `divide(a,b)`\n",
    "`a ** b` | `power(a,b)`\n",
    "`a % b` | `remainder(a,b)`\n",
    "\n",
    "Taking multiplication as an example, multiplying an array by a scalar multiplies every element by that scalar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "a = np.array([1,2,3,4])\n",
    "a * 3\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Element-wise multiplication of two arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = np.array([1,2])\n",
    "b = np.array([3,4])\n",
    "a * b\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the function form:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.multiply(a, b)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function also accepts a third argument (`out`), the array in which the result is stored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.multiply(a, b, a)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Matrices\n",
    "\n",
    "`np.asmatrix` (also available under the alias `np.mat` before NumPy 2.0) converts a 2-D array into a matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "a = np.array([[1,2,4],\n",
    "              [2,5,3],\n",
    "              [7,8,9]])\n",
    "A = np.asmatrix(a)\n",
    "A\n"
   ]
  },
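  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A matrix can also be built from a MATLAB-style string\n",
    "# (np.matrix is used here because the np.mat alias was removed in NumPy 2.0)\n",
    "A = np.matrix('1,2,4;2,5,3;7,8,9')\n",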
    "A\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiplying a matrix by a vector:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.array([[1], [2], [3]])\n",
    "x\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A*x\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b = np.array([[1,2],\n",
    "              [3,4],\n",
    "              [5,6]])\n",
    "B = np.asmatrix(b)\n",
    "A*B\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`A.I` is the inverse of the matrix `A`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A.I\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A matrix power denotes repeated matrix multiplication:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A ** 4\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Statistical functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|Method|Purpose|\n",
    "|--|--|\n",
    "|`a.sum(axis=None)`|Sum|\n",
    "|`a.prod(axis=None)`|Product|\n",
    "|`a.min(axis=None)`|Minimum|\n",
    "|`a.max(axis=None)`|Maximum|\n",
    "|`a.argmin(axis=None)`|Index of the minimum|\n",
    "|`a.argmax(axis=None)`|Index of the maximum|\n",
    "|`np.ptp(a, axis=None)`|Maximum minus minimum (the `a.ptp()` method was removed in NumPy 2.0)|\n",
    "|`a.mean(axis=None)`|Mean|\n",
    "|`a.std(axis=None)`|Standard deviation|\n",
    "|`a.var(axis=None)`|Variance|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "code_folding": []
   },
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "a = array([[1,2,3],\n",
    "           [4,5,6]])\n",
    "a\n"
   ]
  },
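  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sum of all elements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# np.sum adds up every element; Python's built-in sum(a) would only add along the first axis\n",
    "np.sum(a)\n"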
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.sum()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Specifying the axis to sum over**:\n",
    "\n",
    "Sum along the first axis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sum(a, axis=0)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.sum(axis=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sum along the second axis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sum(a, axis=1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.sum(axis=1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sum along the last axis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sum(a, axis=-1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.sum(axis=-1)\n"
   ]
  },
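  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The other statistical functions take the same `axis` argument as `sum`. As a brief sketch on the same 2 x 3 array, the cell below tries a few of them; it uses the function form `np.ptp`, which should work across NumPy versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "a = np.array([[1, 2, 3],\n",
    "              [4, 5, 6]])\n",
    "\n",
    "print(a.mean())           # 3.5, mean of all elements\n",
    "print(a.max(axis=0))      # [4 5 6], maximum of each column\n",
    "print(a.argmax())         # 5, index into the flattened array\n",
    "print(np.ptp(a, axis=1))  # [2 2], max minus min of each row\n"
   ]
  },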
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparison and logical functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operator|Function\n",
    ":---: | :---: \n",
    "`==` | `equal`\n",
    "`!=` | `not_equal`\n",
    "`>` | `greater`\n",
    "`>=` | `greater_equal`\n",
    "`<` | `less`\n",
    "`<=` | `less_equal`\n",
    "\n",
    "Array elements can be compared directly with these operators, for example to test whether each element is greater than some number:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "a = array([[ 0, 1, 2, 3, 4, 5],\n",
    "           [10,11,12,13,14,15],\n",
    "           [20,21,22,23,24,25],\n",
    "           [30,31,32,33,34,35]])\n",
    "\n",
    "a > 10\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set every element greater than 10 to -10\n",
    "a[a > 10] = -10\n",
    "a\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With many elements, inspecting the printed result becomes tedious. In that case `all()` checks directly whether every element satisfies a condition, for example whether all values in a region are greater than `20`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "a = array([[ 0, 1, 2, 3, 4, 5],\n",
    "           [10,11,12,13,14,15],\n",
    "           [20,21,22,23,24,25],\n",
    "           [30,31,32,33,34,35]])\n",
    "\n",
    "a[1:4,1:3]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.all(a[1:4,1:3] > 20)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, `any()` checks whether at least one element in a region is greater than `20`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.any(a[1:4,1:3] > 20)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `IO` operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`savetxt` writes an array to a text file, using scientific notation by default:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "data = np.array([[1,2],\n",
    "                 [3,4]])\n",
    "\n",
    "# Save the array to a file\n",
    "np.savetxt('out.txt', data)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the file line by line to inspect its raw contents\n",
    "with open('out.txt') as f:\n",
    "    for line in f:\n",
    "        print(line)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the file back into an array\n",
    "np.loadtxt('out.txt')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3.2 `Pandas`"
   ]
  },
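  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://pandas.pydata.org/_static/pandas_logo.png\" width=300/>\n",
    "\n",
    "+ `Pandas` is a tool built on top of `NumPy`, created for data analysis tasks\n",
    "+ `Pandas` incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets\n",
    "+ `Pandas` offers many functions and methods for processing data quickly and conveniently\n",
    "+ It is one of the key reasons **Python** is a powerful and efficient environment for data analysis\n"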
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
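  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating `Pandas` objects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas` has two basic data structures:\n",
    "\n",
    "- `Series`\n",
    "    - A `Series` is a one-dimensional array with an index; it can hold integers, floats, strings, **Python** objects, and other types of data.\n",
    "- `DataFrame`\n",
    "    - A `DataFrame` is a two-dimensional labeled data structure whose columns may have different types, similar to an `Excel` sheet, a `SQL` table, or a dictionary of `Series` objects. `DataFrame` is the most commonly used `Pandas` object.\n"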
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a Series\n",
    "s = pd.Series([1,3,5,np.nan,6,8])\n",
    "\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a DataFrame\n",
    "dates = pd.date_range('20200101', periods=15)\n",
    "\n",
    "df = pd.DataFrame(np.random.randn(15,4), index=dates, columns=list('ABCD'))\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, if `index` and `columns` are not specified, they are filled with integers starting from `0`."
   ]
  },
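  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of that default, the sketch below builds a `DataFrame` without passing `index` or `columns`. The name `df_default` is chosen for this example only, so that the `df` created earlier is left untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# No index or columns given: both default to integers starting from 0\n",
    "df_default = pd.DataFrame(np.arange(6).reshape(3, 2))\n",
    "\n",
    "print(df_default.index.tolist())    # [0, 1, 2]\n",
    "print(df_default.columns.tolist())  # [0, 1]\n"
   ]
  },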
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to a `csv` file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv('foo.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading a `csv` file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = pd.read_csv('foo.csv',index_col=0)\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`head` and `tail` show the first and last rows of the data respectively (`5` rows by default):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1.tail(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To learn more about `Pandas`, see: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3.3 `Matplotlib`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://matplotlib.org/_static/logo2.svg\" width=300/>"
   ]
  },
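  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In short, `Matplotlib` is a plotting library for **Python**. It includes a large set of tools for creating all kinds of figures, from simple scatter plots and sine curves to three-dimensional graphics.\n",
    "\n",
    "The **Python** scientific computing community uses it widely for data visualization."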
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Drawing a simple plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A simple plot\n",
    "x = np.linspace(0, 2 * np.pi, 50)\n",
    "\n",
    "# If the first argument x is omitted, the x coordinates default to the array indices\n",
    "plt.plot(x, np.sin(x)) \n",
    "\n",
    "# Show the figure\n",
    "plt.show() \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting two curves on one figure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(0, 2 * np.pi, 50)\n",
    "plt.plot(x, np.sin(x),\n",
    "         x, np.cos(x))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Customizing the appearance of curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(0, 2 * np.pi, 50)\n",
    "plt.plot(x, np.sin(x), 'r-^',\n",
    "         x, np.cos(x), 'g--')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **Color**:\n",
    "    - blue - 'b'\n",
    "    - green - 'g'\n",
    "    - red - 'r'\n",
    "    - cyan - 'c'\n",
    "    - magenta - 'm'\n",
    "    - yellow - 'y'\n",
    "    - black - 'k' ('b' is taken by blue, so black uses its last letter)\n",
    "    - white - 'w'\n",
    "\n",
    "- **Line style**:\n",
    "    - solid - '-'\n",
    "    - dashed - '--'\n",
    "    - dotted - ':'\n",
    "    - dash-dot - '-.'\n",
    "\n",
    "- **Common markers**:\n",
    "    - point - '.'\n",
    "    - pixel - ','\n",
    "    - circle - 'o'\n",
    "    - square - 's'\n",
    "    - triangle - '^'\n",
    "\n",
    "More styles are listed [here](http://matplotlib.org/api/markers_api.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using subplots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subplots make it possible to draw several plots in one window. Call `subplot()` before calling `plot()`. Its first argument is the total number of subplot rows, the second is the total number of columns, and the third selects the active subplot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(0, 2 * np.pi, 50)\n",
    "plt.subplot(2, 1, 1) # (rows, columns, active subplot)\n",
    "plt.plot(x, np.sin(x), 'r')\n",
    "plt.subplot(2, 1, 2)\n",
    "plt.plot(x, np.cos(x), 'g')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scatter plots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A scatter plot is a collection of discrete points. Drawing one with `Matplotlib` is just as simple: call `scatter()` and pass two arrays holding the `x` and `y` coordinates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A simple scatter plot\n",
    "x = np.linspace(0, 2 * np.pi, 50)\n",
    "y = np.sin(x)\n",
    "plt.scatter(x,y)\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjusting point size and color"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each point can be given a different size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.random.rand(100)\n",
    "y = np.random.rand(100)\n",
    "size = np.random.rand(100) * 50\n",
    "plt.scatter(x, y, size)\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each point can also be given a different color:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.random.rand(100)\n",
    "y = np.random.rand(100)\n",
    "size = np.random.rand(100) * 50\n",
    "color = np.random.rand(100)\n",
    "plt.scatter(x, y, size, color)\n",
    "plt.colorbar()\n",
    "plt.show()\n"
   ]
  },
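  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Histograms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `hist()` function makes it easy to create a histogram. Its second argument is the number of bins: the more bins, the more bars appear in the figure."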
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.random.randn(1000)\n",
    "plt.hist(x, 50)\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Titles, labels, and legends"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a quick plot you may not need any labels. For a figure that will be presented to others, however, you should add a title, axis labels, and a legend."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(0, 2 * np.pi, 50)\n",
    "plt.plot(x, np.sin(x), 'r-x', label='Sin(x)')\n",
    "plt.plot(x, np.cos(x), 'g-^', label='Cos(x)')\n",
    "\n",
    "# Show the legend\n",
    "plt.legend()\n",
    "\n",
    "# Label the x axis\n",
    "plt.xlabel('Rads')\n",
    "\n",
    "# Label the y axis\n",
    "plt.ylabel('Amplitude')\n",
    "\n",
    "# Add a title to the figure\n",
    "plt.title('Sin and Cos Waves')\n",
    "\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving figures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fruits = ['apple', 'orange', 'pear']\n",
    "sales = [100,250,300]\n",
    "plt.pie(sales, labels=fruits)\n",
    "plt.savefig('pie.png')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More examples are available in the [gallery](https://matplotlib.org/gallery.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Seaborn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Seaborn` is built on `matplotlib` and makes it quick to draw statistical charts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "sns.set()\n",
    "iris = pd.read_csv(\"iris.csv\")\n",
    "sns.jointplot(x=\"sepal_length\", y=\"petal_length\", data=iris)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(data=iris, hue=\"species\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More examples are available in the [tutorial](https://seaborn.pydata.org/tutorial.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3.4 `Scikit-learn`"
   ]
  },
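  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://imgbed.momodel.cn/scikitlearn.png\" width=300 />\n",
    "\n",
    "+ A machine learning toolkit for **Python**\n",
    "+ `Scikit-learn` includes a large number of commonly used machine learning algorithms\n",
    "+ `Scikit-learn` is well documented and easy to get started with"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Machine learning algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Machine learning algorithms automatically analyze data to find patterns, then use those patterns to make predictions on unseen data.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://imgbed.momodel.cn/q2nay75zew.png\" width=800>\n",
    "\n",
    "As the figure shows, the algorithms in the `sklearn` library fall into four main categories: classification, regression, clustering, and dimensionality reduction. Among them:\n",
    "\n",
    "+ Common regression methods: linear, decision tree, `SVM`, `KNN`;  \n",
    "    ensemble regression: random forest, `Adaboost`, `GradientBoosting`, `Bagging`, `ExtraTrees`\n",
    "+ Common classification methods: linear, decision tree, `SVM`, `KNN`, naive Bayes;  \n",
    "    ensemble classification: random forest, `Adaboost`, `GradientBoosting`, `Bagging`, `ExtraTrees`\n",
    "+ Common clustering methods: `k`-means (`K-means`), hierarchical clustering (`Hierarchical clustering`), `DBSCAN`\n",
    "+ Common dimensionality reduction methods: `LinearDiscriminantAnalysis`, `PCA`\n",
    "\n",
    "In this flowchart, the blue circles are decision conditions and the green boxes are candidate algorithms. We can follow the path that fits the characteristics of our data and the goal of our task."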
   ]
  },
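  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the categories above more concrete, here is a minimal sketch of one classification algorithm from the list, `KNN`, applied to the iris dataset. The variable names, the 70/30 split, `random_state=0`, and `n_neighbors=5` are illustrative assumptions, not settings prescribed by this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "# Load features and labels (names chosen for this sketch only)\n",
    "X_demo, y_demo = load_iris(return_X_y=True)\n",
    "\n",
    "# Hold out 30% of the samples for evaluation\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X_demo, y_demo, test_size=0.3, random_state=0)\n",
    "\n",
    "# Fit a k-nearest-neighbors classifier and report accuracy on the held-out set\n",
    "knn = KNeighborsClassifier(n_neighbors=5)\n",
    "knn.fit(X_train, y_train)\n",
    "print(\"Test accuracy:\", knn.score(X_test, y_test))\n"
   ]
  },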
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `sklearn` datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ `sklearn.datasets.load_*()`\n",
    "    + Loads a small dataset that ships with `datasets`\n",
    "+ `sklearn.datasets.fetch_*(data_home=None)`\n",
    "    + Fetches a larger dataset that has to be downloaded from the network; the `data_home` parameter sets the download directory, which defaults to `~/scikit_learn_data`\n",
    "\n",
    "Commonly used `sklearn` datasets:\n",
    "\n",
    "|Size|Dataset|Loader|Task|Shape|\n",
    "|--|--|--|--|--|\n",
    "|Small|Boston house prices|load_boston() (removed in scikit-learn 1.2)|Regression|506\\*13|\n",
    "|Small|Iris|load_iris()|Classification|150\\*4|\n",
    "|Small|Diabetes|load_diabetes()|Regression|442\\*10|\n",
    "|Small|Handwritten digits|load_digits()|Classification|1797\\*64|\n",
    "|Large|Olivetti faces|fetch_olivetti_faces()|Dimensionality reduction|400\\*64\\*64|\n",
    "|Large|20 newsgroups|fetch_20newsgroups()|Classification|-|\n",
    "|Large|Labeled Faces in the Wild|fetch_lfw_people()|Classification, dimensionality reduction|-|\n",
    "|Large|Reuters news corpus (RCV1)|fetch_rcv1()|Classification|804414\\*47236|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "# Load the iris dataset\n",
    "iris = load_iris()\n",
    "print(\"Keys of the iris dataset:\\n\", iris.keys())"
   ]
  },
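  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data preprocessing\n",
    "\n",
    "Preprocessing applies **transformation functions** to turn feature data into a form that is **better suited to the model**. Common steps include standardization, binarization, label encoding, and one-hot encoding."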
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import a built-in dataset\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "# Load the iris dataset\n",
    "iris = load_iris()\n",
    "\n",
    "# Get the features X and the labels y as ndarrays\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "# Get the dimensions of the data\n",
    "n_samples, n_features = iris.data.shape\n",
    "\n",
    "print(n_samples, n_features)"
   ]
  },
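  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data standardization\n",
    "\n",
    "Standardization and normalization map the data into a small floating-point range, which helps models converge faster.\n",
    "\n",
    "There are several ways to standardize. A common one is min-max scaling (class `MinMaxScaler`), which maps each feature into the interval [0, 1]:\n",
    "\n",
    "$x^{'}=\\frac{x-x_{min}}{x_{max} - x_{min}}$"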
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Min-max scaling\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "sc = MinMaxScaler()\n",
    "sc.fit(X)\n",
    "results = sc.transform(X)\n",
    "print(\"Before scaling:\", X[1])\n",
    "print(\"After scaling:\", results[1])\n"
   ]
  },
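  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check on the min-max formula, the sketch below applies it by hand with `NumPy`, column by column, and compares the result with `MinMaxScaler`. The names `X_iris` and `manual` are introduced for this example only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "X_iris = load_iris().data\n",
    "\n",
    "# Apply (x - x_min) / (x_max - x_min) to every feature (column)\n",
    "col_min = X_iris.min(axis=0)\n",
    "col_max = X_iris.max(axis=0)\n",
    "manual = (X_iris - col_min) / (col_max - col_min)\n",
    "\n",
    "# The hand-written formula agrees with MinMaxScaler\n",
    "print(np.allclose(manual, MinMaxScaler().fit_transform(X_iris)))  # True\n"
   ]
  },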
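  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another is Z-score standardization (class `StandardScaler`), which rescales each feature to zero mean and unit variance, where $\\overline{X}$ is the feature mean and $S$ its standard deviation. Note that this alone does not make the data normally distributed:\n",
    "\n",
    "$x^{'}=\\frac{x-\\overline {X}}{S}$"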
   ]
  },
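  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a hand-worked illustration of the Z-score formula, take a hypothetical column with the values $2$, $4$, and $6$ (made-up numbers). `StandardScaler` uses the population standard deviation, which divides by the number of samples:\n",
    "\n",
    "$\\overline{X}=\\frac{2+4+6}{3}=4,\\quad S=\\sqrt{\\frac{(2-4)^2+(4-4)^2+(6-4)^2}{3}}=\\sqrt{\\frac{8}{3}}\\approx 1.633$\n",
    "\n",
    "$x^{'}_{1}=\\frac{2-4}{1.633}\\approx -1.225,\\quad x^{'}_{2}=\\frac{4-4}{1.633}=0,\\quad x^{'}_{3}=\\frac{6-4}{1.633}\\approx 1.225$\n",
    "\n",
    "The scaled column has mean $0$ and standard deviation $1$."
   ]
  },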
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Z-score standardization\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Run fit and transform in a single call\n",
    "results = StandardScaler().fit_transform(X)\n",
    "\n",
    "print(\"Before scaling:\", X[1])\n",
    "print(\"After scaling:\", results[1])"
   ]
  },
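  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normalization (the `Normalizer` class, L2 norm by default) rescales each sample (row) to unit norm, where $x_{j}$ is the $j$-th of the sample's $m$ feature values:\n",
    "\n",
    "$x_{i}^{'}=\\frac{x_{i}}{\\sqrt{\\sum_{j=1}^{m}x_{j}^2}}$"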
   ]
  },
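  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, a hypothetical sample with two feature values $(3, 4)$ (made-up numbers) has L2 norm $\\sqrt{3^2+4^2}=5$, so normalizing it gives:\n",
    "\n",
    "$x^{'}=\\left(\\frac{3}{5},\\ \\frac{4}{5}\\right)=(0.6,\\ 0.8)$\n",
    "\n",
    "The result has unit length, since $0.6^2+0.8^2=1$. The scaling is computed per sample (row), not per feature column."
   ]
  },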
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalization (each sample scaled to unit L2 norm)\n",
    "from sklearn.preprocessing import Normalizer\n",
    "\n",
    "results = Normalizer().fit_transform(X)\n",
    "\n",
    "print(\"Before scaling:\", X[1])\n",
    "print(\"After scaling:\", results[1])"
   ]
  },
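  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data binarization\n",
    "\n",
    "Binarization applies a threshold to the data: values greater than the threshold become 1 and all other values become 0. The `Binarizer` class implements it:"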
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Binarization with the threshold set to 3\n",
    "from sklearn.preprocessing import Binarizer\n",
    "\n",
    "results = Binarizer(threshold=3).fit_transform(X)\n",
    "\n",
    "print(\"Before:\", X[1])\n",
    "print(\"After:\", results[1])"
   ]
  },
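  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Label encoding\n",
    "\n",
    "`LabelEncoder` maps distinct labels (numbers or strings) to the integers 0 to n_classes-1, assigned in the sorted order of the labels. It is intended for target labels rather than input features:\n"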
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Label encoding\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# Classes are sorted alphabetically: apple=0, banana=1, orange=2, pear=3\n",
    "LabelEncoder().fit_transform(['apple','pear','orange','banana'])"
   ]
  },
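  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### One-hot encoding\n",
    "\n",
    "For an unordered discrete feature, the numeric size of a value carries no meaning, so the feature should be one-hot encoded: its m possible values are turned into m binary features. The `OneHotEncoder` class implements this:"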
   ]
  },
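  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration, the iris target takes $m=3$ possible values (`0`, `1`, and `2`), so each label becomes a vector of three binary features containing a single `1`:\n",
    "\n",
    "|Label|One-hot vector|\n",
    "|--|--|\n",
    "|0|[1, 0, 0]|\n",
    "|1|[0, 1, 0]|\n",
    "|2|[0, 0, 1]|"
   ]
  },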
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One-hot encoding of the labels y\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "results = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()\n",
    "\n",
    "# Compare the label itself (not the features) with its encoding\n",
    "print(\"Before:\", y[1])\n",
    "print(\"After:\", results[1])\n"
   ]
  },
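  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting the dataset\n",
    "\n",
    "A machine learning dataset is usually split into two parts:\n",
    "+ Training data: used to fit the model\n",
    "+ Test data: held out to evaluate how well the fitted model works\n",
    "\n",
    "<br>\n",
    "\n",
    "Typical split ratios:\n",
    "+ Training set: 70%, 80%, or 75%\n",
    "+ Test set: 30%, 20%, or 25%, respectively\n",
    "\n",
    "<br>\n",
    "\n",
    "`sklearn.model_selection.train_test_split(x, y, test_size, random_state)`\n",
    "   +  `x`: the feature values of the dataset\n",
    "   +  `y`: the label values of the dataset\n",
    "   +  `test_size`: a float gives the fraction of samples placed in the test set; an integer gives the number of test samples.\n",
    "   +  `random_state`: the random seed. Different seeds produce different random splits; the same seed always produces the same split.\n",
    "   +  `return`: the training features `x_train`, the test features `x_test`, the training targets `y_train`, and the test targets `y_test`, in that order.\n"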
   ]
  },
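  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the float form of `test_size`: the iris dataset has $150$ samples, so `test_size=0.3` gives\n",
    "\n",
    "$150\\times 0.3=45$ test samples and $150-45=105$ training samples.\n",
    "\n",
    "With four features per sample, the training features should have shape `(105, 4)` and the test features `(45, 4)`."
   ]
  },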
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Load the dataset\n",
    "iris = load_iris()\n",
    "\n",
    "# Split the dataset\n",
    "# Returns, in order: training features, test features, training targets, test targets\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=22)\n",
    "\n",
    "print(\"X_train:\", X_train.shape)\n",
    "print(\"y_train:\", y_train.shape)\n",
    "print(\"X_test:\", X_test.shape)\n",
    "print(\"y_test:\", y_test.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining a model"
   ]
  },
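  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Estimators (`Estimator`)\n",
    "An estimator is an object that learns from data; in this tutorial it can usually be read as a classifier. It has two main methods:\n",
    "\n",
    "+ `fit()`: trains the algorithm and sets its internal parameters. It takes the training features and the class labels as arguments.\n",
    "+ `predict()`: predicts the classes of the test set. It takes the test features as its argument.\n",
    "\n",
    "Most `scikit-learn` estimators accept and return data as `NumPy` arrays or similar array-like formats.\n",
    "\n",
    "<br>\n",
    "\n",
    "#### Transformers (`Transformer`)  \n",
    "Transformers handle data preprocessing and data transformation. They have three main methods:\n",
    "\n",
    "+ `fit()`: learns the parameters of the transformation from the data.\n",
    "+ `transform()`: applies the transformation to the data.\n",
    "+ `fit_transform()`: combines `fit` and `transform` in a single call.\n",
    "\n",
    "<br>"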
   ]
  },
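  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `scikit-learn`, all models expose the same interface. Supervised learning models provide the following methods:\n",
    "+ `fit`: fits the model to the data.\n",
    "+ `set_params`: sets the model's parameters.\n",
    "+ `get_params`: returns the model's parameters.\n",
    "+ `predict`: predicts on the given dataset.\n",
    "+ `score`: returns the model's score on the given data.\n",
    "\n",
    "The iris dataset is a classification task, so the example below uses a decision tree classifier: it fits the model with default parameters and predicts on the test set."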
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decision tree classifier\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# Define the model\n",
    "model = DecisionTreeClassifier()\n",
    "\n",
    "# Train the model\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# Predict on the test set (printed, since a notebook only displays the last expression)\n",
    "print(model.predict(X_test))\n",
    "\n",
    "# Score on the test set (accuracy by default)\n",
    "print(model.score(X_test, y_test))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All models in `scikit-learn` are called in much the same way."
   ]
  },
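  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model evaluation\n",
    "\n",
    "A common way to evaluate a model is `K`-fold cross-validation. It splits the dataset into `K` subsets of similar size (`K` is often `10`). In each round, the union of `K-1` subsets serves as the training set and the remaining subset as the test set, giving `K` training/test pairs in total. The procedure returns the `K` test scores, and their mean can serve as the metric for choosing the final model."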
   ]
  },
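  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the fold sizes concrete, take the $150$ iris samples with $K=10$:\n",
    "\n",
    "$\\frac{150}{10}=15$ test samples per fold, and $150-15=135$ training samples per fold.\n",
    "\n",
    "Every sample is used for testing exactly once and for training in the other $K-1=9$ rounds."
   ]
  },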
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-validation with 10 folds\n",
    "from sklearn.model_selection import cross_val_score\n",
    "cross_val_score(model, X, y, scoring=None, cv=10)\n"
   ]
  },
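  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: `cross_val_score` does not shuffle the data, and the earlier `train_test_split` call has no effect here because the full `X` and `y` are passed in. With an integer `cv` and a classifier, `cross_val_score` uses stratified folds, so each fold keeps the class proportions even though the iris samples are ordered by class. To shuffle the samples before splitting, pass a `KFold` object with `shuffle=True` as `cv`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold, cross_val_score\n",
    "\n",
    "n_folds = 10\n",
    "# Pass the KFold object itself; get_n_splits() would only return the integer 10\n",
    "kf = KFold(n_splits=n_folds, shuffle=True)\n",
    "cross_val_score(model, X, y, scoring=None, cv=kf)\n"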
   ]
  },
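  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving and loading models\n",
    "\n",
    "A trained model can be saved to disk so that it does not have to be retrained next time. Saving and loading use the `joblib` library, which is installed as a dependency of `scikit-learn` (the old `sklearn.externals.joblib` import was removed in scikit-learn 0.23):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import joblib\n",
    "\n",
    "# Save the model\n",
    "joblib.dump(model, 'myModel.pkl')\n",
    "\n",
    "# Load the model\n",
    "model = joblib.load('myModel.pkl')\n",
    "print(model)\n"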
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The small example below shows how to use the `sklearn` toolkit to complete a machine learning project quickly."
   ]
  },
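  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classifying iris flowers with logistic regression\n",
    "\n",
    "\n",
    "**Linear regression**\n",
    "\n",
    "Before introducing logistic regression, a word on linear regression. Its main idea is to fit a straight line (a hyperplane when there are several features) to historical data, assuming the dependent variable is a linear function of the independent variables, and then to use that fit to predict new data. The formula is:\n",
    "\n",
    "$y = w_{1}x_{1}+...+w_{n}x_{n}+b=w^{T}x+b$\n",
    "\n",
    "**Logistic regression**\n",
    "\n",
    "Logistic regression is a generalized linear model. Despite the word \"regression\" in its name, it is a classification method. Rather than predicting only a class, it produces an approximate probability, which is useful for the many tasks that rely on probabilities to support decisions. It is widely used to predict the probability that an instance belongs to a particular class, for example the probability that an `email` is spam. The dependent variable can be binary or multi-class. Because the output is a probability, the model can also serve as a `ranking model` in addition to classification. Logistic regression has many applications, such as click-through rate (`CTR`) prediction, weather forecasting, outfit recommendation on e-commerce sites, and search-ranking baselines.\n",
    "\n",
    "`sigmoid` **function**\n",
    "\n",
    "The `Sigmoid` function has an S-shaped curve and maps any real value $z$ to a $y$ value strictly between `0` and `1`.  \n",
    "$y = g(z)=\\frac{1}{1+e^{-z}}$   where: $z = w^{T}x+b$ \n",
    "\n",
    "\n",
    "**The iris dataset**\n",
    "\n",
    "`sklearn.datasets.load_iris()`: loads and returns the iris dataset\n",
    "\n",
    "The `Iris` dataset is a widely used benchmark for classification experiments, introduced by `R.A. Fisher` in `1936`. It contains `3` iris species: `setosa`, `versicolor`, and `virginica`, with `50` samples per species and `150` samples in total.  \n",
    "\n",
    "|Variable|Description|Data type|\n",
    "|--|--|--|\n",
    "|sepal_length|Sepal length (cm)|numeric|\n",
    "|sepal_width|Sepal width (cm)|numeric|\n",
    "|petal_length|Petal length (cm)|numeric|\n",
    "|petal_width|Petal width (cm)|numeric|\n",
    "|species|Species|categorical|"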
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Load the dataset and inspect it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "# Load the iris dataset\n",
    "iris = load_iris()\n",
    "print(\"Keys of the returned iris Bunch:\\n\", iris.keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Features of the second sample:\\n\", iris[\"data\"][1])\n",
    "print(\"Iris target values:\\n\", iris.target)\n",
    "print(\"Iris feature names:\\n\", iris.feature_names)\n",
    "print(\"Iris target names:\\n\", iris.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the features and the targets\n",
    "X = iris.data\n",
    "y = iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Split the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Split the data: 10% of the samples go to the test set\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1, random_state=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Standardize the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "transfer  = StandardScaler()\n",
    "X_train = transfer.fit_transform(X_train)\n",
    "X_test = transfer.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Build the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# The L2 penalty is the default; recent scikit-learn versions deprecate the\n",
    "# multi_class argument and fit a multinomial model for multi-class targets with this solver\n",
    "estimator = LogisticRegression(solver='newton-cg')\n",
    "estimator.fit(X_train, Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. Evaluate the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5. Evaluate the model\n",
    "print(\"\\nLearned weights:\", estimator.coef_)\n",
    "print(\"\\nLogistic Regression training accuracy: %.1f%%\" % (estimator.score(X_train, Y_train)*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6. Predict with the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "y_predict = estimator.predict(X_test)\n",
    "print(\"\\nPredictions:\\n\", y_predict)\n",
    "print(\"\\nPredictions compared with the true values:\\n\", y_predict == Y_test)\n",
    "\n",
    "# Prediction accuracy on the test set\n",
    "accuracy = metrics.accuracy_score(Y_test, y_predict)\n",
    "print(\"\\nLogistic Regression test accuracy: %.1f%%\" % (accuracy*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7. Cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "import numpy as np\n",
    "scores = cross_val_score(estimator, X, y, scoring=None, cv=10)  # cv is the number of folds\n",
    "print(\"\\nCross-validation accuracy per fold:\", np.round(scores, 2))  # one accuracy score per fold\n",
    "print(\"\\nCross-validation accuracy: %0.2f%% (+/- %0.2f%%)\" % (scores.mean()*100, scores.std()*2*100))  # mean +/- 2 standard deviations, both in percent\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}