{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1.3 机器学习常用的包"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3.1 `NumPy`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://imgbed.momodel.cn/1200px_NumPy_logo.svg.png\" width=300>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`NumPy(Numerical Python)`是一个开源的 **Python** 科学计算库,用于快速处理任意维度的数组。\n",
"\n",
"`NumPy` 支持常见的数组和矩阵操作。\n",
"\n",
"对于同样的数值计算任务,使用 `NumPy` 比直接使用 **Python** 要简洁的多。\n",
"\n",
"`NumPy` 使用 `ndarray` 对象来处理多维数组,该对象是一个快速而灵活的大数据容器。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `ndarray` 介绍"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`NumPy` 提供了一个`N` 维数组类型 `ndarray`,它描述了**相同类型**的 `items` 的集合。\n",
" \n",
"|语文|数学|英语|政治|体育|\n",
"|--|--|--|--|--|\n",
"|80|89|86|67|79|\n",
"|78|97|89|76|81|\n",
"\n",
"用 `ndarray` 进行存储:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建ndarray\n",
"score = np.array([[80, 89, 86, 67, 79],[78, 97, 89, 67, 81]])\n",
"\n",
"# 打印结果\n",
"score\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `ndarray` 的属性 \n",
"数组属性反映了数组本身固有的信息。\n",
"\n",
"|属性名字|\t属性解释|\n",
"|--|--|\n",
"|ndarray.shape|\t数组维度的元组|\n",
"|ndarray.ndim|\t数组维数|\n",
"|ndarray.size|\t数组中的元素数量|\n",
"|ndarray.itemsize|\t一个数组元素的长度(字节)|\n",
"|ndarray.dtype|\t数组元素的类型|\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ `shape`:数组形状"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建不同形状的数组\n",
"# 创建不同形状的数组\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"b = np.array([1,2,3,4])\n",
"c = np.array([\n",
" [\n",
" [1,2,3],[4,5,6]\n",
" ],\n",
" [\n",
" [1,2,3],[4,5,6]\n",
" ]\n",
"])\n",
"\n",
"# 分别打印出形状\n",
"print(a.shape)\n",
"print(b.shape)\n",
"print(c.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ `ndim`:数组维数"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建不同形状的数组\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"b = np.array([1,2,3,4])\n",
"c = np.array([[[1,2,3],[4,5,6]], [[1,2,3],[4,5,6]]])\n",
"\n",
"# 分别打印出维数\n",
"print(a.ndim)\n",
"print(b.ndim)\n",
"print(c.ndim)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ `size`:数组元素数量"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建不同形状的数组\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"b = np.array([1,2,3,4])\n",
"c = np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])\n",
"\n",
"# 分别打印出数组元素数量\n",
"print(a.size)\n",
"print(b.size)\n",
"print(c.size)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ `itemsize`:数组元素的长度"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建不同形状的数组\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"b = np.array([1,2,3,4])\n",
"c = np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,60]]])\n",
"\n",
"# 分别打印出数组元素数量\n",
"print(a.itemsize)\n",
"print(b.itemsize)\n",
"print(c.itemsize)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ `dtype`:数组元素的类型"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建不同形状的数组\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"b = np.array([1,2,3,4])\n",
"c = np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6.0]]])\n",
"\n",
"# 分别打印出数组元素数量\n",
"print(a.dtype)\n",
"print(b.dtype)\n",
"print(c.dtype)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `ndarray` 的类型\n",
"\n",
"|名称|\t描述|\t简写|\n",
"|--|--|--|\n",
"|np.bool|\t用一个字节存储的布尔类型(True或False)|\t'b'|\n",
"|np.int8|\t一个字节大小,-128 至 127|\t'i'|\n",
"|np.int16|\t整数,-32768 至 32767|\t'i2'|\n",
"|np.int32|\t整数,$-2^{31}$ 至 $2^{32} -1$\t|'i4'|\n",
"|np.int64|\t整数,$-2^{63}$ 至 $2^{63} - 1$\t|'i8'|\n",
"|np.uint8|\t无符号整数,0 至 255|\t'u'|\n",
"|np.uint16\t|无符号整数,0 至 65535|\t'u2'|\n",
"|np.uint32|\t无符号整数,0 至 $2^{32} - 1$\t|'u4'|\n",
"|np.uint64|\t无符号整数,0 至 $2^{64} - 1$ |'u8'|\n",
"|np.float16\t|半精度浮点数:16位,正负号1位,指数5位,精度10位\t|'f2'|\n",
"|np.float32\t|单精度浮点数:32位,正负号1位,指数8位,精度23位\t|'f4'|\n",
"|np.float64\t|双精度浮点数:64位,正负号1位,指数11位,精度52位\t|'f8'|\n",
"|np.complex64\t|复数,分别用两个32位浮点数表示实部和虚部\t|'c8'|\n",
"|np.complex128\t|复数,分别用两个64位浮点数表示实部和虚部\t|'c16'|\n",
"|np.object_\t|python对象\t|'O'|\n",
"|np.string_\t|字符串\t|'S'|\n",
"|np.unicode_\t|unicode类型\t|'U'|\n",
"\n",
"**注意:创建数组的时候指定类型**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# 创建数组时指定类型为 np.float32\n",
"a = np.array([[1, 2, 3],[4, 5, 6]], dtype=np.float32)\n",
"\n",
"# 创建数组时未指定类型\n",
"b = np.array([[1, 2, 3],[4, 5, 6]])\n",
"\n",
"# 打印结果\n",
"print(\"数组a:\\n%s,\\n数据类型:%s\"%(a,a.dtype))\n",
"print(\"数组b:\\n%s,\\n数据类型:%s\"%(b,b.dtype))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 基本操作"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 生成 `0 ` 和 `1` 数组的常见方法 \n",
"\n",
"+ 生成 `0` 的数组"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"zero = np.zeros([3, 4])\n",
"zero\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ 生成 `1` 的数组"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"one = np.ones([3,4])\n",
"one\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ 生成对角数组(对角线的地方是 `1`,其余地方是 `0`)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eyes = np.eye(10,5)\n",
"eyes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ 创建方阵对角矩阵"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# np.eye()输入数据相等则是方阵\n",
"eyes1 = np.eye(5)\n",
"eyes1\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 从现有数组生成"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a = [[1,2,3],[4,5,6]]\n",
"\n",
"# 从现有的数组当中创建\n",
"a1 = np.array(a)\n",
"a\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a1\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 生成固定范围的数组"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 生成等间隔的数组\n",
"a = np.linspace(0, 90, 10)\n",
"a\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 生成等间隔的数组\n",
"b = np.arange(0, 90, 10)\n",
"b\n"
]
},
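{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two calls above look alike but interpret their third argument differently. As a minimal sketch (the names `lin` and `rng` are only illustrative), `np.linspace` takes the number of samples and includes the stop value by default, while `np.arange` takes a step size and excludes the stop value:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# linspace: the third argument is the number of samples; the stop value 90 is included\n",
"lin = np.linspace(0, 90, 10)\n",
"\n",
"# arange: the third argument is the step; the stop value 90 is excluded\n",
"rng = np.arange(0, 90, 10)\n",
"\n",
"print(lin.shape, lin[-1])\n",
"print(rng.shape, rng[-1])\n",
"```\n",
"\n",
"Here `lin` holds 10 values ending at 90, while `rng` holds 9 values ending at 80."
]
},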
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 形状修改"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from numpy import array\n",
"a = array([[ 0, 1, 2, 3, 4, 5],\n",
" [10,11,12,13,14,15],\n",
" [20,21,22,23,24,25],\n",
" [30,31,32,33,34,35]])\n",
"a.shape\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 在转换形状的时候,一定要注意数组的元素匹配\n",
"# 只是将形状进行了修改,但并没有将行列进行转换\n",
"b = a.reshape([3,8])\n",
"b\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 数组的形状被修改为: (2, 12), -1: 表示通过待计算\n",
"c = a.reshape([-1,12])\n",
"c\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"d = a.T\n",
"d.shape\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 类型修改"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arr = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])\n",
"arr.dtype\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arr.astype(np.float32)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 数组去重"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arr = np.array([[1, 2, 3, 4],[3, 4, 5, 6]])\n",
"np.unique(arr)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数组运算\n",
"\n",
"数组的算术运算是元素级别的操作,新的数组被创建并且被结果填充。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"运算|函数\n",
"--- | --- \n",
"`a + b` | `add(a,b)`\n",
"`a - b` | `subtract(a,b)`\n",
"`a * b` | `multiply(a,b)`\n",
"`a / b` | `divide(a,b)`\n",
"`a ** b` | `power(a,b)`\n",
"`a % b` | `remainder(a,b)`\n",
"\n",
"以乘法为例,数组与标量相乘,相当于数组的每个元素乘以这个标量:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"a = np.array([1,2,3,4])\n",
"a * 3\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数组逐元素相乘:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a = np.array([1,2])\n",
"b = np.array([3,4])\n",
"a * b\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用函数"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.multiply(a, b)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"函数还可以接受第三个参数,表示将结果存入第三个参数中:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.multiply(a, b, a)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 矩阵 \n",
"使用 `mat` 方法将 `2` 维数组转化为矩阵:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"a = np.array([[1,2,4],\n",
" [2,5,3],\n",
" [7,8,9]])\n",
"A = np.mat(a)\n",
"A\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 也可以使用 **Matlab** 的语法传入一个字符串来生成矩阵:\n",
"A = np.mat('1,2,4;2,5,3;7,8,9')\n",
"A\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"矩阵与向量的乘法:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.array([[1], [2], [3]])\n",
"x\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A*x\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"b = np.array([[1,2],\n",
" [3,4],\n",
" [5,6]])\n",
"B = np.mat(b)\n",
"A*B\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`A.I` 表示 `A` 矩阵的逆矩阵:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A.I\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"矩阵指数表示矩阵连乘:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A ** 4\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 统计函数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|方法|作用|\n",
"|--|--|\n",
"|`a.sum(axis=None)`|求和|\n",
"|`a.prod(axis=None)`|求积|\n",
"|`a.min(axis=None)`|最小值|\n",
"|`a.max(axis=None)`|最大值|\n",
"|`a.argmin(axis=None)`|最小值索引|\n",
"|`a.argmax(axis=None)`|最大值索引|\n",
"|`a.ptp(axis=None)`|最大值减最小值|\n",
"|`a.mean(axis=None)`|平均值|\n",
"|`a.std(axis=None)`|标准差|\n",
"|`a.var(axis=None)`|方差|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"code_folding": []
},
"outputs": [],
"source": [
"from numpy import array\n",
"a = array([[1,2,3],\n",
" [4,5,6]])\n",
"a\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"求所有元素的和:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum(a)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a.sum()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**指定求和的维度**:\n",
"沿着第一维求和"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.sum(a, axis=0)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a.sum(axis=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"沿着第二维求和:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.sum(a, axis=1)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a.sum(axis=1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"沿着最后一维求和:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.sum(a, axis=-1)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a.sum(axis=-1)\n"
]
},
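{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other functions in the table accept `axis` in the same way. A small sketch on the same 2 x 3 array (the variable names are illustrative, not part of the original notebook):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a = np.array([[1, 2, 3], [4, 5, 6]])\n",
"\n",
"# Mean of each column (collapse the first axis)\n",
"col_mean = a.mean(axis=0)\n",
"\n",
"# Position of the largest value within each row (collapse the second axis)\n",
"row_argmax = a.argmax(axis=1)\n",
"\n",
"# Maximum minus minimum over all elements\n",
"value_range = np.ptp(a)\n",
"\n",
"print(col_mean, row_argmax, value_range)\n",
"```\n",
"\n",
"With `axis=None` (the default) the whole array is reduced to a single value, as in the `np.ptp(a)` call."
]
},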
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 比较和逻辑函数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"运算符|函数|\n",
":---: | :---: \n",
"`==` | `equal`\n",
"`!=` | `not_equal`\n",
"`>` | `greater`\n",
"`>=` | `greater_equal`\n",
"`<` | `less`\n",
"`<=` | `less_equal`\n",
"\n",
"数组元素的比对,我们可以直接使用运算符进行比较,比如判断数组中元素是否大于某个数:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from numpy import array\n",
"a = array([[ 0, 1, 2, 3, 4, 5],\n",
" [10,11,12,13,14,15],\n",
" [20,21,22,23,24,25],\n",
" [30,31,32,33,34,35]])\n",
"\n",
"a > 10\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 判断数组中元素大于10的元素赋值为 -10 \n",
"a[a > 10] = -10\n",
"a\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"但是当数组元素较多时,查看输出结果便变得很麻烦,这时我们可以使用`all()`方法,直接比对矩阵的所有对应的元素是否满足条件。假如判断某个区间的值是否全是大于 `20`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from numpy import array\n",
"a = array([[ 0, 1, 2, 3, 4, 5],\n",
" [10,11,12,13,14,15],\n",
" [20,21,22,23,24,25],\n",
" [30,31,32,33,34,35]])\n",
"\n",
"a[1:3,1:3]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.all(a[1:4,1:3] > 20)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"比如判断数组某个区间的元素是否存在大于 `20`的元素:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.any(a[1:4,1:3] > 20)\n"
]
},
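{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each operator in the table has an equivalent function. The sketch below (on a small illustrative array, not the one used above) checks that `>` and `np.greater` give the same mask, and counts the matches by summing the Boolean mask:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a = np.array([[0, 1, 2], [10, 11, 12], [20, 21, 22]])\n",
"\n",
"# Operator form and function form of the same comparison\n",
"mask_op = a > 10\n",
"mask_fn = np.greater(a, 10)\n",
"\n",
"# True counts as 1, so summing the mask counts the matching elements\n",
"same = np.array_equal(mask_op, mask_fn)\n",
"n_greater = int(mask_op.sum())\n",
"\n",
"print(same, n_greater)\n",
"```"
]
},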
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `IO` 操作"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`savetxt` 可以将数组写入文件,默认使用科学计数法的形式保存:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"data = np.array([[1,2],\n",
" [3,4]])\n",
"\n",
"# 保存文件\n",
"np.savetxt('out.txt', data)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 读取文件\n",
"with open('out.txt') as f:\n",
" for line in f:\n",
" print(line)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 读取文件\n",
"np.loadtxt('out.txt')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3.2 `Pandas`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://pandas.pydata.org/_static/pandas_logo.png\" width=300/>\n",
"\n",
"+ `Pandas` 是基于 `NumPy` 的一种工具,该工具是为了解决数据分析任务而创建的\n",
"+ `Pandas` 纳入了大量库及一些标准的数据模型,提供了高效的操作大型数据集所需要的工具\n",
"+ `Pandas` 提供了大量能使我们快速便捷地处理数据的函数与方法\n",
"+ 是 **Python** 成为强大而高效的数据分析环境的重要因素之一\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 产生 `Pandas` 对象"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas` 主要有两种基本的数据结构:\n",
"\n",
"- `Series`\n",
" - `Series` 是带索引的一维数组,可存储整数、浮点数、字符串、**Python** 对象等类型的数据。\n",
"- `DataFrame`\n",
" - `DataFrame` 是由多种类型的列构成的二维标签数据结构,类似于 `Excel` 、`SQL` 表,或 `Series` 对象构成的字典。`DataFrame` 是最常用的 `Pandas` 对象。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 生成 series\n",
"s = pd.Series([1,3,5,np.nan,6,8])\n",
"\n",
"print(s)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 生成 dataframe \n",
"dates = pd.date_range('20200101', periods=15)\n",
"\n",
"df = pd.DataFrame(np.random.randn(15,4), index=dates, columns=list('ABCD'))\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"默认情况下,如果不指定 `index` 参数和 `columns`,那么他们的值将用从 `0` 开始的数字替代。"
]
},
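{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the default labels, here is a minimal sketch; the name `df_default` is only illustrative:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# No index or columns given, so both fall back to 0, 1, 2, ...\n",
"df_default = pd.DataFrame(np.zeros((3, 2)))\n",
"\n",
"print(df_default.index.tolist())\n",
"print(df_default.columns.tolist())\n",
"```"
]
},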
{
"cell_type": "markdown",
"metadata": {},
"source": [
"写入 `csv` 文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv('foo.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"读取 `csv` 文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.read_csv('foo.csv',index_col=0)\n",
"df1.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`head` 和 `tail` 方法可以分别查看最前面几行和最后面几行的数据(默认为 `5`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1.tail(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"了解更多`Pandas`内容,可以参考:https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3.3 `Matplotlib`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://matplotlib.org/_static/logo2.svg\" width=300/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"简单来说,`Matplotlib` 是 **Python** 的一个绘图库。它包含了大量的工具,你可以使用这些工具创建各种图形,包括简单的散点图,正弦曲线,甚至是三维图形。\n",
"\n",
"**Python** 科学计算社区经常使用它完成数据可视化的工作。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 画一个简单的图形"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 简单的绘图\n",
"x = np.linspace(0, 2 * np.pi, 50)\n",
"\n",
"# 如果没有第一个参数 x,图形的 x 坐标默认为数组的索引\n",
"plt.plot(x, np.sin(x)) \n",
"\n",
"# 显示图形\n",
"plt.show() \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 在一张图上绘制两条曲线"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(0, 2 * np.pi, 50)\n",
"plt.plot(x, np.sin(x),\n",
" x, np.cos(x))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 自定义曲线的外观"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(0, 2 * np.pi, 50)\n",
"plt.plot(x, np.sin(x), 'r-^',\n",
" x, np.cos(x), 'g--')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **颜色**: \n",
" - 蓝色 - 'b' \n",
" - 绿色 - 'g' \n",
" - 红色 - 'r' \n",
" - 青色 - 'c' \n",
" - 品红 - 'm' \n",
" - 黄色 - 'y' \n",
" - 黑色 - 'k'('b'代表蓝色,所以这里用黑色的最后一个字母) \n",
" - 白色 - 'w'\n",
"\n",
"- 线: \n",
" - 直线 - '-' \n",
" - 虚线 - '--' \n",
" - 点线 - ':' \n",
" - 点划线 - '-.'\n",
"\n",
"- 常用点标记:\n",
" - 点 - '.' \n",
" - 像素 - ',' \n",
" - 圆 - 'o' \n",
" - 方形 - 's' \n",
" - 三角形 - '^' \n",
" \n",
"可以在[这里](http://matplotlib.org/api/markers_api.html)查看更多的样式"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用子图"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用子图可以在一个窗口绘制多张图。在调用 `plot()` 函数之前需要先调用 `subplot()` 函数。该函数的第一个参数代表子图的总行数,第二个参数代表子图的总列数,第三个参数代表活跃区域。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(0, 2 * np.pi, 50)\n",
"plt.subplot(2, 1, 1) # (行,列,活跃区)\n",
"plt.plot(x, np.sin(x), 'r')\n",
"plt.subplot(2, 1, 2)\n",
"plt.plot(x, np.cos(x), 'g')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 散点图"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"散点图是一堆离散点的集合。用 `Matplotlib` 画散点图也同样非常简单。只需要调用 `scatter()` 函数并传入两个分别代表 `x` 坐标和 `y` 坐标的数组即可。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 简单的散点图\n",
"x = np.linspace(0, 2 * np.pi, 50)\n",
"y = np.sin(x)\n",
"plt.scatter(x,y)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 调整点的大小和颜色"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以给每个点赋予不同的大小"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.random.rand(100)\n",
"y = np.random.rand(100)\n",
"size = np.random.rand(100) * 50\n",
"plt.scatter(x, y, size)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也可以给每个点赋予不同颜色。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.random.rand(100)\n",
"y = np.random.rand(100)\n",
"size = np.random.rand(100) * 50\n",
"color = np.random.rand(100)\n",
"plt.scatter(x, y, size, color)\n",
"plt.colorbar()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 直方图"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 `hist()` 函数可以非常方便的创建直方图。第二个参数代表分段的个数。分段越多,图形上的数据条就越多。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.random.randn(1000)\n",
"plt.hist(x, 50)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 标题,标签和图例"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"当需要快速创建图形时,你可能不需要为图形添加标签。但是当构建需要展示的图形时,你就需要添加标题,标签和图例。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(0, 2 * np.pi, 50)\n",
"plt.plot(x, np.sin(x), 'r-x', label='Sin(x)')\n",
"plt.plot(x, np.cos(x), 'g-^', label='Cos(x)')\n",
"\n",
"# 展示图例\n",
"plt.legend()\n",
"\n",
"# 给 x 轴添加标签\n",
"plt.xlabel('Rads')\n",
"\n",
"# 给 y 轴添加标签\n",
"plt.ylabel('Amplitude')\n",
"\n",
"# 添加图形标题\n",
"plt.title('Sin and Cos Waves')\n",
"\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 图片保存"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fruits = ['apple', 'orange', 'pear']\n",
"sales = [100,250,300]\n",
"plt.pie(sales, labels=fruits)\n",
"plt.savefig('pie.png')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以在这里查看更多的[图例](https://matplotlib.org/gallery.html)。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Seaborn` 基于 `matplotlib`, 可以快速的绘制一些统计图表。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import pandas as pd\n",
"sns.set()\n",
"iris = pd.read_csv(\"iris.csv\")\n",
"sns.jointplot(x=\"sepal_length\", y=\"petal_length\", data=iris)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(data=iris, hue=\"species\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以在这里查看更多的[示例](https://seaborn.pydata.org/tutorial.html)。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3.4 `Scikit-learn`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://imgbed.momodel.cn/scikitlearn.png\" width=300 />\n",
"\n",
"+ **Python** 语言的机器学习工具\n",
"+ `Scikit-learn` 包括大量常用的机器学习算法\n",
"+ `Scikit-learn` 文档完善,容易上手"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 机器学习算法"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**机器学习算法是一类从数据中自动分析获得规律,并利用规律对未知数据进行预测的算法**。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://imgbed.momodel.cn/q2nay75zew.png\" width=800>\n",
"\n",
"由图中,可以看到机器学习 `sklearn` 库的算法主要有四类:分类,回归,聚类,降维。其中:\n",
"\n",
"+ 常用的回归:线性、决策树、`SVM`、`KNN` ; \n",
" 集成回归:随机森林、`Adaboost`、`GradientBoosting`、`Bagging`、`ExtraTrees` \n",
"+ 常用的分类:线性、决策树、`SVM`、`KNN`,朴素贝叶斯; \n",
" 集成分类:随机森林、`Adaboost`、`GradientBoosting`、`Bagging`、`ExtraTrees` \n",
"+ 常用聚类:`k` 均值(`K-means`)、层次聚类(`Hierarchical clustering`)、`DBSCAN` \n",
"+ 常用降维:`LinearDiscriminantAnalysis`、`PCA` \n",
"\n",
"这个流程图代表:蓝色圆圈是判断条件,绿色方框是可以选择的算法,我们可以根据自己的数据特征和任务目标去找一条自己的操作路线。 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `sklearn` 数据集"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ `sklearn.datasets.load_*()`\n",
" + 获取小规模数据集,数据包含在 `datasets` 里\n",
"+ `sklearn.datasets.fetch_*(data_home=None)`\n",
" + 获取大规模数据集,需要从网络上下载,函数的第一个参数是 `data_home`,表示数据集下载的目录,默认是 `/scikit_learn_data/`\n",
" \n",
"`sklearn` 常见的数据集如下:\n",
"\n",
"||数据集名称|调用方式|适用算法|数据规模|\n",
"|--|--|--|--|--|\n",
"|小数据集|波士顿房价|load_boston()|回归|506\\*13|\n",
"|小数据集|鸢尾花数据集|load_iris()|分类|150\\*4|\n",
"|小数据集|糖尿病数据集|\tload_diabetes()|\t回归\t|442\\*10|\n",
"|大数据集|手写数字数据集|\tload_digits()|\t分类|\t5620\\*64|\n",
"|大数据集|Olivetti脸部图像数据集|\tfetch_olivetti_facecs|\t降维|\t400\\*64\\*64|\n",
"|大数据集|新闻分类数据集|\tfetch_20newsgroups()|\t分类|-|\t \n",
"|大数据集|带标签的人脸数据集|\tfetch_lfw_people()|\t分类、降维|-|\t \n",
"|大数据集|路透社新闻语料数据集|\tfetch_rcv1()|\t分类|\t804414\\*47236|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"# 获取鸢尾花数据集\n",
"iris = load_iris()\n",
"print(\"鸢尾花数据集的返回值:\\n\", iris.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据预处理\n",
"\n",
"通过**一些转换函数**将特征数据转换成**更加适合算法模型**的特征数据过程。常见的有数据标准化、数据二值化、标签编码、独热编码等。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 导入内建数据集\n",
"from sklearn.datasets import load_iris\n",
"\n",
"# 获取鸢尾花数据集\n",
"iris = load_iris()\n",
"\n",
"# 获得ndarray格式的变量X和标签y\n",
"X = iris.data\n",
"y = iris.target\n",
"\n",
"# 获得数据维度\n",
"n_samples, n_features = iris.data.shape\n",
"\n",
"print(n_samples, n_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 数据标准化\n",
"\n",
"数据标准化和归一化是将数据映射到一个小的浮点数范围内,以便模型能快速收敛。\n",
"\n",
"标准化有多种方式,常用的一种是min-max标准化(对象名为MinMaxScaler),该方法使数据落到[0,1]区间:\n",
"\n",
"$x^{'}=\\frac{x-x_{min}}{x_{max} - x_{min}}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# min-max标准化\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"sc = MinMaxScaler()\n",
"sc.fit(X)\n",
"results = sc.transform(X)\n",
"print(\"放缩前:\",X[1])\n",
"print(\"放缩后:\",results[1])\n"
]
},
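{
"cell_type": "markdown",
"metadata": {},
"source": [
"The min-max formula can also be applied by hand, column by column. The sketch below uses a small made-up array `X_demo` (an assumption for illustration, not the iris data) and plain `NumPy` rather than `MinMaxScaler`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical data: 3 samples, 2 features\n",
"X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])\n",
"\n",
"# Per-feature minimum and maximum (computed down each column)\n",
"x_min = X_demo.min(axis=0)\n",
"x_max = X_demo.max(axis=0)\n",
"\n",
"# x' = (x - x_min) / (x_max - x_min)\n",
"scaled = (X_demo - x_min) / (x_max - x_min)\n",
"print(scaled)\n",
"```\n",
"\n",
"Each column of `scaled` now runs from 0 to 1, with the middle row at one third of the way."
]
},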
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另一种是Z-score标准化(对象名为StandardScaler),该方法使数据满足标准正态分布:\n",
"\n",
"$x^{'}=\\frac{x-\\overline {X}}{S}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Z-score标准化\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"#将fit和transform组合执行\n",
"results = StandardScaler().fit_transform(X) \n",
"\n",
"print(\"放缩前:\",X[1])\n",
"print(\"放缩后:\",results[1])"
]
},
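{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough check of the Z-score formula, the same transformation can be written with plain `NumPy`. The array `X_demo` is made up for illustration. `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also the default of `np.std`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical data: 3 samples, 2 features\n",
"X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])\n",
"\n",
"# x' = (x - mean) / std, computed per feature (column)\n",
"z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)\n",
"\n",
"print(z.mean(axis=0))\n",
"print(z.std(axis=0))\n",
"```\n",
"\n",
"After the transformation each column has a mean of about 0 and a standard deviation of about 1, up to floating-point error."
]
},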
{
"cell_type": "markdown",
"metadata": {},
"source": [
"归一化(对象名为Normalizer,默认为L2归一化):\n",
"\n",
"$x^{'}=\\frac{x}{\\sqrt{\\sum_{j}^{m}x_{j}^2}}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 归一化\n",
"from sklearn.preprocessing import Normalizer\n",
"\n",
"results = Normalizer().fit_transform(X)\n",
"\n",
"print(\"放缩前:\",X[1])\n",
"print(\"放缩后:\",results[1])"
]
},
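{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike the two scalers above, which work feature by feature, L2 normalization works sample by sample (row by row). A minimal hand-written sketch on a made-up array `X_demo`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical data: 2 samples, 2 features\n",
"X_demo = np.array([[3.0, 4.0], [1.0, 0.0]])\n",
"\n",
"# L2 norm of each row; keepdims=True keeps shape (2, 1) so the division broadcasts\n",
"norms = np.sqrt((X_demo ** 2).sum(axis=1, keepdims=True))\n",
"\n",
"normalized = X_demo / norms\n",
"print(normalized)\n",
"```\n",
"\n",
"The row `[3, 4]` has norm 5 and becomes `[0.6, 0.8]`; every row of `normalized` has unit length."
]
},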
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 数据二值化\n",
"\n",
"使用阈值过滤器将数据转化为布尔值,即为二值化。使用Binarizer对象实现数据的二值化:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 二值化,阈值设置为3\n",
"from sklearn.preprocessing import Binarizer\n",
"\n",
"results = Binarizer(threshold=3).fit_transform(X)\n",
"\n",
"print(\"处理前:\",X[1])\n",
"print(\"处理后:\",results[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 标签编码\n",
"\n",
"使用 LabelEncoder 将不连续的数值或文本变量转化为有序的数值型变量:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 标签编码\n",
"from sklearn.preprocessing import LabelEncoder\n",
"LabelEncoder().fit_transform(['apple','pear','orange','banana'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 独热编码\n",
"\n",
"对于无序的离散型特征,其数值大小并没有意义,需要对其进行one-hot编码,将其特征的m个可能值转化为m个二值化特征。可以利用OneHotEncoder对象实现:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 独热编码\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"results = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()\n",
"\n",
"print(\"处理前:\",X[1])\n",
"print(\"处理后:\",results[1])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据集的划分\n",
"\n",
"机器学习一般的数据集会划分为两个部分:\n",
"+ 训练数据:用于训练,构建模型\n",
"+ 测试数据:在模型检验时使用,用于评估模型是否有效\n",
"\n",
"<br>\n",
"\n",
"划分比例:\n",
"+ 训练集:70% 80% 75%\n",
"+ 测试集:30% 20% 25%\n",
"\n",
"<br>\n",
"`sklearn.model_selection.train_test_split(x, y, test_size, random_state )`\n",
" + `x`:数据集的特征值\n",
" + `y`: 数据集的标签值\n",
" + `test_size`: 如果是浮点数,表示测试集样本占比;如果是整数,表示测试集样本的数量。\n",
" + `random_state`: 随机数种子,不同的种子会造成不同的随机采样结果。相同的种子采样结果相同。\n",
" + `return` 训练集的特征值 `x_train` 测试集的特征值 `x_test` 训练集的目标值 `y_train` 测试集的目标值 `y_test`。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# 加载数据集\n",
"iris = load_iris()\n",
"\n",
"# 对数据集进行分割\n",
"# 训练集的特征值x_train 测试集的特征值x_test 训练集的目标值y_train 测试集的目标值y_test\n",
"X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,test_size=0.3, random_state=22)\n",
"\n",
"print(\"x_train:\", X_train.shape)\n",
"print(\"y_train:\", y_train.shape)\n",
"print(\"x_test:\", X_test.shape)\n",
"print(\"y_test:\", y_test.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 定义模型"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 估计器(`Estimator`)\n",
"估计器,很多时候可以直接理解成分类器,主要包含两个函数:\n",
"\n",
"+ `fit()`:训练算法,设置内部参数。接收训练集和类别两个参数。\n",
"+ `predict()`:预测测试集类别,参数为测试集。\n",
"\n",
"大多数 `scikit-learn` 估计器接收和输出的数据格式均为 `NumPy`数组或类似格式。\n",
"\n",
"<br>\n",
"\n",
"#### 转换器(`Transformer`) \n",
"转换器用于数据预处理和数据转换,主要是三个方法:\n",
"\n",
"+ `fit()`:训练算法,设置内部参数。\n",
"+ `transform()`:数据转换。\n",
"+ `fit_transform()`:合并 `fit` 和 `transform` 两个方法。\n",
"\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在 `scikit-learn` 中,所有模型都有同样的接口供调用。监督学习模型都具有以下的方法:\n",
"+ `fit`:对数据进行拟合。\n",
"+ `set_params`:设定模型参数。\n",
"+ `get_params`:返回模型参数。\n",
"+ `predict`:在指定的数据集上预测。\n",
"+ `score`:返回预测器的得分。\n",
"\n",
"鸢尾花数据集是一个分类任务,故以决策树模型为例,采用默认参数拟合模型,并对验证集预测。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 决策树分类器\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# 定义模型\n",
"model = DecisionTreeClassifier()\n",
"\n",
"# 训练模型\n",
"model.fit(X_train, y_train)\n",
"\n",
"# 在测试集上预测\n",
"model.predict(X_test)\n",
"\n",
"# 测试集上的得分(默认为准确率)\n",
"model.score(X_test, y_test)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`scikit-learn` 中所有模型的调用方式都类似。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 模型评估\n",
"\n",
"评估模型的常用方法为 `K` 折交叉验证,它将数据集划分为 `K` 个大小相近的子集(`K` 通常取 `10`),每次选择其中(`K-1`)个子集的并集做为训练集,余下的做为测试集,总共得到 `K` 组训练集&测试集,最终返回这 `K` 次测试结果的得分,取其均值可作为选定最终模型的指标。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 交叉验证\n",
"from sklearn.model_selection import cross_val_score\n",
"cross_val_score(model, X, y, scoring=None, cv=10)\n"
]
},
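{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the splitting step concrete, here is a hand-written sketch of how `K` folds partition the sample indices. It uses plain `NumPy`, does not shuffle, and is not how `scikit-learn` implements it; the sizes `n_samples = 10` and `k = 5` are arbitrary choices for illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"n_samples, k = 10, 5\n",
"indices = np.arange(n_samples)\n",
"\n",
"# Cut the indices into k consecutive folds of (nearly) equal size\n",
"folds = np.array_split(indices, k)\n",
"\n",
"splits = []\n",
"for i in range(k):\n",
"    # Fold i is the test set; the other k-1 folds together form the training set\n",
"    test_idx = folds[i]\n",
"    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])\n",
"    splits.append((train_idx, test_idx))\n",
"    print(f\"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}\")\n",
"```\n",
"\n",
"Every sample lands in the test set exactly once, which is why the `K` scores together cover the whole dataset."
]
},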
{
"cell_type": "markdown",
"metadata": {},
"source": [
"注意:由于之前采用了 `train_test_split` 分割数据集,它默认对数据进行了洗牌,所以这里可以直接使用 `cv=10` 来进行 `10` 折交叉验证(`cross_val_score` 不会对数据进行洗牌)。如果之前未对数据进行洗牌,则要搭配使用 `KFold` 模块:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"n_folds = 10\n",
"kf = KFold(n_folds, shuffle=True).get_n_splits(X)\n",
"cross_val_score(model, X, y, scoring=None, cv = kf)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 保存与加载模型\n",
"\n",
"在训练模型后可将模型保存,以免下次重复训练。保存与加载模型使用 `sklearn` 的 `joblib`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
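"# sklearn.externals.joblib was removed from newer scikit-learn releases; import joblib directly\n",
"import joblib\n",
"\n",
"# Save the model\n",
"joblib.dump(model, 'myModel.pkl')\n",
"\n",
"# Load the model\n",
"model = joblib.load('myModel.pkl')\n",
"print(model)\n"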
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The small example below shows how to complete a machine learning project quickly with the `sklearn` toolkit."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
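"### Classifying iris flowers with logistic regression\n",
"\n",
"\n",
"**Linear regression**\n",
"\n",
"Before introducing logistic regression, we briefly review linear regression. Its main idea is to fit a straight line to historical data, assuming the dependent variable is a linear function of the independent variables, and to use that line to predict new data. The formula for linear regression is:\n",
"\n",
"$y = b+w_{1}x_{1}+\\dots+w_{n}x_{n}=w^{T}x+b$\n",
"\n",
"**Logistic regression**\n",
"\n",
"Logistic regression is a generalized linear model used for predictive analysis. Despite the word regression in its name, it is a classification method. Rather than predicting only a class label, it produces an approximate probability, which is useful for the many tasks where probabilities support decision making. It is widely used to estimate the probability that an instance belongs to a particular class, for example the probability that an `email` is spam. The dependent variable can be binary or multiclass. Because the output is a probability, the model can also be used as a `ranking model`. Logistic regression has many applications, such as click-through rate (`CTR`) prediction, weather forecasting, product-bundle recommendation in e-commerce, and search-ranking baselines in e-commerce.\n",
"\n",
"`sigmoid` **function**\n",
"\n",
"The `sigmoid` function has an S-shaped curve and maps any real input `z` to a value `y` in the open interval between `0` and `1`.  \n",
"$y = g(z)=\\frac{1}{1+e^{-z}}$, where $z = w^{T}x+b$  \n",
"\n",
"\n",
"**Iris dataset**\n",
"\n",
"`sklearn.datasets.load_iris()`: loads and returns the iris dataset.\n",
"\n",
"The `Iris` flower dataset is a commonly used dataset for classification experiments, introduced by `R.A. Fisher` in `1936`. It contains `3` species: `setosa`, `versicolor`, and `virginica`, with `50` samples per class and `150` samples in total.  \n",
"\n",
"|Variable|Description|Data type|\n",
"|--|--|--|\n",
"|sepal_length|Sepal length (cm)|numeric|\n",
"|sepal_width|Sepal width (cm)|numeric|\n",
"|petal_length|Petal length (cm)|numeric|\n",
"|petal_width|Petal width (cm)|numeric|\n",
"|species|Species|categorical|"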
]
},
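{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formulas above can be checked numerically. The sketch below implements the `sigmoid` function with `NumPy` and applies it to $z = w^{T}x+b$; the weights `w`, the bias `b`, and the sample `x` are hypothetical values chosen only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # g(z) = 1 / (1 + e^(-z)), maps any real number into (0, 1)\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Hypothetical parameters and input (not learned from data)\n",
"w = np.array([0.5, -1.0])\n",
"b = 0.25\n",
"x = np.array([2.0, 1.0])\n",
"\n",
"z = w @ x + b          # linear part: 0.5*2.0 - 1.0*1.0 + 0.25 = 0.25\n",
"p = sigmoid(z)         # estimated probability of the positive class\n",
"label = int(p >= 0.5)  # threshold at 0.5 to obtain a class label\n",
"\n",
"print(sigmoid(0.0))    # 0.5, the midpoint of the S-shaped curve\n",
"print(z, p, label)\n",
"```\n",
"\n",
"A positive `z` gives a probability above `0.5` and a negative `z` gives one below it, so thresholding `p` at `0.5` is the same as checking the sign of `z`."
]
},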
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Load the dataset and inspect it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"print(\"Keys of the iris dataset object:\\n\", iris.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Feature values of the sample at index 1:\\n\", iris[\"data\"][1])\n",
"print(\"Target values:\\n\", iris.target)\n",
"print(\"Feature names:\\n\", iris.feature_names)\n",
"print(\"Target names:\\n\", iris.target_names)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the feature matrix and the target vector\n",
"X = iris.data\n",
"y = iris.target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Split the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hold out 10% of the samples as the test set; random_state fixes the shuffle\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1, random_state=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Standardize the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Fit the scaler on the training set only, then reuse its statistics on the test set\n",
"transfer = StandardScaler()\n",
"X_train = transfer.fit_transform(X_train)\n",
"X_test = transfer.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Build the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
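"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# L2-regularized multinomial logistic regression fitted with the newton-cg solver.\n",
"# Note: the `multi_class` argument is deprecated in recent scikit-learn releases.\n",
"estimator = LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial')\n",
"estimator.fit(X_train, Y_train)"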
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5. Evaluate the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the learned weights and the accuracy on the training set\n",
"print(\"\\nLearned weights:\", estimator.coef_)\n",
"print(\"\\nLogistic Regression training-set accuracy: %.1f%%\" %(estimator.score(X_train, Y_train)*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6. Predict with the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import metrics\n",
"y_predict = estimator.predict(X_test)\n",
"print(\"\\nPredictions:\\n\", y_predict)\n",
"print(\"\\nPredictions equal to the true labels:\\n\", y_predict == Y_test)\n",
"\n",
"# Accuracy of the predictions\n",
"accuracy = metrics.accuracy_score(Y_test, y_predict)\n",
"print(\"\\nLogistic Regression test-set accuracy: %.1f%%\" %(accuracy*100))"
]
},
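{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since logistic regression estimates class probabilities rather than only labels, they can be inspected with `predict_proba`. The sketch below is self-contained and repeats the earlier steps with illustrative variable names; it assumes a `LogisticRegression` with default settings apart from `max_iter`, which is not the exact configuration used above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"X_demo, y_demo = load_iris(return_X_y=True)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, test_size=0.1, random_state=0\n",
")\n",
"\n",
"# Standardize with statistics learned from the training set\n",
"scaler = StandardScaler()\n",
"X_tr = scaler.fit_transform(X_tr)\n",
"X_te = scaler.transform(X_te)\n",
"\n",
"clf = LogisticRegression(max_iter=1000)\n",
"clf.fit(X_tr, y_tr)\n",
"\n",
"# One row per test sample, one column per class\n",
"proba = clf.predict_proba(X_te)\n",
"print(proba.shape)\n",
"print(proba[:3].round(3))\n",
"\n",
"# Each row is a probability distribution over the three species\n",
"print(np.allclose(proba.sum(axis=1), 1.0))\n",
"```\n",
"\n",
"The class with the highest probability in each row corresponds to the label returned by `predict`."
]
},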
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 7. Cross-validation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"import numpy as np\n",
"scores = cross_val_score(estimator, X, y, scoring=None, cv=10)  # cv is the number of folds\n",
"print(\"\\nCross-validation accuracy per fold:\", np.round(scores, 2))  # one accuracy value per fold\n",
"print(\"\\nCross-validation accuracy: %0.2f%% (+/- %0.2f%%)\" % (scores.mean()*100, scores.std()*2*100))  # mean +/- 2 standard deviations, in percent\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}