{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn Basics"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<img src=\"http://imgbed.momodel.cn/scikitlearn.png\" width=300 />\n",
"\n",
"+ A machine learning toolkit for the **Python** language\n",
"+ `Scikit-learn` includes a large number of commonly used machine learning algorithms\n",
"+ `Scikit-learn` is well documented and easy to get started with"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Machine Learning Algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"**Machine learning algorithms are a class of algorithms that automatically discover patterns in data and use those patterns to make predictions on unseen data.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<img src=\"http://imgbed.momodel.cn/q2nay75zew.png\" width=800>\n",
"\n",
"As the flowchart shows, the algorithms in the machine learning library `sklearn` fall into four main categories: classification, regression, clustering, and dimensionality reduction. Among them:\n",
"\n",
"+ Common regressors: linear, decision tree, `SVM`, `KNN`; \n",
"  ensemble regressors: random forest, `Adaboost`, `GradientBoosting`, `Bagging`, `ExtraTrees` \n",
"+ Common classifiers: linear, decision tree, `SVM`, `KNN`, naive Bayes; \n",
"  ensemble classifiers: random forest, `Adaboost`, `GradientBoosting`, `Bagging`, `ExtraTrees` \n",
"+ Common clustering: `k`-means (`K-means`), hierarchical clustering (`Hierarchical clustering`), `DBSCAN` \n",
"+ Common dimensionality reduction: `LinearDiscriminantAnalysis`, `PCA` \n",
"\n",
"In the flowchart, blue circles are decision points and green boxes are candidate algorithms; based on your data characteristics and task objective, you can trace your own path through it. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `sklearn` Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"+ `sklearn.datasets.load_*()`\n",
"    + Loads a small-scale dataset bundled inside `datasets`\n",
"+ `sklearn.datasets.fetch_*(data_home=None)`\n",
"    + Downloads a large-scale dataset from the internet; the first parameter, `data_home`, is the download directory and defaults to `~/scikit_learn_data/`\n",
" \n",
"Common `sklearn` datasets:\n",
"\n",
"||Dataset|How to load|Typical task|Size|\n",
"|--|--|--|--|--|\n",
"|Small|Boston housing|load_boston() (removed in scikit-learn 1.2)|regression|506\\*13|\n",
"|Small|Iris|load_iris()|classification|150\\*4|\n",
"|Small|Diabetes|load_diabetes()|regression|442\\*10|\n",
"|Small|Handwritten digits|load_digits()|classification|1797\\*64|\n",
"|Large|Olivetti faces|fetch_olivetti_faces()|dimensionality reduction|400\\*64\\*64|\n",
"|Large|20 newsgroups|fetch_20newsgroups()|classification|-|\n",
"|Large|Labeled Faces in the Wild|fetch_lfw_people()|classification, dimensionality reduction|-|\n",
"|Large|Reuters RCV1|fetch_rcv1()|classification|804414\\*47236|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"print(\"Keys of the iris dataset:\\n\", iris.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Preprocessing\n",
"\n",
"Preprocessing applies **transformation functions** to convert the raw features into features **better suited to the model**. Common steps include standardization, binarization, label encoding, and one-hot encoding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Import the built-in datasets\n",
"from sklearn.datasets import load_iris\n",
"\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"\n",
"# Get the features X and labels y as ndarrays\n",
"X = iris.data\n",
"y = iris.target\n",
"\n",
"# Get the data dimensions\n",
"n_samples, n_features = iris.data.shape\n",
"\n",
"print(n_samples, n_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Standardization\n",
"\n",
"Standardization and normalization map the data into a small floating-point range so that the model can converge quickly.\n",
"\n",
"There are several ways to standardize; a common one is min-max scaling (implemented by `MinMaxScaler`), which maps the data into the interval [0, 1]:\n",
"\n",
"$x^{'}=\\frac{x-x_{min}}{x_{max} - x_{min}}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Min-max scaling\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"sc = MinMaxScaler()\n",
"sc.fit(X)\n",
"results = sc.transform(X)\n",
"print(\"Before scaling:\", X[1])\n",
"print(\"After scaling:\", results[1])\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Another approach is Z-score standardization (implemented by `StandardScaler`), which rescales the data to zero mean and unit variance:\n",
"\n",
"$x^{'}=\\frac{x-\\overline {X}}{S}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Z-score standardization\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# fit and transform combined in one call\n",
"results = StandardScaler().fit_transform(X) \n",
"\n",
"print(\"Before scaling:\", X[1])\n",
"print(\"After scaling:\", results[1])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Normalization (implemented by `Normalizer`; L2 by default) rescales each sample to unit norm:\n",
"\n",
"$x^{'}=\\frac{x}{\\sqrt{\\sum_{j}^{m}x_{j}^2}}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Normalization\n",
"from sklearn.preprocessing import Normalizer\n",
"\n",
"results = Normalizer().fit_transform(X) \n",
"\n",
"print(\"Before scaling:\", X[1])\n",
"print(\"After scaling:\", results[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Binarization\n",
"\n",
"Binarization thresholds the data into boolean values. Use the `Binarizer` object to binarize data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Binarization with the threshold set to 3\n",
"from sklearn.preprocessing import Binarizer\n",
"\n",
"results = Binarizer(threshold=3).fit_transform(X)\n",
"\n",
"print(\"Before:\", X[1])\n",
"print(\"After:\", results[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Label Encoding\n",
"\n",
"`LabelEncoder` converts non-consecutive numeric or text labels into ordered integer labels:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Label encoding\n",
"from sklearn.preprocessing import LabelEncoder\n",
"LabelEncoder().fit_transform(['apple', 'pear', 'orange', 'banana'])"
]
},
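{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding can be reversed as well. As a small sketch (note that `LabelEncoder` assigns the integer codes in sorted label order), `inverse_transform` maps the codes back to the original labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Round-trip: label encoding and decoding\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"le = LabelEncoder()\n",
"codes = le.fit_transform(['apple', 'pear', 'orange', 'banana'])\n",
"print(codes)                        # integer codes, assigned in sorted label order\n",
"print(le.classes_)                  # the learned label order\n",
"print(le.inverse_transform(codes))  # back to the original labels"
]
},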
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### One-Hot Encoding\n",
"\n",
"For unordered categorical features the numeric values carry no meaning, so we one-hot encode them, turning a feature with m possible values into m binary features. Use the `OneHotEncoder` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# One-hot encoding\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"results = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()\n",
"\n",
"print(\"Before:\", y)\n",
"print(\"After:\", results[1])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting the Dataset\n",
"\n",
"A machine learning dataset is usually split into two parts:\n",
"+ Training data: used to train and build the model\n",
"+ Test data: used during evaluation to check whether the model generalizes\n",
"\n",
"<br>\n",
"\n",
"Typical split ratios:\n",
"+ Training set: 70%, 75%, or 80%\n",
"+ Test set: 30%, 25%, or 20%\n",
"\n",
"<br>\n",
"\n",
"`sklearn.model_selection.train_test_split(x, y, test_size, random_state)`\n",
"\n",
"  + `x`: feature values of the dataset\n",
"  + `y`: labels of the dataset\n",
"  + `test_size`: a float gives the fraction of samples in the test set; an integer gives the absolute number of test samples.\n",
"  + `random_state`: random seed; different seeds produce different random splits, while the same seed reproduces the same split.\n",
"  + Returns the training features `x_train`, test features `x_test`, training targets `y_train`, and test targets `y_test`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the dataset\n",
"iris = load_iris()\n",
"\n",
"# Split the dataset\n",
"# Training features X_train, test features X_test, training targets y_train, test targets y_test\n",
"X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=22)\n",
"\n",
"print(\"X_train:\", X_train.shape)\n",
"print(\"y_train:\", y_train.shape) \n",
"print(\"X_test:\", X_test.shape)\n",
"print(\"y_test:\", y_test.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defining a Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"#### Estimators (`Estimator`)\n",
"An estimator can often be thought of simply as a classifier. It exposes two main methods:\n",
"\n",
"+ `fit()`: trains the algorithm and sets its internal parameters; takes the training set and its labels as arguments.\n",
"+ `predict()`: predicts the classes of the test set, which it takes as its argument.\n",
"\n",
"Most `scikit-learn` estimators accept and return data as `NumPy` arrays or similar formats.\n",
"\n",
"<br>\n",
"\n",
"#### Transformers (`Transformer`) \n",
"Transformers are used for data preprocessing and conversion. They expose three main methods:\n",
"\n",
"+ `fit()`: trains the algorithm and sets its internal parameters.\n",
"+ `transform()`: transforms the data.\n",
"+ `fit_transform()`: combines `fit` and `transform` in a single call.\n",
"\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `scikit-learn`, all models are called through the same interface. Supervised learning models provide the following methods:\n",
"+ `fit`: fit the model to the data.\n",
"+ `set_params`: set the model's parameters.\n",
"+ `get_params`: return the model's parameters.\n",
"+ `predict`: predict on a given dataset.\n",
"+ `score`: return the model's score.\n",
"\n",
"The iris dataset is a classification task, so we take a decision tree as an example, fit it with default parameters, and predict on the held-out set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Decision tree classifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Define the model\n",
"model = DecisionTreeClassifier()\n",
"\n",
"# Train the model\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Predict on the test set\n",
"model.predict(X_test)\n",
"\n",
"# Score on the test set (accuracy by default)\n",
"model.score(X_test, y_test)\n"
]
},
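{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_params` and `set_params` methods from the shared interface were not used above. As a small sketch, they let us inspect and change the decision tree's hyperparameters without redefining the model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"clf = DecisionTreeClassifier()\n",
"\n",
"# Inspect a current hyperparameter (None by default)\n",
"print(clf.get_params()['max_depth'])\n",
"\n",
"# Change it in place and check again\n",
"clf.set_params(max_depth=3)\n",
"print(clf.get_params()['max_depth'])"
]
},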
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"All models in `scikit-learn` are called in much the same way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model Evaluation\n",
"\n",
"A common way to evaluate a model is `K`-fold cross-validation. It splits the dataset into `K` subsets of roughly equal size (`K` is usually `10`); each round takes the union of `K-1` subsets as the training set and the remaining subset as the test set, yielding `K` train/test pairs in total. The procedure returns the scores of these `K` test rounds, and their mean can serve as the criterion for choosing the final model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Cross-validation\n",
"from sklearn.model_selection import cross_val_score\n",
"cross_val_score(model, X, y, scoring=None, cv=10)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Note: because we split the data earlier with `train_test_split`, which shuffles by default, we can pass `cv=10` directly for `10`-fold cross-validation (`cross_val_score` does not shuffle the data itself). If the data has not been shuffled beforehand, combine it with the `KFold` module:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"n_folds = 10\n",
"# Build a shuffling K-fold splitter and pass it as the cv argument\n",
"kf = KFold(n_splits=n_folds, shuffle=True)\n",
"cross_val_score(model, X, y, scoring=None, cv=kf)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving and Loading Models\n",
"\n",
"After training, the model can be saved to disk so it does not have to be retrained next time. Models are saved and loaded with `joblib`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# joblib used to be bundled as sklearn.externals.joblib; it is now a standalone package\n",
"import joblib\n",
"\n",
"# Save the model\n",
"joblib.dump(model, 'myModel.pkl')\n",
"\n",
"# Load the model\n",
"model = joblib.load('myModel.pkl')\n",
"print(model)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, a small example shows how to use the `sklearn` toolkit to complete a machine learning project quickly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classifying Iris Flowers with Logistic Regression\n",
"\n",
"\n",
"**Linear regression**\n",
"\n",
"Before introducing logistic regression, a word on linear regression. Its main idea is to fit a straight line through historical data, assuming a linear relationship between the dependent and independent variables, and then use that line to predict on new data. The linear regression formula is:\n",
"\n",
"$y = w_{0}+w_{1}x_{1}+...+w_{n}x_{n}=w^{T}x+b$\n",
"\n",
"**Logistic regression**\n",
"\n",
"Logistic regression is a generalized linear model used for predictive analysis. Although its name contains \"regression\", it is in fact a classification method. Rather than predicting only a class label, it yields an approximate probability, which is useful for the many tasks where decisions are informed by probabilities, such as the probability that an `email` is spam. The dependent variable can be binary or multi-class, and because the output is a probability, it can also be used as a `ranking model` beyond plain classification. Logistic regression has many applications, such as click-through rate prediction (`CTR`), weather forecasting, product-bundle recommendation in e-commerce, and search-ranking baselines.\n",
"\n",
"**The `sigmoid` function**\n",
"\n",
"The `Sigmoid` function has an S-shaped curve and maps any value to a `y` value between `0` and `1`. \n",
"$y = g(z)=\\frac{1}{1+e^{-z}}$ where $z = w^{T}x+b$ \n",
"\n",
"\n",
"**The iris dataset**\n",
"\n",
"<center><img src=\"http://imgbed.momodel.cn//20200324144418.png\" width=700></center>\n",
"\n",
"`sklearn.datasets.load_iris()`: loads and returns the iris dataset\n",
"\n",
"The `Iris` dataset is a classic classification benchmark collected and curated by `R.A. Fisher` in `1936`. It contains `3` plant species, Iris setosa (`setosa`), Iris versicolor (`versicolor`), and Iris virginica (`virginica`), with `50` samples per class and `150` samples in total. \n",
"\n",
"|Variable|Description|Type|\n",
"|--|--|--|\n",
"|sepal_length|sepal length (cm)|numeric|\n",
"|sepal_width|sepal width (cm)|numeric|\n",
"|petal_length|petal length (cm)|numeric|\n",
"|petal_width|petal width (cm)|numeric|\n",
"|species|species|categorical|"
]
},
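{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sigmoid formula above can be checked numerically. Below is a minimal sketch using `NumPy`; the helper `sigmoid` is defined here for illustration and is not part of `sklearn`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # Maps any real value into the interval (0, 1)\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"print(sigmoid(0))                            # exactly 0.5 at the decision boundary\n",
"print(sigmoid(np.array([-6.0, 0.0, 6.0])))   # approaches 0 and 1 at the extremes"
]
},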
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Load the Dataset and Inspect It"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"print(\"Keys of the iris dataset:\\n\", iris.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(\"Iris feature values:\\n\", iris[\"data\"][1])\n",
"print(\"Iris target values:\\n\", iris.target)\n",
"print(\"Iris feature names:\\n\", iris.feature_names)\n",
"print(\"Iris target names:\\n\", iris.target_names)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Extract the features and targets\n",
"X = iris.data\n",
"y = iris.target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Split the Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1, random_state=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Standardize the Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"transfer = StandardScaler()\n",
"X_train = transfer.fit_transform(X_train)\n",
"X_test = transfer.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Build the Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"estimator = LogisticRegression(penalty='l2',solver='newton-cg',multi_class='multinomial')\n",
"estimator.fit(X_train,Y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5. Evaluate the Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(\"\\nLearned weights:\", estimator.coef_)\n",
"print(\"\\nLogistic Regression training accuracy: %.1f%%\" % (estimator.score(X_train, Y_train)*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6. Predict with the Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn import metrics\n",
"y_predict = estimator.predict(X_test)\n",
"print(\"\\nPredictions:\\n\", y_predict)\n",
"print(\"\\nPredictions vs. ground truth:\\n\", y_predict == Y_test)\n",
"\n",
"# Prediction accuracy\n",
"accuracy = metrics.accuracy_score(Y_test, y_predict)\n",
"print(\"\\nLogistic Regression test accuracy: %.1f%%\" % (accuracy*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 7. Cross-Validation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"import numpy as np\n",
"scores = cross_val_score(estimator, X, y, scoring=None, cv=10)  # cv is the number of folds\n",
"print(\"\\nCross-validation accuracies:\", np.round(scores, 2))  # the score of each fold\n",
"print(\"\\nCross-validation confidence interval: %0.2f%% (+/- %0.2f)\" % (scores.mean()*100, scores.std() * 2))  # mean and two standard deviations\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}