{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn基础"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<img src=\"http://imgbed.momodel.cn/scikitlearn.png\" width=300 />\n",
    "\n",
    "+ **Python** 语言的机器学习工具\n",
    "+ `Scikit-learn` 包括大量常用的机器学习算法\n",
    "+ `Scikit-learn` 文档完善,容易上手"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 机器学习算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "**机器学习算法是一类从数据中自动分析获得规律,并利用规律对未知数据进行预测的算法**。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<img src=\"http://imgbed.momodel.cn/q2nay75zew.png\" width=800>\n",
    "\n",
    "由图中,可以看到机器学习 `sklearn` 库的算法主要有四类:分类,回归,聚类,降维。其中:\n",
    "\n",
    "+ 常用的回归:线性、决策树、`SVM`、`KNN` ;  \n",
    "    集成回归:随机森林、`Adaboost`、`GradientBoosting`、`Bagging`、`ExtraTrees` \n",
    "+ 常用的分类:线性、决策树、`SVM`、`KNN`、朴素贝叶斯;  \n",
    "    集成分类:随机森林、`Adaboost`、`GradientBoosting`、`Bagging`、`ExtraTrees` \n",
    "+ 常用聚类:`k` 均值(`K-means`)、层次聚类(`Hierarchical clustering`)、`DBSCAN` \n",
    "+ 常用降维:`LinearDiscriminantAnalysis`、`PCA`     \n",
    "\n",
    "这个流程图代表:蓝色圆圈是判断条件,绿色方框是可以选择的算法,我们可以根据自己的数据特征和任务目标去找一条自己的操作路线。  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `sklearn` 数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "+ `sklearn.datasets.load_*()`\n",
    "    + 获取小规模数据集,数据包含在 `datasets` 里\n",
    "+ `sklearn.datasets.fetch_*(data_home=None)`\n",
    "    + 获取大规模数据集,需要从网络上下载,函数的第一个参数是 `data_home`,表示数据集下载的目录,默认是 `/scikit_learn_data/`\n",
    "    \n",
    "`sklearn` 常见的数据集如下:\n",
    "\n",
    "||数据集名称|调用方式|适用算法|数据规模|\n",
    "|--|--|--|--|--|\n",
    "|小数据集|波士顿房价|load_boston()|回归|506\\*13|\n",
    "|小数据集|鸢尾花数据集|load_iris()|分类|150\\*4|\n",
    "|小数据集|糖尿病数据集|\tload_diabetes()|\t回归\t|442\\*10|\n",
    "|大数据集|手写数字数据集|\tload_digits()|\t分类|\t5620\\*64|\n",
    "|大数据集|Olivetti脸部图像数据集|\tfetch_olivetti_facecs|\t降维|\t400\\*64\\*64|\n",
    "|大数据集|新闻分类数据集|\tfetch_20newsgroups()|\t分类|-|\t \n",
    "|大数据集|带标签的人脸数据集|\tfetch_lfw_people()|\t分类、降维|-|\t \n",
    "|大数据集|路透社新闻语料数据集|\tfetch_rcv1()|\t分类|\t804414\\*47236|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "# 获取鸢尾花数据集\n",
    "iris = load_iris()\n",
    "print(\"鸢尾花数据集的返回值:\\n\", iris.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据预处理\n",
    "\n",
    "通过**一些转换函数**将特征数据转换成**更加适合算法模型**的特征数据过程。常见的有数据标准化、数据二值化、标签编码、独热编码等。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 导入内建数据集\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "# 获取鸢尾花数据集\n",
    "iris = load_iris()\n",
    "\n",
    "# 获得 ndarray 格式的变量 X 和标签 y\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "# 获得数据维度\n",
    "n_samples, n_features = iris.data.shape\n",
    "\n",
    "print(n_samples, n_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 数据标准化\n",
    "\n",
    "数据标准化和归一化是将数据映射到一个小的浮点数范围内,以便模型能快速收敛。\n",
    "\n",
    "标准化有多种方式,常用的一种是min-max标准化(对象名为MinMaxScaler),该方法使数据落到[0,1]区间:\n",
    "\n",
    "$x^{'}=\\frac{x-x_{min}}{x_{max} - x_{min}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# min-max标准化\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "sc = MinMaxScaler()\n",
    "sc.fit(X)\n",
    "results = sc.transform(X)\n",
    "print(\"放缩前:\", X[1])\n",
    "print(\"放缩后:\", results[1])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "另一种是Z-score标准化(对象名为StandardScaler),该方法使数据满足标准正态分布:\n",
    "\n",
    "$x^{'}=\\frac{x-\\overline {X}}{S}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Z-score标准化\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "#将fit和transform组合执行\n",
    "results = StandardScaler().fit_transform(X) \n",
    "\n",
    "print(\"放缩前:\", X[1])\n",
    "print(\"放缩后:\", results[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "归一化(对象名为Normalizer,默认为L2归一化):\n",
    "\n",
    "$x^{'}=\\frac{x}{\\sqrt{\\sum_{j}^{m}x_{j}^2}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 归一化\n",
    "from sklearn.preprocessing import Normalizer\n",
    "\n",
    "results = Normalizer().fit_transform(X) \n",
    "\n",
    "print(\"放缩前:\", X[1])\n",
    "print(\"放缩后:\", results[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 数据二值化\n",
    "\n",
    "使用阈值过滤器将数据转化为布尔值,即为二值化。使用Binarizer对象实现数据的二值化:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 二值化,阈值设置为3\n",
    "from sklearn.preprocessing import Binarizer\n",
    "\n",
    "results = Binarizer(threshold=3).fit_transform(X)\n",
    "\n",
    "print(\"处理前:\", X[1])\n",
    "print(\"处理后:\", results[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 标签编码\n",
    "\n",
    "使用 LabelEncoder 将不连续的数值或文本变量转化为有序的数值型变量:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 标签编码\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "LabelEncoder().fit_transform(['apple', 'pear', 'orange', 'banana'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 独热编码\n",
    "\n",
    "对于无序的离散型特征,其数值大小并没有意义,需要对其进行one-hot编码,将其特征的m个可能值转化为m个二值化特征。可以利用OneHotEncoder对象实现:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 独热编码\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "results = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()\n",
    "\n",
    "print(\"处理前:\", y)\n",
    "print(\"处理后:\", results[1])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据集的划分\n",
    "\n",
    "机器学习一般的数据集会划分为两个部分:\n",
    "+ 训练数据:用于训练,构建模型\n",
    "+ 测试数据:在模型检验时使用,用于评估模型是否有效\n",
    "\n",
    "<br>\n",
    "\n",
    "划分比例:\n",
    "+ 训练集:70% 80% 75%\n",
    "+ 测试集:30% 20% 25%\n",
    "\n",
    "<br>\n",
    "\n",
    "`sklearn.model_selection.train_test_split(x, y, test_size, random_state )`\n",
    "\n",
    "   +  `x`:数据集的特征值\n",
    "   +  `y`: 数据集的标签值\n",
    "   +  `test_size`: 如果是浮点数,表示测试集样本占比;如果是整数,表示测试集样本的数量。\n",
    "   +  `random_state`: 随机数种子,不同的种子会造成不同的随机采样结果。相同的种子采样结果相同。\n",
    "   +  `return` 训练集的特征值 `x_train` 测试集的特征值 `x_test` 训练集的目标值 `y_train` 测试集的目标值 `y_test`。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# 加载数据集\n",
    "iris = load_iris()\n",
    "\n",
    "# 对数据集进行分割\n",
    "# 训练集的特征值x_train 测试集的特征值x_test 训练集的目标值y_train 测试集的目标值y_test\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,test_size=0.3, random_state=22)\n",
    "\n",
    "print(\"x_train:\", X_train.shape)\n",
    "print(\"y_train:\", y_train.shape) \n",
    "print(\"x_test:\", X_test.shape)\n",
    "print(\"y_test:\", y_test.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 定义模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### 估计器(`Estimator`)\n",
    "估计器,很多时候可以直接理解成分类器,主要包含两个函数:\n",
    "\n",
    "+ `fit()`:训练算法,设置内部参数。接收训练集和类别两个参数。\n",
    "+ `predict()`:预测测试集类别,参数为测试集。\n",
    "\n",
    "大多数 `scikit-learn` 估计器接收和输出的数据格式均为 `NumPy`数组或类似格式。\n",
    "\n",
    "<br>\n",
    "\n",
    "#### 转换器(`Transformer`)  \n",
    "转换器用于数据预处理和数据转换,主要是三个方法:\n",
    "\n",
    "+ `fit()`:训练算法,设置内部参数。\n",
    "+ `transform()`:数据转换。\n",
    "+ `fit_transform()`:合并 `fit` 和 `transform` 两个方法。\n",
    "\n",
    "<br>"
   ]
  },
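  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transformer and estimator interfaces compose naturally. As a minimal sketch (`Pipeline` is standard `sklearn`, but the step names and the particular scaler and classifier here are our own illustration), a transformer and an estimator can be chained so that one call runs both stages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), test_size=0.3, random_state=22)\n",
    "\n",
    "# Chain a transformer (StandardScaler) with an estimator (DecisionTreeClassifier)\n",
    "pipe = Pipeline([('scaler', StandardScaler()), ('tree', DecisionTreeClassifier())])\n",
    "\n",
    "# fit() calls fit_transform on the scaler, then fit on the tree;\n",
    "# score() calls transform on the scaler, then score on the tree\n",
    "pipe.fit(X_train, y_train)\n",
    "print(pipe.score(X_test, y_test))"
   ]
  },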
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在 `scikit-learn` 中,所有模型都有同样的接口供调用。监督学习模型都具有以下的方法:\n",
    "+ `fit`:对数据进行拟合。\n",
    "+ `set_params`:设定模型参数。\n",
    "+ `get_params`:返回模型参数。\n",
    "+ `predict`:在指定的数据集上预测。\n",
    "+ `score`:返回预测器的得分。\n",
    "\n",
    "鸢尾花数据集是一个分类任务,故以决策树模型为例,采用默认参数拟合模型,并对验证集预测。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 决策树分类器\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# 定义模型\n",
    "model = DecisionTreeClassifier()\n",
    "\n",
    "# 训练模型\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# 在测试集上预测\n",
    "model.predict(X_test)\n",
    "\n",
    "# 测试集上的得分(默认为准确率)\n",
    "model.score(X_test, y_test)\n"
   ]
  },
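  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `get_params` and `set_params` methods from the common interface can be exercised on the same model. A brief sketch (assuming `model`, `X_train`, and `y_train` from the cell above; `max_depth` is one of `DecisionTreeClassifier`'s hyperparameters):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect a current hyperparameter (max_depth defaults to None)\n",
    "print(model.get_params()['max_depth'])\n",
    "\n",
    "# Change the hyperparameter and refit\n",
    "model.set_params(max_depth=3)\n",
    "model.fit(X_train, y_train)\n",
    "print(model.score(X_test, y_test))"
   ]
  },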
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`scikit-learn` 中所有模型的调用方式都类似。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模型评估\n",
    "\n",
    "评估模型的常用方法为 `K` 折交叉验证,它将数据集划分为 `K` 个大小相近的子集(`K` 通常取 `10`),每次选择其中(`K-1`)个子集的并集做为训练集,余下的做为测试集,总共得到 `K` 组训练集&测试集,最终返回这 `K` 次测试结果的得分,取其均值可作为选定最终模型的指标。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 交叉验证\n",
    "from sklearn.model_selection import cross_val_score\n",
    "cross_val_score(model, X, y, scoring=None, cv=10)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "注意:由于之前采用了 `train_test_split` 分割数据集,它默认对数据进行了洗牌,所以这里可以直接使用 `cv=10` 来进行 `10` 折交叉验证(`cross_val_score` 不会对数据进行洗牌)。如果之前未对数据进行洗牌,则要搭配使用 `KFold` 模块:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold\n",
    "n_folds = 10\n",
    "kf = KFold(n_folds, shuffle=True).get_n_splits(X)\n",
    "cross_val_score(model, X, y, scoring=None, cv = kf)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 保存与加载模型\n",
    "\n",
    "在训练模型后可将模型保存,以免下次重复训练。保存与加载模型使用 `sklearn` 的 `joblib`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.externals import joblib\n",
    "\n",
    "# 保存模型\n",
    "joblib.dump(model,'myModel.pkl')\n",
    "\n",
    "# 加载模型\n",
    "model=joblib.load('myModel.pkl')\n",
    "print(model)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面我们用一个小例子来展示如何使用 `sklearn` 工具包快速完成一个机器学习项目。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 采用逻辑回归模型实现鸢尾花分类\n",
    "\n",
    "\n",
    "**线性回归**\n",
    "\n",
    "在介绍逻辑回归之前先介绍一下线性回归,线性回归的主要思想是通过历史数据拟合出一条直线,因变量与自变量是线性关系,对新的数据用这条直线进行预测。 线性回归的公式如下:\n",
    "\n",
    "$y = w_{0}+w_{1}x_{1}+...+w_{n}x_{n}=w^{T}x+b$\n",
    "\n",
    "**逻辑回归**\n",
    "\n",
    "逻辑回归是一种广义的线性回归分析模型,是一种预测分析。虽然它名字里带回归,但实际上是一种分类学习方法。它不是仅预测出“类别”, 而是可以得到近似概率预测,这对于许多需要利用概率辅助决策的任务很有用。普遍应用于预测一个实例是否属于一个特定类别的概率,比如一封 `email` 是垃圾邮件的概率是多少。 因变量可以是二分类的,也可以是多分类的。因为结果是概率的,除了分类外还可以做 `ranking model`。逻辑的应用场景很多,如点击率预测(`CTR`)、天气预测、一些电商的购物搭配推荐、一些电商的搜索排序基线等。\n",
    "\n",
    "`sigmoid` **函数**\n",
    "\n",
    "`Sigmoid` 函数,呈现S型曲线,它将值转化为一个接近 `0` 或 `1` 的 `y` 值。  \n",
    "$y = g(z)=\\frac{1}{1+e^{-z}}$   其中:$z = w^{T}x+b$ \n",
    "\n",
    "\n",
    "**鸢尾花数据集**\n",
    "\n",
    "<center><img src=\"http://imgbed.momodel.cn//20200324144418.png\" width=700></center>\n",
    "\n",
    "`sklearn.datasets.load_iris()`:加载并返回鸢尾花数据集\n",
    "\n",
    "`Iris` 鸢尾花卉数据集,是常用的分类实验数据集,由 `R.A. Fisher` 于 `1936` 年收集整理的。其中包含 `3` 种植物种类,分别是山鸢尾(`setosa`)变色鸢尾(`versicolor`)和维吉尼亚鸢尾(`virginica`),每类 `50` 个样本,共 `150` 个样本。  \n",
    "\n",
    "|变量名|\t变量解释|\t数据类型|\n",
    "|--|--|--|\n",
    "|sepal_length|\t花萼长度(单位cm)|\tnumeric|\n",
    "|sepal_width|\t花萼宽度(单位cm)|\tnumeric|\n",
    "|petal_length\t|花瓣长度(单位cm)|\tnumeric|\n",
    "|petal_width|\t花瓣宽度(单位cm)|\tnumeric|\n",
    "|species\t|种类\t|categorical|"
   ]
  },
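  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `sigmoid` formula above can be checked numerically. A minimal sketch with `NumPy` (the sample `z` values are arbitrary):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def sigmoid(z):\n",
    "    # y = 1 / (1 + e^(-z))\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "# Large negative z maps near 0, z = 0 maps to 0.5, large positive z maps near 1\n",
    "print(sigmoid(np.array([-10.0, 0.0, 10.0])))"
   ]
  },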
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.获取数据集及其信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "# 获取鸢尾花数据集\n",
    "iris = load_iris()\n",
    "print(\"鸢尾花数据集的返回值:\\n\", iris.keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "print(\"鸢尾花的特征值:\\n\", iris[\"data\"][1])\n",
    "print(\"鸢尾花的目标值:\\n\", iris.target)\n",
    "print(\"鸢尾花特征的名字:\\n\", iris.feature_names)\n",
    "print(\"鸢尾花目标值的名字:\\n\", iris.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# 取出特征值\n",
    "X = iris.data\n",
    "y = iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.数据划分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1, random_state=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.数据标准化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "transfer  = StandardScaler()\n",
    "X_train = transfer.fit_transform(X_train)\n",
    "X_test = transfer.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.模型构建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "estimator  = LogisticRegression(penalty='l2',solver='newton-cg',multi_class='multinomial')\n",
    "estimator.fit(X_train,Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.模型评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "print(\"\\n得出来的权重:\", estimator.coef_)\n",
    "print(\"\\nLogistic Regression模型训练集的准确率:%.1f%%\" %(estimator.score(X_train, Y_train)*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6. 模型预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "y_predict = estimator.predict(X_test)\n",
    "print(\"\\n预测结果为:\\n\", y_predict)\n",
    "print(\"\\n比对真实值和预测值:\\n\", y_predict == Y_test)\n",
    "\n",
    "# 预测的准确率\n",
    "accuracy = metrics.accuracy_score(Y_test, y_predict)\n",
    "print(\"\\nLogistic Regression 模型测试集的正确率:%.1f%%\" %(accuracy*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.交叉验证"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "import numpy as np\n",
    "scores = cross_val_score(estimator, X, y, scoring=None, cv=10)  #cv为迭代次数。\n",
    "print(\"\\n交叉验证的准确率:\",np.round(scores,2))  # 打印输出每次迭代的度量值(准确度)\n",
    "print(\"\\n交叉验证结果的置信区间: %0.2f%%(+/- %0.2f)\" % (scores.mean()*100, scores.std() * 2))  # 获取置信区间。(也就是均值和方差)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}