diff --git a/02.02 决策树(学生版).ipynb b/02.02 决策树(学生版).ipynb index 12fa846..626c938 100644 --- a/02.02 决策树(学生版).ipynb +++ b/02.02 决策树(学生版).ipynb @@ -11,7 +11,46 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "决策树是一种通过**树形结构**进行分类的方法。在决策树中,树形结构中每个节点表示对分类目标在属性上的一个判断,每个分支代表基于该属性做出的一个判断,最后树形结构中每个叶子结点代表一种分类结果。" + "决策树是一种通过**树形结构**进行分类的方法,使用层层推理来实现最终的分类。决策树由下面几种元素构成:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "决策树的组成元素有哪些?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "上面的说法过于抽象,下面来看一个实际的例子。构建一棵结构简单的决策树,用于预测贷款用户是否具有偿还贷款的能力。\n", + "\n", + "贷款用户主要具备三个属性:**是否拥有房产**,**是否结婚**,**平均月收入**。\n", + "\n", + "每一个内部节点都表示一个属性条件判断,叶子节点表示贷款用户是否具有偿还能力。\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "首先判断贷款用户是否拥有房产,如果用户拥有房产,则说明该用户具有偿还贷款的能力;否则需要判断该用户是否结婚,如果已经结婚则具有偿还贷款的能力;否则需要判断该用户的收入大小,如果该用户月收入小于 4K 元,则该用户不具有偿还贷款的能力,否则该用户是具有偿还能力的。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "决策树的流程是什么?\n", + "\n", + "在有一个贷款用户A,其情况是月收入 3K、已经结婚、没有房产,那么他是否具有偿还贷款的能力呢? \n", + "\n", + "上图中我们为啥要用“是否拥有房产”作根节点呢?可不可以用“是否结婚”和“平均月收入”做根节点呢?" 
] }, { @@ -49,14 +88,14 @@ "\n", "根据上表,绘制如图所示的决策树:\n", "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "第一层是天气状况,具有雨、多云和晴三种属性取值。\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "根节点是天气状况,具有雨、多云和晴三种属性取值。\n", "+ 多云: 样本子集是 { 3, 7, 12, 13 } ,仅有“前往游乐场游玩”一个类别,即肯定去游乐场。 \n", " \n", " \n", @@ -82,8 +121,15 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "把数据导入 DataFrame 数据结构:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -95,14 +141,182 @@ "import math\n", "from math import log\n", "import warnings\n", - "warnings.filterwarnings(\"ignore\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "warnings.filterwarnings(\"ignore\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML table of the 14-sample DataFrame; identical to the text/plain output]
" + ], + "text/plain": [ + " 天气 温度 湿度 是否有风 是否前往游乐场\n", + "0 晴 >26 >75 否 0\n", + "1 晴 <=26 >75 是 0\n", + "2 多云 >26 >75 否 1\n", + "3 雨 <=26 >75 否 1\n", + "4 雨 <=26 >75 否 1\n", + "5 雨 <=26 <=75 是 0\n", + "6 多云 <=26 <=75 是 1\n", + "7 晴 <=26 >75 否 0\n", + "8 晴 <=26 <=75 否 1\n", + "9 雨 <=26 >75 否 1\n", + "10 晴 <=26 <=75 是 1\n", + "11 多云 <=26 >75 是 1\n", + "12 多云 >26 <=75 否 1\n", + "13 雨 <=26 >75 是 0" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# 原始数据\n", "datasets = [\n", @@ -148,9 +362,9 @@ "source": [ "## 2.2.2 构建决策树 \n", "\n", - "**信息增益**用来衡量样本集合复杂度(不确定性)所减少的程度。 \n", - "\n", - "**信息熵**用来度量信息量的大小。从信息论的角度来看,对信息的度量等于计算信息不确定性的多少。 " + "**信息增益**是什么?\n", + "\n", + "**信息熵**是什么?" ] }, { @@ -183,16 +397,16 @@ " if ent == 0:\n", " ent = 0\n", " # 返回信息熵,精确到小数点后 4 位\n", - " return round(ent, 4)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "现在用**熵**来构建决策树。数据中 14 个样本分为 “游客来游乐场( 9 个样本)” 和 “游客不来游乐场( 5 个样本)” 两个类别,即 K = 2。\n", + " return round(ent, 4)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "现在用**信息熵**来构建决策树。数据中 14 个样本分为 “游客来游乐场 (9 个样本)” 和 “游客不来游乐场( 5 个样本)” 两个类别,即 K = 2。\n", "\n", "记 “游客来游乐场” 和 “游客不来游乐场” 的概率分别为 $p_1$ 和 $p_2$ ,显然 $p_1=\\frac{9}{14}$,$p_1=\\frac{5}{14}$,则这 14 个样本所蕴含的信息熵:\n", "\n", @@ -203,17 +417,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "我们可以用下面这种方式对 dataframe 的数据按条件进行筛选。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# 例如:按 是否前往游乐场==0 进行筛选\n", - "df[df['是否前往游乐场']=='0']" + "我们可以用下面这种方式对 DataFrame 的数据按条件进行筛选。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 例如:按 是否前往游乐场 == 0 进行筛选\n", + "df[df['是否前往游乐场']=='0']\n" ] }, { @@ -235,7 +449,7 @@ "count_dict = {'前往':df[df['是否前往游乐场']=='1'].shape[0], '不前往':df[df['是否前往游乐场']=='1'].shape[1]}\n", "# 计算信息熵\n", "entropy = 
calc_entropy(total_num, count_dict)\n", - "entropy" + "entropy\n" ] }, { @@ -279,7 +493,7 @@ "outputs": [], "source": [ "# 筛选出 天气为晴并且去游乐场的样本数据\n", - "df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')]" + "df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')]\n" ] }, { @@ -290,10 +504,12 @@ "source": [ "# 天气为晴的总天数\n", "total_num_sun = df[df['天气']=='晴'].shape[0]\n", + "\n", "# 天气为晴时,去游乐场和不去游乐场的人数\n", - "count_dict_sun = {'前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')].shape[0], \n", - " '不前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='0')].shape[0]}\n", + "count_dict_sun = {'前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')].shape[0],\n", + " '不前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='0')].shape[0]}\n", "print(count_dict_sun)\n", + "\n", "# 计算天气-晴 的信息熵\n", "ent_sun = calc_entropy(total_num_sun, count_dict_sun)\n", "print('天气-晴 的信息熵为:%s' % ent_sun)\n" @@ -307,13 +523,15 @@ "source": [ "# 天气为多云的总天数\n", "total_num_cloud = df[df['天气']=='多云'].shape[0]\n", + "\n", "# 天气为多云时,去游乐场和不去游乐场的人数\n", - "count_dict_cloud = {'前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='1')].shape[0], \n", - " '不前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='0')].shape[0]}\n", + "count_dict_cloud = {'前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='1')].shape[0],\n", + " '不前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='0')].shape[0]}\n", "print(count_dict_cloud)\n", + "\n", "# 计算天气-多云 的信息熵\n", "ent_cloud = calc_entropy(total_num_cloud, count_dict_cloud)\n", - "print('天气-多云 的信息熵为:%s' % ent_cloud)" + "print('天气-多云 的信息熵为:%s' % ent_cloud)\n" ] }, { @@ -324,13 +542,15 @@ "source": [ "# 天气为雨的总天数\n", "total_num_rain = df[df['天气']=='雨'].shape[0]\n", + "\n", "# 天气为雨时,去游乐场和不去游乐场的人数\n", - "count_dict_rain = {'前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='1')].shape[0], \n", - " '不前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='0')].shape[0]}\n", + "count_dict_rain = {'前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='1')].shape[0],\n", + " '不前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='0')].shape[0]}\n", "print(count_dict_rain)\n", + "\n", "# 计算天气-雨 的信息熵\n", "ent_rain = 
calc_entropy(total_num_rain, count_dict_rain)\n", - "print('天气-雨 的信息熵为:%s' % ent_rain)" + "print('天气-雨 的信息熵为:%s' % ent_rain)\n" ] }, { @@ -356,17 +576,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "假设有 $K$ 个信息,其组成了集合样本 $D$ ,记第 $k$ 个信息发生的概率为$P_k(1≤k≤K)$。 \n", - "这 $K$ 个信息的信息熵: \n", - "$$E(D)=-\\sum_{k=1}^{K}p_k log_{2} p_k$$\n", - "\n", - "需要指出:**所有 $p_k$ 累加起来的和为1**。" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "使用上面的公式计算信息增益。" ] }, @@ -377,10 +586,10 @@ "outputs": [], "source": [ "# 信息增益\n", - "gain = entropy - (total_num_sun/total_num*ent_sun + \n", - " total_num_cloud/total_num*ent_cloud + \n", + "gain = entropy - (total_num_sun/total_num*ent_sun +\n", + " total_num_cloud/total_num*ent_cloud +\n", " total_num_rain/total_num*ent_rain)\n", - "gain" + "gain\n" ] }, { @@ -478,7 +687,7 @@ "# 查看 label\n", "print(list(iris.target_names))\n", "# 查看 feature\n", - "print(iris.feature_names)" + "print(iris.feature_names)\n" ] }, { @@ -507,7 +716,7 @@ "# 载入数据\n", "X, y = load_iris(return_X_y=True)\n", "# 切分训练集合测试集\n", - "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)" + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n" ] }, { @@ -528,7 +737,7 @@ "# 初始化模型,可以调整 max_depth 来观察模型的表现\n", "clf = tree.DecisionTreeClassifier(random_state=42, max_depth=2)\n", "# 训练模型\n", - "clf = clf.fit(X_train, y_train)" + "clf = clf.fit(X_train, y_train)\n" ] }, { @@ -548,13 +757,13 @@ "feature_names = ['萼片长度','萼片宽度','花瓣长度','花瓣宽度']\n", "target_names = ['山鸢尾', '杂色鸢尾', '维吉尼亚鸢尾']\n", "# 可视化生成的决策树\n", - "dot_data = tree.export_graphviz(clf, out_file=None, \n", - " feature_names=feature_names, \n", - " class_names=target_names, \n", - " filled=True, rounded=True, \n", - " special_characters=True) \n", - "graph = graphviz.Source(dot_data) \n", - "graph " + "dot_data = tree.export_graphviz(clf, out_file=None,\n", + " feature_names=feature_names,\n", + " 
class_names=target_names,\n", + " filled=True, rounded=True,\n", + " special_characters=True)\n", + "graph = graphviz.Source(dot_data)\n", + "graph\n" ] }, { @@ -572,7 +781,7 @@ "source": [ "from sklearn.metrics import accuracy_score\n", "y_test_predict = clf.predict(X_test)\n", - "accuracy_score(y_test,y_test_predict)" + "accuracy_score(y_test,y_test_predict)\n" ] }, { @@ -610,7 +819,7 @@ " # 读取每一行的内容\n", " for line in f.readlines():\n", " contents += line\n", - " return contents" + " return contents\n" ] }, { @@ -679,25 +888,25 @@ " ent = -sum([(p / word_number) * log(p / word_number, 2) for p in\n", " word_counter.values()])\n", " print('信息熵为:%.2f' % ent)\n", - " return ent" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ent = cal_essay_entropy(ch_essay)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ent = cal_essay_entropy(en_essay, split_by = ' ')" + " return ent\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ent = cal_essay_entropy(ch_essay)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ent = cal_essay_entropy(en_essay, split_by = ' ')\n" ] }, { diff --git a/02.02 决策树.ipynb b/02.02 决策树.ipynb index 70bc26e..233db67 100644 --- a/02.02 决策树.ipynb +++ b/02.02 决策树.ipynb @@ -33,7 +33,7 @@ "贷款用户主要具备三个属性:**是否拥有房产**,**是否结婚**,**平均月收入**。\n", "\n", "每一个内部节点都表示一个属性条件判断,叶子节点表示贷款用户是否具有偿还能力。\n", - "\n", + "\n", "\n" ] }, @@ -83,7 +83,7 @@ "\n", "根据上表,绘制如图所示的决策树:\n", "\n", - "" + "" ] }, { @@ -113,6 +113,13 @@ "1. 选择一个属性值;\n", "2. 基于该属性对样本集进行划分;\n", "3. 
重复步骤 1 和 2 直到最后所得划分结果中每个样本为同一类别。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "首先我们读取数据" ] }, { @@ -560,17 +567,6 @@ "\n", "同理可以计算温度高低、湿度大小、风力强弱三个气象特点的信息增益。 \n", "通常情况下,某个分支的信息增益越大,则该分支对样本集划分所获得的“纯度”越大,信息不确定性减少的程度越大。" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "假设有 $K$ 个信息,其组成了集合样本 $D$ ,记第 $k$ 个信息发生的概率为$P_k(1≤k≤K)$。 \n", - "这 $K$ 个信息的信息熵: \n", - "$$E(D)=-\\sum_{k=1}^{K}p_k log_{2} p_k$$\n", - "\n", - "需要指出:**所有 $p_k$ 累加起来的和为1**。" ] }, { diff --git a/02.03 回归分析(学生版).ipynb b/02.03 回归分析(学生版).ipynb index b4ff25b..706c7d3 100644 --- a/02.03 回归分析(学生版).ipynb +++ b/02.03 回归分析(学生版).ipynb @@ -59,8 +59,6 @@ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", - "!mkdir -p ~/.keras/datasets\n", - "!cp ./mnist.npz ~/.keras/datasets/mnist.npz\n", "\n", "x = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005])\n", "y = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67])\n", diff --git a/data_sample.jpg b/data_sample.jpg deleted file mode 100644 index e6c7c79..0000000 Binary files a/data_sample.jpg and /dev/null differ diff --git a/model.jpg b/model.jpg deleted file mode 100644 index b7a935b..0000000 Binary files a/model.jpg and /dev/null differ