{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas 基础"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在这节课中,我们将会学习到 Pandas 的相关内容。\n",
"\n",
"\n",
"`Pandas` 基于 `NumPy` 创建,并纳入了大量库及一些标准的数据模型, \n",
"提供了大量能使我们快速便捷地处理数据的函数与方法,可以高效的操作大型数据集。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": true
},
"source": [
"## 产生 Pandas 对象"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas` 中有三种基本结构:\n",
"\n",
"- `Series`\n",
" - 1D labeled homogeneously-typed array\n",
"- `DataFrame`\n",
" - General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns\n",
"- `Panel`\n",
" - General 3D labeled, also size-mutable array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一维 `Series` 可以用一维列表初始化:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series([1,3,5,np.nan,6,8])\n",
"print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"默认情况下,`Series` 的下标都是数字(可以使用额外参数指定),类型是统一的。\n",
"\n",
"### DataFrame\n",
"\n",
"`DataFrame` 则是个二维结构,这里首先构造一组时间序列,作为我们第一维的下标:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dates = pd.date_range('20130101', periods=6)\n",
"print(dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"然后创建一个 `DataFrame` 结构:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"默认情况下,如果不指定 `index` 参数和 `columns`,那么他们的值将用从 `0` 开始的数字替代。\n",
"\n",
"除了向 `DataFrame` 中传入二维数组,我们也可以使用字典传入数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2 = pd.DataFrame({'A' : 1.,\n",
" 'B' : pd.Timestamp('20130102'),\n",
" 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),\n",
" 'D' : np.array([3] * 4,dtype='int32'),\n",
" 'E' : pd.Categorical([\"test\",\"train\",\"test\",\"train\"]),\n",
" 'F' : 'foo' })\n",
"\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"字典的每个 `key` 代表一列,其 `value` 可以是各种能够转化为 `Series` 的对象。\n",
"\n",
"与 `Series` 要求所有的类型都一致不同,`DataFrame` 值要求每一列数据的格式相同:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": true
},
"source": [
"## 查看数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 头尾数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`head` 和 `tail` 方法可以分别查看最前面几行和最后面几行的数据(默认为 5):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"最后 3 行:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.tail(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src='http://imgbed.momodel.cn/5cc1a0b8e3067ce9b6abf76f.jpg' width=16px height=16px> **编程练习**\n",
"\n",
"要求:查看df 的前 2 行数据。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head(2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 下标,列标,数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下标使用 `index` 属性查看:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"列标使用 `columns` 属性查看:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据值使用 `values` 查看:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 统计数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"查看简单的统计数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 转置"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 排序"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sort_index(axis=0, ascending=True)` 方法按照下标大小进行排序,`axis=0` 表示按第 0 维进行排序。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.sort_index(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.sort_index(axis=1, ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sort_values(by, axis=0, ascending=True)` 方法按照 `by` 的值的大小进行排序,例如按照 `B` 列的大小:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.sort_values(by=\"B\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src='http://imgbed.momodel.cn/5cc1a0b8e3067ce9b6abf76f.jpg' width=16px height=16px> **编程练习**\n",
"\n",
"要求:将 df 的按 A 列由大到小排序。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.sort_values(by='A', ascending=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": false
},
"source": [
"## 索引"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"虽然 `DataFrame` 支持 `Python/Numpy` 的索引语法,但是推荐使用 `.at, .iat, .loc, .iloc 和 .ix` 方法进行索引。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 读取数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"选择单列数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[\"A\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也可以用 `df.A`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.A"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用切片读取多行:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[0:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`index` 名字也可以进行切片:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[\"20130101\":\"20130103\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用 `label` 索引"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`loc` 可以方便的使用 `label` 进行索引:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.loc[dates[0]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"多列数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.loc[:,['A','B']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"选择多行多列:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.loc['20130102':'20130104',['A','B']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据降维:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.loc['20130102',['A','B']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"得到标量值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.loc[dates[0],'B']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"不过得到标量值可以用 `at`,速度更快:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%timeit -n100 df.loc[dates[0],'B']\n",
"%timeit -n100 df.at[dates[0],'B']\n",
"\n",
"print(df.at[dates[0],'B'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用位置索引"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`iloc` 使用位置进行索引:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iloc[3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"连续切片:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iloc[3:5,0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"索引不连续的部分:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iloc[[1,2,4],[0,2]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"索引整行:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iloc[1:3,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"整列:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iloc[:, 1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"标量值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iloc[1,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"当然,使用 `iat` 索引标量值更快:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%timeit -n100 df.iloc[1,1]\n",
"%timeit -n100 df.iat[1,1]\n",
"\n",
"df.iat[1,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 布尔型索引"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"所有 `A` 列大于 0 的行:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[df.A > 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"只留下所有大于 0 的数值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[df > 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 `isin` 方法做 `filter` 过滤:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2 = df.copy()\n",
"df2['E'] = ['one', 'one','two','three','four','three']\n",
"\n",
"df2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2[df2['E'].isin(['two','four'])]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src='http://imgbed.momodel.cn/5cc1a0b8e3067ce9b6abf76f.jpg' width=16px height=16px> **编程练习**\n",
"\n",
"要求:删选 df A 列大于 -1 的数据,并只展示 B 列的数据。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[df.A > -1].B\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 设定数据的值"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))\n",
"\n",
"s1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"像字典一样,直接指定 `F` 列的值为 `s1`,此时以 `df` 已有的 `index` 为标准将二者进行合并,`s1` 中没有的 `index` 项设为 `NaN`,多余的项舍去:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['F'] = s1\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"或者使用 `at` 或 `iat` 修改单个值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.at[dates[0],'A'] = 0\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.iat[0, 1] = 0\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"设定一整列:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.loc[:,'D'] = np.array([5] * len(df))\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"设定满足条件的数值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2 = df.copy()\n",
"\n",
"df2[df2 > 0] = -df2\n",
"\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 缺失数据"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])\n",
"df1.loc[dates[0]:dates[1],'E'] = 1\n",
"\n",
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"丢弃所有缺失数据的行得到的新数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df1.dropna(how='any')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"填充缺失数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df1.fillna(value=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"检查缺失数据的位置:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pd.isnull(df1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": false
},
"source": [
"## 计算操作"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 统计信息"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"每一列的均值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"每一行的均值:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.mean(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"多个对象之间的操作,如果维度不对,`pandas` 会自动调用 `broadcasting` 机制:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"相减 `df - s`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.sub(s, axis='index')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src='http://imgbed.momodel.cn/5cc1a0b8e3067ce9b6abf76f.jpg' width=16px height=16px> **编程练习**\n",
"\n",
"要求:计算 df B 列的和。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['B'].sum()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### apply 操作"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"与 `R` 中的 `apply` 操作类似,接收一个函数,默认是对将函数作用到每一列上:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.apply(np.cumsum)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"求每列最大最小值之差:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.apply(lambda x: x.max() - x.min())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 直方图"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series(np.random.randint(0, 7, size=10))\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"直方图信息:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"绘制直方图信息:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"h = s.hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 字符串方法"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"当 `Series` 或者 `DataFrame` 的某一列是字符串时,我们可以用 `.str` 对这个字符串数组进行字符串的基本操作: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])\n",
"\n",
"s.str.lower()"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": false
},
"source": [
"## 合并"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 连接"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randn(10, 4))\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以使用 `pd.concat` 函数将多个 `pandas` 对象进行连接:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"pieces = [df[:2], df[4:5], df[7:]]\n",
"\n",
"pd.concat(pieces)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据库中的 Join"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`merge` 可以实现数据库中的 `join` 操作:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})\n",
"right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})\n",
"\n",
"print(left)\n",
"print(right)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pd.merge(left, right, on='key')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### append"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"向 `DataFrame` 中添加行:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"将第三行的值添加到最后:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = df.iloc[3]\n",
"\n",
"df.append(s, ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grouping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',\n",
" 'foo', 'bar', 'foo', 'foo'],\n",
" 'B' : ['one', 'one', 'two', 'three',\n",
" 'two', 'two', 'one', 'three'],\n",
" 'C' : np.random.randn(8),\n",
" 'D' : np.random.randn(8)})\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"按照 `A` 的值进行分类:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby('A').sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"按照 `A, B` 的值进行分类:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby(['A', 'B']).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": false
},
"source": [
"## 改变形状"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stack"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"产生一个多 `index` 的 `DataFrame`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',\n",
" 'foo', 'foo', 'qux', 'qux'],\n",
" ['one', 'two', 'one', 'two',\n",
" 'one', 'two', 'one', 'two']]))\n",
"\n",
"index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])\n",
"df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`stack` 方法将 `columns` 变成一个新的 `index` 部分:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2 = df[:4]\n",
"\n",
"stacked = df2.stack()\n",
"\n",
"stacked"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以使用 `unstack()` 将最后一级 `index` 放回 `column`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"stacked.unstack()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也可以指定其他的级别:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"stacked.unstack(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 时间序列"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"金融分析中常用到时间序列数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')\n",
"ts = pd.Series(np.random.randn(len(rng)), rng)\n",
"\n",
"ts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"标准时间表示:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ts_utc = ts.tz_localize('UTC')\n",
"\n",
"ts_utc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"改变时区表示:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"ts_utc.tz_convert('US/Eastern')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categoricals"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"raw_grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以将 `grade` 变成类别:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[\"grade\"] = df[\"raw_grade\"].astype(\"category\")\n",
"\n",
"df[\"grade\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"将类别的表示转化为有意义的字符:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[\"grade\"].cat.categories = [\"very good\", \"good\", \"very bad\"]\n",
"\n",
"df[\"grade\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"添加缺失的类别:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[\"grade\"] = df[\"grade\"].cat.set_categories([\"very bad\", \"bad\", \"medium\", \"good\", \"very good\"])\n",
"df[\"grade\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 `grade` 分组:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby(\"grade\").size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 绘图"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 `ggplot` 风格:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"plt.style.use('ggplot')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Series` 绘图:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))\n",
"\n",
"p = ts.cumsum().plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame` 按照 `columns` 绘图:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,\n",
" columns=['A', 'B', 'C', 'D'])\n",
"\n",
"df.cumsum().plot()\n",
"p = plt.legend(loc=\"best\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": false
},
"source": [
"## 文件读写"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"写入文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.to_csv('foo.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"从文件中读取:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pd.read_csv('foo.csv').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### excel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"写入文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.to_excel('foo.xlsx', sheet_name='Sheet1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"读取文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"清理生成的临时文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import glob\n",
"import os\n",
"\n",
"for f in glob.glob(\"foo*\"):\n",
" os.remove(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src='http://imgbed.momodel.cn/5cc1a0b8e3067ce9b6abf76f.jpg' width=16px height=16px> **编程练习**\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],\n",
" 'age': [2.5, 5, 0.5, np.nan, 5, 2, 4.5, np.nan, 5, 3],\n",
" 'visits': [1, 4, 2, 3, 2, 3, 1, 2, 2, 1],\n",
" 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}\n",
"\n",
"labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']\n",
"df_test = pd.DataFrame(data, index=labels)\n",
"df_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. 要求: 取出 age 值大于3的行"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test[df_test['age'] > 3]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. 要求: 使用 iloc 方法展示前三行"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test.iloc[:3]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. 要求: 计算 visits 的总和。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 请编写你的答案\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test['visits'].sum()\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}