Big Data: Pandas

Published 2020-01-10 | Updated 2024-08-23 | Categories: Python, Python tutorials, machine learning

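pandas is a core library for data analysis; it bundles structured data types with methods for cleaning and analysing data.
Built on top of numpy, pandas originally added three data types: Series, DataFrame, and Panel. Panel was deprecated and then removed in pandas 0.25, so current versions provide only Series and DataFrame.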

Series

A Series is an object similar to a one-dimensional array. It has two parts:

values: the data (an ndarray)
index: the index labels associated with the data
Imports:

import numpy as np
import pandas as pd
# import Series
from pandas import Series

Creating a Series: two approaches

(1) From a list or numpy array: the default index is the integers 0 to N-1

nd = np.array([1,2,3,4])
nd
s = Series(nd) # no index given, so it defaults to 0..N-1
s
"""
Result:
0    1
1    2
2    3
3    4
dtype: int32
"""
s = Series([1,2,3,4,5],index=list("abcde"))
s
"""
Result:
a    1
b    2
c    3
d    4
e    5
dtype: int64
"""
s["a"] # result: 1
s = Series([1,2,3,4,5,6], index=["A","A","B","B","A","C"])
s
"""
Result:
A    1
A    2
B    3
B    4
A    5
C    6
dtype: int64
"""
s["A"]
"""
Result:
A    1
A    2
A    5
dtype: int64
"""

(2) From a dictionary

s = Series({"a":1,"b":2,"c":3})
s
"""
Result:
a    1
b    2
c    3
dtype: int64
"""
s = Series({"a":123,"b":456},index=list("ac"))
s
"""
Result:
a    123.0
c      NaN
dtype: float64
"""

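(3) Exercise:

Create the following Series in more than one way and name it s1 (语文 = Chinese, 数学 = Math, 英语 = English; the solution below also adds 理综 = combined science):
语文 150
数学 150
英语 150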

nd = np.array([150,150,150,300])
s1 = Series(nd,index=["语文","数学","英语","理综"])
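s1 # building from an ndarray can be a shallow copy (the Series may share the array's memory, depending on the pandas version); a list is converted into a new array
"""
Result: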
语文    150
数学    150
英语    150
理综    300
dtype: int32
"""
dic = {"语文":150,"数学":150,"英语":150,"理综":300}
s2 = Series(dic)
s2 # building a Series from a dict copies the values into a new array (a deep copy)
"""
Result:
语文    150
数学    150
英语    150
理综    300
dtype: int64
"""
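The comments above on shallow and deep copies can be checked directly. The following is a minimal sketch rather than part of the original exercise; the names `scores`, `values`, `arr`, and `independent` are made up for illustration. Whether a Series built from a numpy array shares memory with that array depends on the pandas version and its copy-on-write settings, so the sketch only prints that case instead of asserting a result.

```python
import numpy as np
import pandas as pd

# Dict input: the values are copied into a new array.
scores = {"a": 1, "b": 2}
from_dict = pd.Series(scores)
scores["a"] = 100            # changing the dict afterwards...
print(from_dict["a"])        # ...leaves the Series untouched

# List input: also converted into a new array.
values = [1, 2, 3]
from_list = pd.Series(values)
values[0] = 100
print(from_list.iloc[0])

# ndarray input: may or may not share memory (version dependent).
arr = np.array([1, 2, 3])
from_array = pd.Series(arr)
print(np.shares_memory(arr, from_array.to_numpy()))

# Passing copy=True requests an independent Series explicitly.
independent = pd.Series(arr, copy=True)
arr[0] = 100
print(independent.iloc[0])
```

When the source array will be modified later, passing copy=True is the safer choice because it does not rely on version-specific defaults.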

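Series indexing and slicing

Square brackets with a single label return one element (a scalar); square brackets containing a list of labels return a Series. Indexing is either explicit (by label) or implicit (by position):

(1) Explicit indexing:

Uses the elements of index as keys
Use .loc[] (recommended)

Note: label slices are closed intervals (both endpoints are included)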

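s = Series({"a":1,"b":2,"c":3}) # recreate the three-element Series used below
s.values # array([1, 2, 3], dtype=int64)
s.index # Index(['a', 'b', 'c'], dtype='object')
# Option 1
s["a"]
# Option 2 (recommended)
s.loc["a"]
# s.loc["a","b"] # invalid form, raises IndexingError: Too many indexers
s.loc[["a","b","c"]] # look up by a list of labels, which extracts a sub-Series of s
"""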
a    1
b    2
c    3
dtype: int64
"""

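(2) Implicit indexing:

Uses integer positions as keys
Use .iloc[] (recommended)

Note: positional slices are half-open intervals (the end position is excluded)

s2.iloc[0] # 150
# s2.iloc[0,1] # invalid form, raises IndexingError: Too many indexers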
s2.iloc[[0,1]]
"""
语文    150
数学    150
dtype: int64
"""
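The difference between explicit and implicit indexing matters most when the labels are themselves integers. The sketch below uses a made-up Series, `s_int`, whose labels start at 1, so a label and a position with the same number point at different elements.

```python
import pandas as pd

# Hypothetical Series whose labels are integers starting at 1.
s_int = pd.Series([10, 20, 30], index=[1, 2, 3])

by_label = s_int.loc[1]            # label 1 is the first element
by_position = s_int.iloc[1]        # position 1 is the second element

label_slice = s_int.loc[1:2]       # closed interval: labels 1 and 2
position_slice = s_int.iloc[1:2]   # half-open interval: position 1 only

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Spelling out .loc or .iloc makes the intent unambiguous, which is why both are recommended over plain square brackets.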

(3) Slicing

# explicit (by label)
s.loc["a":"c"] # closed interval
"""
a    1
b    2
c    3
dtype: int64
"""
# implicit (by position)
s.iloc[0:2] # closed at the start, open at the end
"""
a    1
b    2
dtype: int64
"""
s = Series([1,2,3,4,5,6], index=["A","A","B","C","B","C"])
s
# s.loc["A":"B"] # raises KeyError: "Cannot get right slice bound for non-unique label: 'B'"
# avoid label slicing when the explicit index contains duplicates (it only works if the index is sorted)
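When an index does contain duplicate labels, there are still ways to select a range of labels. The sketch below shows two options under the assumption that the order of rows within one label does not matter; `s_dup`, `mask`, `selected`, and `sorted_slice` are illustrative names.

```python
import pandas as pd

s_dup = pd.Series([1, 2, 3, 4, 5, 6], index=["A", "A", "B", "C", "B", "C"])

# Check for duplicates before relying on label slices.
print(s_dup.index.is_unique)

# Option 1: a boolean mask over the index keeps the original row order.
mask = s_dup.index.isin(["A", "B"])
selected = s_dup[mask]

# Option 2: sort the index first; label slices work on a sorted index
# even when it contains duplicates.
sorted_slice = s_dup.sort_index().loc["A":"B"]

print(selected.tolist())
print(sorted_slice.index.tolist())
```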

(4) Exercise

Index and slice the Series s1 from the previous exercise in more than one way:
Indexing: 数学 150
Slicing: 语文 150 数学 150 英语 150

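# Indexing
s1.iloc[[1]] # by position; plain s1[[1]] relies on positional fallback, which newer pandas versions deprecate
s1.loc[["数学"]]
# Slicing
s1.loc["语文":"英语"]
s1.iloc[0:3]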

Basic concepts of Series

A Series can be thought of as a fixed-length, ordered dictionary
Its attributes are available through shape, size, index, values, and so on

s.shape # (6,)
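s.values.reshape((3,2)) # Series has no reshape() in current pandas; reshape the underlying array if needed. Reshaping is rarely useful because it discards the index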
s.size # 6
s.index # Index(['A', 'A', 'B', 'C', 'B', 'C'], dtype='object')
s.index[[0,1]] # Index(['A', 'A'], dtype='object')
s.index[0:2] # Index(['A', 'A'], dtype='object')
s.index = list("abcdef")
s
"""
a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64
"""
# s.index[0] = "A" # raises TypeError: individual index values cannot be modified in place
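Because an Index cannot be edited element by element, changing a single label means building a new index. One way to do that, sketched below with a made-up Series `s_idx`, is Series.rename with a mapping from old labels to new ones; it returns a new Series and leaves the original alone.

```python
import pandas as pd

s_idx = pd.Series([1, 2, 3], index=list("abc"))

# In-place assignment to one index element is rejected.
try:
    s_idx.index[0] = "A"
except TypeError as exc:
    print("in-place edit rejected:", exc)

# rename() with a dict maps old labels to new labels.
renamed = s_idx.rename({"a": "A"})

print(renamed.index.tolist())
print(s_idx.index.tolist())
```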

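head() and tail() give a quick preview of a Series or DataFrame

# read the data in
data = pd.read_csv("./titanic.txt")
data
data.head() # first 5 rows
data.tail() # last 5 rows
data.head(3) # first 3 rows
data.tail(3) # last 3 rows
# a DataFrame is made up of Series; each column is one Series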
data["name"]

When an index label has no corresponding value, the missing entry is shown as NaN (not a number)

s = Series({"a":123,"b":456,"c":789},index=list("abcdef"))
s
"""
a    123.0
b    456.0
c    789.0
d      NaN
e      NaN
f      NaN
dtype: float64
"""

Missing data can be detected with pd.isnull() and pd.notnull(), or with the Series methods isnull() and notnull()

s.isnull()
"""
a    False
b    False
c    False
d     True
e     True
f     True
dtype: bool
"""

pd.notnull(s)
"""
a     True
b     True
c     True
d    False
e    False
f    False
dtype: bool
"""

ind = s.isnull()
ind
s[ind] # keeps the entries where the mask is True, i.e. all the missing values

s[s.notnull()]

s1 = Series({"a":True,"b":False,"c":True,"d":False,"e":True,"f":True})
s1
"""
a     True
b    False
c     True
d    False
e     True
f     True
dtype: bool
"""
s[s1] # a boolean Series with the same index as s can be used as a mask: the result contains the elements of s whose label maps to True
"""
a    123.0
c    789.0
e      NaN
f      NaN
dtype: float64
"""
s[s.isnull()] = 1000
s # every missing element has now been assigned a value
"""
a     123.0
b     456.0
c     789.0
d    1000.0
e    1000.0
f    1000.0
dtype: float64
"""
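Assigning through a boolean mask changes the Series in place. As an alternative sketch, the fillna() and dropna() methods return new Series and leave the original untouched; `s_missing`, `filled`, and `dropped` are illustrative names.

```python
import pandas as pd

s_missing = pd.Series({"a": 123, "b": 456, "c": 789}, index=list("abcdef"))

# Count the missing entries.
n_missing = int(s_missing.isnull().sum())

# Replace missing values, or drop them, without modifying s_missing.
filled = s_missing.fillna(1000)
dropped = s_missing.dropna()

print(n_missing)
print(filled.tolist())
print(dropped.index.tolist())
```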

Every Series instance has a name attribute

s.name = "Python"
s # name is used as the column header when the Series sits in a DataFrame
"""
a     123.0
b     456.0
c     789.0
d    1000.0
e    1000.0
f    1000.0
Name: Python, dtype: float64
"""
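The link between name and the DataFrame column header can be seen by converting a named Series into a DataFrame. This is a small sketch with made-up data; `s_named` and `other` are hypothetical Series.

```python
import pandas as pd

s_named = pd.Series([123.0, 456.0], index=["a", "b"], name="Python")
other = pd.Series([1, 2], index=["a", "b"], name="Java")

# to_frame() uses the Series name as the column header.
df = s_named.to_frame()

# Concatenating named Series side by side keeps each name as a header.
both = pd.concat([s_named, other], axis=1)

# Selecting a column gives back a Series carrying that name.
col = df["Python"]

print(df.columns.tolist())
print(both.columns.tolist())
print(col.name)
```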

Series operations

(1) Array operations that work on numpy arrays also work on Series
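(2) Operations between Series
- Data with different indexes is aligned automatically during the operation
- Labels present in only one operand produce NaN
- Note: to get a value instead of NaN for those labels, use .add() with fill_value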
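Both points above can be illustrated with a short sketch. The two Series `x` and `y` are made up for the example and overlap only on the labels "b" and "c".

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["b", "c", "d"])

# (1) numpy-style operations apply element by element and keep the index.
shifted = x + 10
roots = np.sqrt(x)

# (2) Arithmetic between Series aligns on labels; labels found in only
# one operand ("a" and "d" here) become NaN.
plain_sum = x + y

# add() with fill_value treats a label missing from one side as 0,
# so every label in the result has a value.
filled_sum = x.add(y, fill_value=0)

print(shifted.tolist())
print(plain_sum)
print(filled_sum)
```

The same pattern applies to the sibling methods sub(), mul(), and div(), which also accept fill_value.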