Python Data Analyst 2020 Bootcamp: Pandas DataFrame 2

智汇君 2025-01-17

DataFrame

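Data selection

import pandas as pd

order = pd.read_excel('xxx.xlsx')
order.head(5)   # first 5 rows
order.tail(5)   # last 5 rows
order.columns   # column labels
order.dtypes    # dtype of each column
order.ndim      # number of dimensions: 2
order.shape     # (2779, 19)
order.size      # 52801 (rows * columns)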

loc and iloc

loc[A,B]
iloc[A,B]

A selects rows and B selects columns.

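data = [[1, 'Joe', 70000, 3], [2, 'Henry', 80000, 4], [3, 'Sam', 60000, None], [4, 'Max', 90000, None]]

employee = pd.DataFrame(data, columns=['id', 'name', 'salary', 'managerId']).astype({'id':'Int64', 'name':'object', 'salary':'Int64',  'managerId':'Int64'})
# No row labels were set, so they default to 0, 1, ...; both loc and iloc can be used

print(employee.loc[1:3])
# loc selects by row label, here 0, 1, ... (both ends inclusive). iloc selects by integer position (start inclusive, end exclusive)

If the DataFrame is built with index=['a','b','c','d']:
# loc accepts only the row labels 'a', 'b', ... (both ends inclusive); iloc accepts only integer positions (start inclusive, end exclusive)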
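
The difference between label slicing and position slicing is easiest to see side by side. The following is a minimal sketch on made-up data (the table and its values are assumptions for illustration, not the course dataset):

```python
import pandas as pd

# Hypothetical toy table with string row labels
employee = pd.DataFrame(
    {"name": ["Joe", "Henry", "Sam", "Max"],
     "salary": [70000, 80000, 60000, 90000]},
    index=["a", "b", "c", "d"],
)

by_label = employee.loc["b":"c"]   # label slice: both ends included
by_position = employee.iloc[1:3]   # position slice: end excluded
print(by_label.equals(by_position))

# With the default 0, 1, ... labels the same numbers give different row counts
default = employee.reset_index(drop=True)
print(len(default.loc[1:3]), len(default.iloc[1:3]))
```

With the default integer labels, `loc[1:3]` returns three rows while `iloc[1:3]` returns two, which is the usual source of off-by-one surprises.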

Filtering rows by condition

order.loc[order['order_id']==458,['order_id','dishes_name','logicprn_name']]

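Conditional queries

Single condition

Condition first, then columns:
order[order['order_id']==458][['order_id','dishes_name']]
Equivalent to columns first, then condition:
order[['order_id','dishes_name']][order['order_id']==458]
Also equivalent to:
order.loc[order['order_id']==458,['order_id','dishes_name']]

iloc does not work here: it does not accept a boolean Series as a row mask.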
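
As a quick check that the three forms return the same rows, here is a small sketch on a hypothetical stand-in for the order table (the values are invented). It also shows one way to make iloc work, by passing a plain NumPy boolean array instead of a Series:

```python
import pandas as pd

# Hypothetical stand-in for the order table
order = pd.DataFrame({
    "order_id": [458, 458, 459],
    "dishes_name": ["A", "B", "C"],
    "amounts": [10, 45, 27],
})
mask = order["order_id"] == 458
cols = ["order_id", "dishes_name"]

a = order[mask][cols]          # condition first, then columns
b = order[cols][mask]          # columns first, then condition
c = order.loc[mask, cols]      # both in one loc call
d = order.iloc[mask.to_numpy(), [0, 1]]  # iloc needs an array mask and column positions

print(a.equals(b), a.equals(c), a.equals(d))
```

The single `loc` call is generally the safest of these when the result will later be assigned to, because it avoids chained indexing.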

Multiple conditions: & | ~

Each condition must be wrapped in its own parentheses.
order[['order_id','dishes_name']][(order['order_id']==458)&(order['amounts']>3)]

order[['order_id','dishes_name']][(order['order_id']==458)|(order['amounts']>3)]

order[['order_id','dishes_name']][~(order['order_id']==458)]
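
The parentheses matter because `&` and `|` bind more tightly than `==` and `>` in Python. One way to sidestep the issue is to name each mask first. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical stand-in for the order table
order = pd.DataFrame({
    "order_id": [458, 458, 459, 460],
    "amounts": [2, 45, 27, 1],
})
is_458 = order["order_id"] == 458
over_3 = order["amounts"] > 3

both = order[is_458 & over_3]     # AND: order 458 with amounts above 3
either = order[is_458 | over_3]   # OR: order 458, or amounts above 3
not_458 = order[~is_458]          # NOT: every other order

print(both.index.tolist(), either.index.tolist(), not_458.index.tolist())
```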

.between

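order['amounts'].between(10,30,inclusive='both') # pandas 1.3+ takes 'both' here; older versions used inclusive=True
# returns a boolean Series of True/False values

order[['order_id','dishes_name','amounts']][order['amounts'].between(10,30,inclusive='both')] # both ends inclusive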
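
To make the boundary behavior concrete, the sketch below compares `between` with the equivalent pair of comparisons on a small invented Series. It assumes pandas 1.3 or later, where `inclusive` takes a string:

```python
import pandas as pd

amounts = pd.Series([5, 10, 20, 30, 35])  # hypothetical values

in_range = amounts.between(10, 30, inclusive="both")
manual = (amounts >= 10) & (amounts <= 30)       # same test written out by hand
open_range = amounts.between(10, 30, inclusive="neither")

print(in_range.tolist())
print(open_range.tolist())
```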

.isin

order['dishes_name'].isin(['内蒙古烤羊腿','xxx'])

order[['dishes_id','dishes_name']][order['dishes_name'].isin(['内蒙古烤羊腿','xxx'])]

Strings: .str

String methods must be called through the .str accessor.

order['dishes_name'].str.contains('烤')

order[['dishes_id','dishes_name']][order['dishes_name'].str.contains('烤')]
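
One detail worth knowing when a text column has missing values: `str.contains` returns a missing value for those rows by default, and such a result cannot be used directly as a filter. Passing `na=False` treats them as non-matches. A small sketch with invented dish names:

```python
import pandas as pd

# Hypothetical dish names, one of them missing
dishes = pd.Series(["roast lamb", "rice", None, "roast duck"])

has_roast = dishes.str.contains("roast", na=False)   # missing values count as no match
in_list = dishes.isin(["rice", "roast duck"])        # exact membership test

print(dishes[has_roast].tolist())
print(dishes[in_list].tolist())
```

`isin` matches whole values, while `str.contains` matches substrings (and treats the pattern as a regular expression unless `regex=False` is passed).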

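Add, delete, update, query

Add

Add a column
order['payment']=order['amounts']*order['counts']

order['payway'] = '现金支付' # the string means "cash payment"; a scalar fills every row

Insert a column at a specific position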
mid = order['emp_id']
order.drop(['emp_id'],axis=1,inplace=True)
order.insert(0,'emp_id',mid)
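
The same move can be written in one line with `pop`, which removes a column and returns it. This is an alternative sketch on an invented table, not the course's version:

```python
import pandas as pd

# Hypothetical stand-in for the order table
order = pd.DataFrame({"order_id": [1, 2], "amounts": [10, 20], "emp_id": [7, 8]})

# pop() runs first and removes the column, then insert() puts it back at position 0
order.insert(0, "emp_id", order.pop("emp_id"))

print(list(order.columns))
```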

Delete

order.drop(labels=['payway','payment'],axis=1) # inplace=False by default, so a new DataFrame is returned

del order['xxx'] modifies the original data and can only delete one column at a time

Update

Change a column's value in the rows that match a condition

order.loc[order['order_id']==458,'order_id']=45800

rename

Rename a column
order.rename(columns={'amounts':'金额'},inplace=True)

Rename a row label
order.rename(index={0:'0900'},inplace=True)

Query

describe()

order.describe() returns summary statistics for each column

order.describe().loc['count']==0 # a selected row still prints vertically, because it is a Series
This returns a boolean Series:

c1 True
c2 False
c3 False
...

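Data cleaning

Reference

In data analysis, data mining, machine learning, and model development workflows, data cleaning takes up most of the time.

[Reference: 2020 Python Data Analyst Bootcamp, complete course, 84 lessons]

Data cleaning: handling duplicates

Deletion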

duplicated()

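Duplicates are usually handled by deleting them.

However, some duplicates must not be deleted, such as order line items or transaction details, because those records are meaningful to the business.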

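import os
import numpy as np
import pandas as pd

os.chdir('xxx')
df = pd.read_excel('xxx.xlsx',sheet_name='xx')  # or read_csv('')
df.duplicated() # returns a boolean Series, e.g. 0 True
np.sum(df.duplicated()) # number of duplicated rows (the occurrence that is kept, first or last, is not counted)

df.duplicated(subset=['col1','col3'],keep='first') # subset defaults to all columns; keep is 'first' (default), 'last', or False (mark every occurrence)

drop_duplicates()

df.drop_duplicates(subset=['comment','install']) # keep='first' by default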
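
The effect of the three `keep` settings is easier to see on a tiny table. The sketch below uses invented data in which rows 0, 1, and 3 share the same `col1` and `col3` values:

```python
import pandas as pd

# Hypothetical table: rows 0, 1 and 3 are duplicates on (col1, col3)
df = pd.DataFrame({
    "col1": ["a", "a", "b", "a"],
    "col3": [1, 1, 2, 1],
    "other": [10, 11, 12, 13],
})
key = ["col1", "col3"]

first = df.duplicated(subset=key, keep="first")  # first occurrence is not flagged
last = df.duplicated(subset=key, keep="last")    # last occurrence is not flagged
every = df.duplicated(subset=key, keep=False)    # every occurrence is flagged
deduped = df.drop_duplicates(subset=key)         # keeps the first occurrence

print(first.tolist(), last.tolist(), every.tolist())
print(deduped.index.tolist())
```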

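Data cleaning: handling missing values

The inplace parameter: with inplace=True the original object is modified; otherwise a new object is returned.

Deletion

In data analysis you rarely delete a column; usually you delete rows.

isnull

df=pd.read_excel('xxx',sheet_name='xx')
np.sum(df.isnull(),axis=1) # axis=1 counts missing values per row; axis=0 (the pandas default) aggregates down the rows, giving one count per column

df.apply(lambda x: sum(x.isnull())/len(x),axis=0) # fraction of missing values in each column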

np.sum(df.isnull(),axis=1) 

0   0
1   1
2   1
3   0
...

np.sum(df.isnull(),axis=0) 

col1  0
col2  2
col3  1
...
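
The same counts can be obtained with the DataFrame's own `sum` and `mean` methods, which avoids the `np.sum` and `apply` calls. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few missing values
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [np.nan, 5, np.nan, 8],
    "col3": [1, np.nan, 3, 4],
})

per_column = df.isnull().sum()        # axis=0: one count per column
per_row = df.isnull().sum(axis=1)     # axis=1: one count per row
ratio = df.isnull().mean()            # share of missing values per column

print(per_column.to_dict())
print(per_row.tolist())
print(ratio.to_dict())
```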

dropna

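df.dropna() # drop every row that contains a missing value
df.dropna(how='any',axis=1) # drop every column that contains a missing value

df.dropna(how='all',axis=1) # drop a column only if all of its values are missing

df.dropna(subset=['c1','c2'],how='any') # drop rows with a missing value in c1 or c2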
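
The sketch below runs these `dropna` variants on one small invented table so the results can be compared directly:

```python
import numpy as np
import pandas as pd

# Hypothetical table: c3 is entirely missing
df = pd.DataFrame({
    "c1": [1, np.nan, 3],
    "c2": [np.nan, np.nan, 6],
    "c3": [np.nan, np.nan, np.nan],
})

rows_any = df.dropna()                                # every row has a gap, so nothing is left
cols_any = df.dropna(how="any", axis=1)               # every column has a gap, so nothing is left
cols_all = df.dropna(how="all", axis=1)               # only the fully empty c3 is dropped
by_subset = df.dropna(subset=["c1", "c2"], how="any") # only c1 and c2 are checked

print(len(rows_any), list(cols_any.columns), list(cols_all.columns))
print(by_subset.index.tolist())
```

On data like this, a plain `dropna()` discards everything, which is why checking only the columns that matter with `subset` is often the more practical choice.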

drop

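Delete specific columns or rows:
df.drop(labels=['c1','c2'],axis=1)

Filling

df.fillna(20) # fill every missing value with 20
df.age.fillna(20) # fill only the missing values in the age column with 20

Mean

df.age.fillna(df.age.mean())

Filling with the mean is sensitive to outliers that are very large or very small. In that case the median is usually used instead.

Median

df.age.fillna(df.age.median())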

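Mode

For missing values in categorical or string features such as gender, the mode is usually used.

df.gender.fillna(df.gender.mode()[0]) # there may be several modes; the first one is usually taken

Setting a separate fill value for each column

df.fillna(value={'gender':df.gender.mode()[0],'age':df.age.mean(),'income':df.income.median()})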
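
Here is a small worked sketch of the per-column fill on invented data. The income column contains one very large value, which shows why the median is the safer choice there:

```python
import numpy as np
import pandas as pd

# Hypothetical table; 10000 in income acts as an outlier
df = pd.DataFrame({
    "gender": ["F", "M", None, "F"],
    "age": [20.0, np.nan, 40.0, 30.0],
    "income": [100.0, 200.0, np.nan, 10000.0],
})

filled = df.fillna(value={
    "gender": df["gender"].mode()[0],   # most frequent category
    "age": df["age"].mean(),            # mean of the observed ages
    "income": df["income"].median(),    # median is not pulled up by the outlier
})

print(filled)
```

The mean of the observed incomes would be over 3000 here, while the median is 200.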

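Forward fill

df.ffill() # same result as df.fillna(method='ffill'), which recent pandas versions deprecate

If the first row has a missing value, it stays missing, because there is no earlier value to copy.

Backward fill

df.bfill() # same result as df.fillna(method='bfill'), which recent pandas versions deprecate

If the last row has a missing value, it stays missing, because there is no later value to copy.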
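
The edge behavior described above can be checked on a short invented Series that starts and ends with a missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])  # hypothetical values

forward = s.ffill()    # copies the previous value downward
backward = s.bfill()   # copies the next value upward

print(forward.tolist())
print(backward.tolist())
```

The leading gap survives the forward fill and the trailing gap survives the backward fill, so chaining both (`s.ffill().bfill()`) is a common way to close every gap.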

Interpolation (rarely used)

df.age.interpolate() # linear interpolation by default

df.age.interpolate(method='polynomial',order=2) # polynomial interpolation requires SciPy
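
For intuition, the default linear method fills a gap with evenly spaced values between its two neighbors. A minimal sketch on an invented Series:

```python
import numpy as np
import pandas as pd

age = pd.Series([10.0, np.nan, np.nan, 40.0])  # hypothetical values

linear = age.interpolate()  # default method='linear'
print(linear.tolist())
```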

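Data cleaning: handling outliers

Outliers are values that deviate from the normal range; they are not erroneous values.
Outliers occur infrequently, but they can still bias the analysis in a real project.
Outliers are usually handled by capping or by discretizing the data.

Detecting outliers

Mean
Standard deviation

Mean ± 2, 2.5, or 3 standard deviations

Box plot

Upper quartile (Q3): the value at the 75% position of the sorted data
Lower quartile (Q1): the value at the 25th percentile of the data sorted in ascending order
Median
Upper bound and lower bound

Capping

Mean and standard deviation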

import os
import pandas as pd

os.chdir('xxx')
sunspots = pd.read_csv('xxx.csv',sep=',')

    year counts
0   1999   2
1   2000   5
...

xbar = sunspots.counts.mean()
xstd = sunspots.counts.std()
sunspots.counts > xbar+2*xstd # upper bound: 170
sunspots.counts < xbar-2*xstd # lower bound: -30

With a lot of data, you can check whether any value is True this way:
any(sunspots.counts > xbar+2*xstd)
>>True
any(sunspots.counts < xbar-2*xstd)
>>False

sunspots.counts.plot()
sunspots.counts.plot(kind='hist') # distribution plot
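
The mean and standard deviation rule can be tried without the course file. The sketch below uses an invented Series with one obviously large value; the numbers are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical counts with one extreme value
counts = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])

xbar = counts.mean()
xstd = counts.std()               # sample standard deviation (ddof=1)
upper = xbar + 2 * xstd
lower = xbar - 2 * xstd

too_high = counts > upper
too_low = counts < lower

print(too_high.any(), too_low.any())
print(counts[too_high].tolist())
```

Note that the extreme value inflates both the mean and the standard deviation it is judged against, which is one reason the quartile-based rule is often preferred.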

Box plot method

Q1=sunspots.counts.quantile(q=0.25) # lower quartile

Q3=sunspots.counts.quantile(q=0.75) # upper quartile
IQR = Q3-Q1 # interquartile range

any(sunspots.counts>Q3+1.5*IQR)
>>True
any(sunspots.counts<Q1-1.5*IQR)
>>False

sunspots.counts.plot(kind='box')
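
A worked sketch of the quartile rule on invented data, where the quartiles are easy to verify by hand:

```python
import pandas as pd

counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50])  # hypothetical values

Q1 = counts.quantile(0.25)
Q3 = counts.quantile(0.75)
IQR = Q3 - Q1

upper = Q3 + 1.5 * IQR   # upper bound
lower = Q1 - 1.5 * IQR   # lower bound
outliers = counts[(counts > upper) | (counts < lower)]

print(Q1, Q3, IQR, upper, lower)
print(outliers.tolist())
```

With nine sorted values, the quartiles fall exactly on the third and seventh values, so the bounds are 13 and -3 and only 50 lies outside them.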

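Handling outliers

Outliers are usually replaced; they can also be deleted outright or set to empty.

Replacement

sunspots.counts.describe() # quick summary of the counts column: 25%, 50%, 75%, max, min, std, mean

UL = Q3+1.5*IQR # upper bound
replace_value = sunspots.counts[sunspots.counts<UL].max() # largest value below the upper bound
sunspots.loc[sunspots.counts>UL,'counts'] = replace_value # values above the upper bound become the replacement value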

Quantiles

P1 = sunspots.counts.quantile(0.01)
P99 = sunspots.counts.quantile(0.99)
sunspots['counts_new']=sunspots['counts']
sunspots.loc[sunspots['counts']>P99,'counts_new']=P99
sunspots.loc[sunspots['counts']<P1,'counts_new']=P1
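
The two `loc` assignments amount to clipping the column to the range between the 1st and 99th percentiles, and `Series.clip` does this in one call. A sketch on an invented Series (not the sunspots file):

```python
import pandas as pd

# Hypothetical data: 1..100 plus one extreme value
counts = pd.Series(list(range(1, 101)) + [1000])

P1 = counts.quantile(0.01)
P99 = counts.quantile(0.99)

# Values below P1 become P1, values above P99 become P99
capped = counts.clip(lower=P1, upper=P99)

print(P1, P99)
print(capped.min(), capped.max())
```

`clip` returns a new Series, so the original column is left untouched, much like writing to a separate `counts_new` column.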

Deletion

df = df.drop(index=df[df['age'] > 60].index,axis=0)

Set to empty

# df['age'][df['age'] > 60] = '' # triggers SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
df.loc[df['age'] > 60,'age'] = ''

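Discretization

Discretizing data means binning it. The common binning methods are equal-frequency (each bin holds the same number of samples) and equal-width (each bin spans an interval of the same length).

sunspots=pd.read_csv('xxx.csv')
sunspots['counts_bin'] = pd.cut(sunspots['counts'],4,labels=range(1,5))

Equal-width binning

sunspots['counts_bin']=pd.cut(sunspots['counts'],4,labels=range(1,5))

# Without labels, the values printed below would be the intervals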

sunspots['counts_bin'].value_counts()

1 171
2 78
3 33
4 7

sunspots['counts_bin'].hist() # distribution plot

Equal-width binning is sensitive to outliers, so some bins end up with very few values and others with many.
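
The outlier effect shows up even on six invented values: one large number stretches the range, so almost everything lands in the first bin.

```python
import pandas as pd

counts = pd.Series([0, 5, 10, 20, 40, 190])  # hypothetical values, 190 is the outlier

# Four bins of equal width across the range 0..190 (47.5 each)
bins = pd.cut(counts, 4, labels=[1, 2, 3, 4])

print(bins.tolist())
```

Five of the six values fall in bin 1, bins 2 and 3 are empty, and only the outlier reaches bin 4.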

Equal-frequency binning

Bin by quantiles

k=4
w=[i/k for i in range(k+1)]
sunspots['counts_bin']=pd.qcut(sunspots.counts,q=w,labels=range(0,4))

# Without labels, the values are the interval ranges
0 (-0.001,15.6]
1 (-0.001,15.6]
2 (15.6,39.0]

sunspots['counts_bin'].hist()

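The equal-width bins earlier were not equal in height; these equal-frequency bins are.

You can also compute the quantiles at those proportions first: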

k=4
w1=sunspots['counts'].quantile([i/k for i in range(k+1)])

0 0
0.25 15.6
0.5 39
0.75 68.9
1 190.2

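sunspots['counts_bin']=pd.cut(sunspots['counts'],w1,labels=range(0,4),include_lowest=True) # include_lowest=True keeps the minimum, which otherwise falls outside the left-open first bin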
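
To see that both routes give the same equal-frequency bins, and why the lowest edge needs care when `pd.cut` is given quantile edges, here is a small sketch on invented data:

```python
import pandas as pd

counts = pd.Series(range(1, 21))  # hypothetical values 1..20
k = 4
labels = list(range(k))

# Route 1: let qcut compute the quantile edges
by_qcut = pd.qcut(counts, q=k, labels=labels)

# Route 2: compute the edges first, then cut on them
edges = counts.quantile([i / k for i in range(k + 1)]).to_numpy()
by_cut = pd.cut(counts, edges, labels=labels, include_lowest=True)

# Without include_lowest the minimum equals the open left edge and gets no bin
without_lowest = pd.cut(counts, edges, labels=labels)

print(by_qcut.value_counts().tolist())
print(by_qcut.tolist() == by_cut.tolist())
print(int(without_lowest.isna().sum()))
```

Each bin holds five values with either route, and exactly one value (the minimum) is lost when `include_lowest` is left at its default.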