Pandas Essentials: A Detailed Look at apply, the Row/Column Batch-Processing Function
Let's start with an example:
# coding=utf-8
import pandas as pd
df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6], 'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])
print(df)
df_new = df.apply(lambda x: x-1)
print('-' * 30, '\n', df_new, sep='')
   Col-1  Col-2  Col-3  Col-4
A      1      2      9      3
B      3      4      8      6
C      5      6      7      9
------------------------------
   Col-1  Col-2  Col-3  Col-4
A      0      1      8      2
B      2      3      7      5
C      4      5      6      8
apply usage and parameters
apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwds):
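func: The function applied to each column or each row. It can be a Python built-in function, a function from Pandas or another library, a user-defined function, or an anonymous (lambda) function.
axis: Whether the function is applied column by column or row by row. 0 or 'index' applies the function to each column; 1 or 'columns' applies it to each row. The default is 0.
raw: Whether each column/row is passed to the function as a Series or as an ndarray. raw is a bool and defaults to False (pass a Series); True passes an ndarray.
result_type: Takes effect only when axis=1 and controls the type and layout of the returned result. It accepts one of {'expand', 'reduce', 'broadcast', None}; the default is None.
args: Positional arguments passed on to func, after the column or row itself. args must be a tuple, so a single argument needs a trailing comma, e.g. args=('yes',).
**kwds: Additional keyword arguments passed on to func.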
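The raw parameter is not exercised by the examples in this article, so here is a minimal sketch of what it changes. The helper col_range and the list seen_types are hypothetical names introduced only for this illustration; the function records the type of object that apply hands to it.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6]}, index=['A', 'B', 'C'])

seen_types = []  # hypothetical helper list: records what apply passes in


def col_range(col):
    # Works for both a Series and an ndarray, since both have max() and min()
    seen_types.append(type(col).__name__)
    return col.max() - col.min()


as_series = df.apply(col_range)             # raw=False (default): each column is a Series
as_ndarray = df.apply(col_range, raw=True)  # raw=True: each column is a NumPy ndarray

print(sorted(set(seen_types)))
print(as_series)
print(as_ndarray)
```

Both calls should return the same per-column result; only the object type seen inside the function differs. raw=True is mainly useful when func is a NumPy-style reduction that does not need the index labels.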
Passing different kinds of functions
import numpy as np
df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6], 'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])
print(df)
df1 = df.apply(max)  # Python built-in function
print('-' * 30, '\n', df1, sep='')
df2 = df.apply(np.mean)  # NumPy function
print('-' * 30, '\n', df2, sep='')
df3 = df.apply(pd.DataFrame.min)  # Pandas method
print('-' * 30, '\n', df3, sep='')
   Col-1  Col-2  Col-3  Col-4
A      1      2      9      3
B      3      4      8      6
C      5      6      7      9
------------------------------
Col-1    5
Col-2    6
Col-3    9
Col-4    9
dtype: int64
------------------------------
Col-1    3.0
Col-2    4.0
Col-3    8.0
Col-4    6.0
dtype: float64
------------------------------
Col-1    1
Col-2    2
Col-3    7
Col-4    3
dtype: int64
def make_ok(s):
    # s is one column (a Series); append 'ok' to every element
    return pd.Series(['{}ok'.format(d) for d in s])


df4 = df.apply(make_ok)  # user-defined function
print('-' * 30, '\n', df4, sep='')
------------------------------
  Col-1 Col-2 Col-3 Col-4
0   1ok   2ok   9ok   3ok
1   3ok   4ok   8ok   6ok
2   5ok   6ok   7ok   9ok
Applying by row or by column
def make_ok(s):
    if isinstance(s, pd.Series):
        if s.name in df.columns:
            # s.name is a column label, so s is a column
            return pd.Series(['{}ok-col'.format(d) for d in s])
        else:
            # s.name is a row label, so s is a row
            return pd.Series(['{}ok-row'.format(d) for d in s])
    else:
        # s is a single element, e.g. when called through Series.apply
        return '{}ok'.format(s)


df5 = df.apply(make_ok, axis=0)  # apply to each column
print('-' * 30, '\n', df5, sep='')
df6 = df.apply(make_ok, axis=1)  # apply to each row
print('-' * 30, '\n', df6, sep='')
------------------------------
     Col-1    Col-2    Col-3    Col-4
0  1ok-col  2ok-col  9ok-col  3ok-col
1  3ok-col  4ok-col  8ok-col  6ok-col
2  5ok-col  6ok-col  7ok-col  9ok-col
------------------------------
         0        1        2        3
A  1ok-row  2ok-row  9ok-row  3ok-row
B  3ok-row  4ok-row  8ok-row  6ok-row
C  5ok-row  6ok-row  7ok-row  9ok-row
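Since result_type only takes effect with axis=1, this is a natural place for a small sketch of it. The function sum_prod is a hypothetical example that returns a plain list for each row; the three calls below differ only in how apply lays out those lists.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6]}, index=['A', 'B', 'C'])


def sum_prod(row):
    # Return a list-like result: [sum of the row, product of the row]
    return [row.sum(), row.prod()]


# Default (None): list-like results are kept as a Series of lists
as_lists = df.apply(sum_prod, axis=1)

# 'expand': each list is spread over new columns labelled 0, 1, ...
expanded = df.apply(sum_prod, axis=1, result_type='expand')

# 'broadcast': results are written back into the original shape,
# so the original row and column labels are kept
broadcast = df.apply(sum_prod, axis=1, result_type='broadcast')

print(as_lists)
print(expanded)
print(broadcast)
```

'broadcast' only works here because each returned list has as many elements as the frame has columns. 'reduce' is the opposite of 'expand': it asks for a Series when that is possible.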
Arguments of the function func
def yes_or_no(s, answer):
    # Fall back to 'yes' for anything other than 'yes' or 'no'
    if answer not in ('yes', 'no'):
        answer = 'yes'
    if isinstance(s, pd.Series):
        return pd.Series(['{}-{}'.format(d, answer) for d in s])
    else:
        return '{}-{}'.format(s, answer)


df7 = df.apply(yes_or_no, args=('yes',))
df7.index = ['A', 'B', 'C']  # restore the original row labels
print('-' * 30, '\n', df7, sep='')
df8 = df.apply(yes_or_no, args=('no',))
print('-' * 30, '\n', df8, sep='')
df9 = df.apply(yes_or_no, args=(0,))  # invalid answer, falls back to 'yes'
print('-' * 30, '\n', df9, sep='')
------------------------------
   Col-1  Col-2  Col-3  Col-4
A  1-yes  2-yes  9-yes  3-yes
B  3-yes  4-yes  8-yes  6-yes
C  5-yes  6-yes  7-yes  9-yes
------------------------------
  Col-1 Col-2 Col-3 Col-4
0  1-no  2-no  9-no  3-no
1  3-no  4-no  8-no  6-no
2  5-no  6-no  7-no  9-no
------------------------------
   Col-1  Col-2  Col-3  Col-4
0  1-yes  2-yes  9-yes  3-yes
1  3-yes  4-yes  8-yes  6-yes
2  5-yes  6-yes  7-yes  9-yes
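The examples above pass the extra argument through args. The same thing can be done with keyword arguments via **kwds, as in the sketch below. The function scale_and_shift and its parameters factor and offset are hypothetical names chosen for this illustration; keyword names should not collide with apply's own parameters such as axis or raw.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6]}, index=['A', 'B', 'C'])


def scale_and_shift(col, factor, offset=0):
    # col is one column (a Series); arithmetic keeps the original index
    return col * factor + offset


by_position = df.apply(scale_and_shift, args=(10,))          # factor passed through args
by_keyword = df.apply(scale_and_shift, factor=10, offset=1)  # both passed through **kwds
mixed = df.apply(scale_and_shift, args=(10,), offset=1)      # args and **kwds combined

print(by_position)
print(by_keyword)
print(mixed.equals(by_keyword))
```

Because scale_and_shift returns a Series that keeps the column's index, the row labels A, B, C survive without the manual df7.index fix used above.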
Passing multiple functions for aggregation
df10 = df.apply([np.max, np.min])
print('-' * 40, '\n', df10, sep='')
df11 = df.apply({'Col-1': np.mean, 'Col-2': np.min})
print('-' * 40, '\n', df11, sep='')
df12 = df.apply({'Col-1': [np.mean, np.median], 'Col-2': [np.min, np.mean]})
print('-' * 40, '\n', df12, sep='')
----------------------------------------
      Col-1  Col-2  Col-3  Col-4
amax      5      6      9      9
amin      1      2      7      3
----------------------------------------
Col-1    3.0
Col-2    2.0
dtype: float64
----------------------------------------
        Col-1  Col-2
mean      3.0    4.0
median    3.0    NaN
amin      NaN    2.0
Calling a function by its name string
df13 = df.apply('mean', axis=1)
print('-' * 30, '\n', df13, sep='')
df14 = df.apply(['mean', 'min'], axis=1)
print('-' * 30, '\n', df14, sep='')
------------------------------
A    3.75
B    5.25
C    6.75
dtype: float64
------------------------------
   mean  min
A  3.75  1.0
B  5.25  3.0
C  6.75  5.0
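A name string is resolved to the DataFrame method with the same name, so the first call above should give the same result as calling that method directly. The following short check is a sketch of that equivalence, with by_name and direct as illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6], 'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])

by_name = df.apply('mean', axis=1)  # function given as a name string
direct = df.mean(axis=1)            # the DataFrame method with the same name

print(by_name.equals(direct))
```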
Modifying the DataFrame itself
df15 = df.copy()
# Read one column, process it, and add the result to the DataFrame as a new column
df15['Col-x'] = df15['Col-1'].apply(make_ok)
print('-' * 40, '\n', df15, sep='')
# Read one row, process it, and add the result to the DataFrame as a new row
df15.loc['Z'] = df15.loc['A'].apply(yes_or_no, args=('yes',))
print('-' * 40, '\n', df15, sep='')
----------------------------------------
   Col-1  Col-2  Col-3  Col-4 Col-x
A      1      2      9      3   1ok
B      3      4      8      6   3ok
C      5      6      7      9   5ok
----------------------------------------
   Col-1  Col-2  Col-3  Col-4    Col-x
A      1      2      9      3      1ok
B      3      4      8      6      3ok
C      5      6      7      9      5ok
Z  1-yes  2-yes  9-yes  3-yes  1ok-yes
Using apply on a Series
s0 = df['Col-2'].apply(make_ok)
print('-' * 20, '\n', s0, sep='')
s = pd.Series(range(5), index=[alpha for alpha in 'abcde'])
print('-' * 20, '\n', s, sep='')
s1 = s.apply(make_ok)
print('-' * 20, '\n', s1, sep='')
--------------------
A    2ok
B    4ok
C    6ok
Name: Col-2, dtype: object
--------------------
a    0
b    1
c    2
d    3
e    4
dtype: int64
--------------------
a    0ok
b    1ok
c    2ok
d    3ok
e    4ok
dtype: object
s2 = s.apply(np.mean)
print('-' * 20, '\n', s2, sep='')
s3 = np.mean(s)
print('-' * 20, '\n', s3, sep='')
--------------------
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
--------------------
2.0
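The output above shows that Series.apply calls the function once per element, which is why s.apply(np.mean) returns every value unchanged (as a float) while np.mean(s) reduces the whole Series to 2.0. Extra arguments work the same way as with DataFrame.apply. The sketch below uses a hypothetical function power with a hypothetical parameter exponent to show this.

```python
import pandas as pd

s = pd.Series(range(5), index=list('abcde'))


def power(x, exponent=1):
    # x is a single element of the Series, not the Series itself
    return x ** exponent


squared = s.apply(power, exponent=2)  # extra argument passed by keyword
cubed = s.apply(power, args=(3,))     # extra argument passed through args

print(squared)
print(cubed)
```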