清洗数据,高效率7步走!
共 10933字,需浏览 22分钟
·
2020-10-08 07:36
导读:本文使用泰坦尼克数据集,完整介绍数据清洗 7 步的具体操作过程。
数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下:
Step1 : read csv
Step2 : preview data
Step3: check null value for every column
Step4: complete null value
Step5: feature engineering
Step 5.1: delete some features
Step 5.2: create new feature
Step6: encode for categories columns
Step 6.1: Sklearn LabelEncode
Step 6.2: Pandas get_dummies
Step 7: check for data cleaning
01 读入数据
这不废话吗,第一步就是读入数据。
data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw
结果:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
891 rows × 12 columns
02 数据预览
data_raw.info()
data_raw.describe(include='all')
结果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex:891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889
unique NaN NaN NaN 8912 NaN NaN NaN 681 NaN 1473
top NaN NaN NaN Hakkarainen, Mr. Pekka Pietari male NaN NaN NaN 1601 NaN G6 S
freq NaN NaN NaN 1577 NaN NaN NaN 7 NaN 4644
mean 446.0000000.3838382.308642 NaN NaN 29.6991180.5230080.381594 NaN 32.204208 NaN NaN
std 257.3538420.4865920.836071 NaN NaN 14.5264971.1027430.806057 NaN 49.693429 NaN NaN
min 1.0000000.0000001.000000 NaN NaN 0.4200000.0000000.000000 NaN 0.000000 NaN NaN
25% 223.5000000.0000002.000000 NaN NaN 20.1250000.0000000.000000 NaN 7.910400 NaN NaN
50% 446.0000000.0000003.000000 NaN NaN 28.0000000.0000000.000000 NaN 14.454200 NaN NaN
75% 668.5000001.0000003.000000 NaN NaN 38.0000001.0000000.000000 NaN 31.000000 NaN NaN
max 891.0000001.0000003.000000 NaN NaN 80.0000008.0000006.000000 NaN 512.329200 NaN N
03 检查null值
data1 = data_raw.copy(deep=True)
data1.isnull().sum()
结果:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。
04 补全空值
data1['Age'].fillna(data1['Age'].median(), inplace = True)
data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)
data1.isnull().sum()
补全操作check:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
05 特征工程
1. 干掉 3 列
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)
2. 增加 3 列
增加一列 FamilySize
data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1
data1
打印结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize
003 Braund, Mr. Owen Harris male 22.0107.2500 S 2
111 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.01071.2833 C 2
213 Heikkinen, Miss. Laina female 26.0007.9250 S 1
311 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.01053.1000 S 2
403 Allen, Mr. William Henry male 35.0008.0500 S 1
... ... ... ... ... ... ... ... ... ... ...
88602 Montvila, Rev. Juozas male 27.00013.0000 S 1
88711 Graham, Miss. Margaret Edith female 19.00030.0000 S 1
88803 Johnston, Miss. Catherine Helen "Carrie" female 28.01223.4500 S 4
88911 Behr, Mr. Karl Howell male 26.00030.0000 C 1
89003 Dooley, Mr. Patrick male 32.0007.7500 Q 1
891 rows × 10 columns
再创建一列:
data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)
再创建一列:
data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1
结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title
003 Braund, Mr. Owen Harris male 22.0107.2500 S 20 Mr
111 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.01071.2833 C 20 Mrs
213 Heikkinen, Miss. Laina female 26.0007.9250 S 11 Miss
311 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.01053.1000 S 20 Mrs
403 Allen, Mr. William Henry male 35.0008.0500 S 11 Mr
... ... ... ... ... ... ... ... ... ... ... ... ...
88602 Montvila, Rev. Juozas male 27.00013.0000 S 11 Rev
88711 Graham, Miss. Margaret Edith female 19.00030.0000 S 11 Miss
88803 Johnston, Miss. Catherine Helen "Carrie" female 28.01223.4500 S 40 Miss
88911 Behr, Mr. Karl Howell male 26.00030.0000 C 11 Mr
89003 Dooley, Mr. Patrick male 32.0007.7500 Q 11 Mr
891 rows × 12 columns
3. 分箱走起
data1['FareCut'] = pd.qcut(data1['Fare'], 4)
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1
结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareCut AgeCut
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr (-0.001, 7.91] (13.333, 26.667]
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss (7.91, 14.454] (13.333, 26.667]
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (26.667, 40.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev (7.91, 14.454] (26.667, 40.0]
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss (14.454, 31.0] (13.333, 26.667]
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss (14.454, 31.0] (26.667, 40.0]
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr (14.454, 31.0] (13.333, 26.667]
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (26.667, 40.0]
891 rows × 14 columns
06 编码
1. LabelEncoder 方法
使用 Sklearn 的 LabelEncoder
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1
结果 data1 选取某些列,算法模型终于能认出它们了,多不容易!
2. get_dummies 方法
get_dummies 将长 DataFrame 变为宽 DataFrame:
pd.get_dummies(data1['Sex'])
结果:
female male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
... ... ...
886 0 1
887 1 0
888 1 0
889 0 1
890 0 1
891 rows × 2 columns
而 LabelEncoder 编码后,仅仅是把 Female 编码为 0, male 编码为 1.
label.fit_transform(data1['Sex'])
0 1
1 0
2 0
3 0
4 1
..
886 1
887 0
888 0
889 1
890 1
Name: Sex_Code, Length: 891, dtype: int64
07 再 check
# Step 7: data cleaning check
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()
结果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex:891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code 891 non-null int64
Pclass 891 non-null int64
Embarked_Code 891 non-null int64
Title_Code 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex:891 entries, 0 to 890
Data columns (total 29 columns):
Pclass 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
FamilySize 891 non-null int64
IsAlone 891 non-null int64
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
Embarked_C 891 non-null uint8
Embarked_Q 891 non-null uint8
Embarked_S 891 non-null uint8
Title_Capt 891 non-null uint8
Title_Col 891 non-null uint8
Title_Don 891 non-null uint8
Title_Dr 891 non-null uint8
Title_Jonkheer 891 non-null uint8
Title_Lady 891 non-null uint8
Title_Major 891 non-null uint8
Title_Master 891 non-null uint8
Title_Miss 891 non-null uint8
Title_Mlle 891 non-null uint8
Title_Mme 891 non-null uint8
Title_Mr 891 non-null uint8
Title_Mrs 891 non-null uint8
Title_Ms 891 non-null uint8
Title_Rev 891 non-null uint8
Title_Sir 891 non-null uint8
Title_the Countess 891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB
Great !
Done~