[Machine Learning] A Few Tips for Balancing Your Dataset
机器学习初学者 (Machine Learning Beginners)
2021-05-10 20:47
Author | Praveen Thenraj
Translated by | VK
Source | Towards Data Science
import pandas as pd

# Load the training data and count the rows in each 'Top-up Month' class.
train = pd.read_csv('/Desktop/Files/train_data.csv')
print(train['Top-up Month'].value_counts())
No Top-up Service    106677
> 48 Months            8366
36-48 Months           3656
24-30 Months           3492
30-36 Months           3062
18-24 Months           2368
12-18 Months           1034
Name: Top-up Month, dtype: int64
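The `chunks` function in the Method section filters on `Top-up Month == 0` and `== 1`, while the counts above are string labels, so the column is presumably converted to a binary label somewhere in between. The step is not shown in the article. Below is a minimal sketch of one way it could be done, assuming "No Top-up Service" maps to 0 and every other label maps to 1. The helper name `binarize_topup` and the toy DataFrame are hypothetical and serve only as illustration.

```python
import pandas as pd

def binarize_topup(df, col='Top-up Month', major_label='No Top-up Service'):
    """Return a copy of df where col is 0 for the majority label and 1 otherwise.

    Hypothetical helper: the mapping (majority -> 0, everything else -> 1)
    is an assumption inferred from the filters used in chunks().
    """
    out = df.copy()
    out[col] = (out[col] != major_label).astype(int)
    return out

# Toy data for illustration only (not the real training set).
toy = pd.DataFrame({
    'Top-up Month': ['No Top-up Service', '> 48 Months',
                     'No Top-up Service', '12-18 Months']
})
toy_bin = binarize_topup(toy)
print(toy_bin['Top-up Month'].tolist())
```

The original frame is left untouched, so the multi-class labels remain available if they are needed later.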
Method
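def chunks(df, folds):
    # Assumes 'Top-up Month' is already binary: 0 = majority class, 1 = all minority classes combined.
    df_no_topup = df.loc[df['Top-up Month'] == 0]
    df_topup = df.loc[df['Top-up Month'] == 1]
    # Majority rows per fold (integer division, so any remainder rows are dropped).
    recs_no_topup = df_no_topup.shape[0] // folds
    start_no_topup = 0
    stop_no_topup = recs_no_topup
    list_df = []
    for fold in range(folds):
        # Take the next consecutive slice of majority rows and pair it with every minority row.
        fold_n = df_no_topup.iloc[start_no_topup:stop_no_topup, :]
        start_no_topup = stop_no_topup
        stop_no_topup = start_no_topup + recs_no_topup
        fold_df = pd.concat([fold_n, df_topup], axis=0)
        list_df.append(fold_df)
    return list_df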
Majority class rows initially - 106677
Number of folds (n) - 5
Majority class rows per fold (k) = 106677/5, rounded down - 21335
Minority class rows (all minority classes combined) - 21978
Total rows per fold (majority + minority) - 43313
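The figures above can be reproduced directly from the class counts printed earlier. The sketch below is only a check of that arithmetic; the variable names are illustrative and not from the original article.

```python
# Class counts as printed by value_counts() earlier in the article.
counts = {
    'No Top-up Service': 106677,
    '> 48 Months': 8366,
    '36-48 Months': 3656,
    '24-30 Months': 3492,
    '30-36 Months': 3062,
    '18-24 Months': 2368,
    '12-18 Months': 1034,
}
folds = 5

major = counts['No Top-up Service']
# All minority classes combined into one group.
minor_total = sum(v for k, v in counts.items() if k != 'No Top-up Service')

major_per_fold = major // folds                # integer division, as in chunks()
rows_per_fold = major_per_fold + minor_total   # each fold reuses every minority row
dropped = major - major_per_fold * folds       # majority rows that land in no fold

print(major_per_fold, minor_total, rows_per_fold, dropped)
# -> 21335 21978 43313 2
```

Because 106677 is not divisible by 5, two majority rows are left out of every fold.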
# df_train_main is the training DataFrame; its preparation is not shown here.
list_data = chunks(df_train_main, 5)
list_data_shape = [df.shape for df in list_data]
print(list_data_shape)
[(43313, 6), (43313, 6), (43313, 6), (43313, 6), (43313, 6)]
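One caveat concerns the distribution of the majority-class data added to each fold. Because we simply split the whole majority class into n folds, the distribution within each fold differs from the original distribution. In addition, when the task is treated as a multi-class problem, an imbalance remains between the smallest class (12-18 Months) and the majority class ("No Top-up Service"). For binary classification, however, this method works better.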
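To put rough numbers on the remaining imbalance, the sketch below compares class ratios before and after folding, using only the counts reported in this article. It assumes each fold keeps all minority rows, as `chunks` does; the variable names are illustrative.

```python
# Counts taken from the article's value_counts() output and fold summary.
major_total = 106677      # "No Top-up Service"
smallest_minor = 1034     # "12-18 Months"
minor_total = 21978       # all minority classes combined
folds = 5

major_per_fold = major_total // folds  # 21335 majority rows per fold

# Multi-class view: majority class versus the smallest minority class.
ratio_before = major_total / smallest_minor
ratio_after = major_per_fold / smallest_minor

# Binary view: majority class versus all minority classes combined.
binary_ratio = major_per_fold / minor_total

print(round(ratio_before, 1), round(ratio_after, 1), round(binary_ratio, 2))
# -> 103.2 20.6 0.97
```

Within a fold the binary split is close to 1:1, whereas the majority class still outnumbers the 12-18 Months class by roughly 20 to 1, which is consistent with the method suiting binary classification better.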