A Hands-On Guide to Modeling Competitions (Baseline Code Walkthrough)


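1. Competition Background

As technology has advanced, banks have built a wide variety of online and offline customer touchpoints to serve everyday needs such as routine business handling and channel transactions. With so many customers, banks need a more complete and accurate view of their wealth-management needs. In running the wealth-management product business, a bank has to work out how attractive each product is to different customer groups, so that it can identify target groups and market to them specifically.

This competition provides customer behavior, asset information, and product transaction information from a real business scenario as the modeling data. On one hand, the organizers hope contestants will demonstrate practical data mining skills; on the other, contestants in the second round must propose a marketing plan based on their modeling results, to show the value of data analysis.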

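2. Problem Description

(1) Task

The task is to predict the probability that a customer purchases each type of wealth-management product or certificate of deposit. The predictions then serve as the basis for a marketing plan.

(2) Data usage rules

No external data may be used. The provided data has been anonymized, and some continuous fields (such as interest rates, prices, and financial amounts) have undergone a linear transformation. This does not affect modeling or prediction results.

(3) A/B leaderboard rules

The preliminary round uses an A/B leaderboard format and lasts one and a half months. For the first month the leaderboard shows leaderboard A scores (split into public and private parts at a 6:4 ratio). For the final half month it switches to leaderboard B (also split into public and private parts) and shows leaderboard B scores. Each contestant's highest submitted score counts, and the final preliminary score = leaderboard A score × 0.3 + leaderboard B score × 0.7.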

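3. Evaluation Metric

The preliminary round uses the A/B leaderboard format, and the final preliminary score = 0.3 × (F2 on the leaderboard A test set) + 0.7 × (F2 on the leaderboard B test set), where:

recall = TP / (TP + FN)

precision = TP / (TP + FP)

F2 = 5 * recall * precision / (4 * precision + recall)

TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. The formulas above give the F2 score for the positive class.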
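For readers wondering where the coefficients 5 and 4 come from: the formula above matches the general F-beta score with β = 2, which weights recall more heavily than precision. The counts in the worked example below are made up purely for illustration.

```latex
F_\beta = \frac{(1+\beta^2)\cdot \mathrm{precision}\cdot \mathrm{recall}}
               {\beta^2\cdot \mathrm{precision} + \mathrm{recall}}
\quad\xrightarrow{\;\beta = 2\;}\quad
F_2 = \frac{5\cdot \mathrm{precision}\cdot \mathrm{recall}}
           {4\cdot \mathrm{precision} + \mathrm{recall}}

% Hypothetical counts: TP = 30, FP = 60, FN = 10
\mathrm{recall} = \frac{30}{30+10} = 0.75, \qquad
\mathrm{precision} = \frac{30}{30+60} = \frac{1}{3}, \qquad
F_2 = \frac{5 \cdot \frac{1}{3} \cdot 0.75}{4 \cdot \frac{1}{3} + 0.75}
    = \frac{1.25}{2.0833\ldots} = 0.6
```

Even with precision at only one third, the score stays at 0.6 because recall is high, which is why a low decision threshold tends to do well under this metric.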

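4. Data Description

The core task is to use customers' historical records from July, August, and September to predict whether they will make a purchase in October. The organizers provide a large number of tables, so they are not covered in detail here. See the organizers' data description for specifics.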

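5. Baseline Approach

This is a very typical structured-data competition, so we again use the traditional approach of feature engineering plus lgb (xgb or cat would also work). Because the data is strongly time-dependent, for offline validation we train on the July and August data and validate on September. For the online submission we train on all three months (July, August, and September).

Note that the online metric is F2, which requires the submitted predictions to be the integers 0 and 1, whereas the model outputs probabilities. A threshold therefore has to be chosen, and we recommend searching for it manually during offline validation. The code is as follows:

    def search_best_thre(y_true, y_pre):
        best_f2 = 0
        best_th = 0
        # try thresholds 0.030, 0.031, ..., 0.129
        for i in range(100):
            th = 0.03 + i / 1000
            # binarize a copy of the predicted probabilities at this threshold
            y_pre_copy = y_pre.copy()
            y_pre_copy[y_pre_copy >= th] = 1
            y_pre_copy[y_pre_copy < th] = 0

            temp_f2 = f2_score(y_true, y_pre_copy)

            # log only when the score improves
            if temp_f2 > best_f2:
                best_f2 = temp_f2
                best_th = th
                print(f'thre: {best_th} f2 score: {best_f2}')
        return best_th

Here, f2_score is our own hand-written evaluation function that computes the F2 score.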
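The article does not show f2_score itself, so the following is only a minimal sketch of what such a function could look like, assuming binary 0/1 labels and following the formula from section 3. The toy arrays and the 0.3 threshold are made up for illustration.

```python
import numpy as np

def f2_score(y_true, y_pred):
    """F2 for binary 0/1 labels: 5 * P * R / (4 * P + R)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        # no true positives: precision and recall are 0 or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 5 * precision * recall / (4 * precision + recall)

# Toy example (hypothetical data): binarize probabilities, then score.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.4, 0.1, 0.5, 0.2, 0.1, 0.05])
y_pred = (y_prob >= 0.3).astype(int)
score = f2_score(y_true, y_pred)  # TP=3, FP=1, FN=1
```

Returning 0.0 when there are no true positives avoids a division by zero at thresholds so high that nothing is predicted positive.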

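6. Feature Engineering

6.1 Customer information table

Here we simply merge the customer information table into the main table to pick up the customer attributes. The code is as follows:

    import gc
    import pandas as pd

    # Customer information table
    # (data_root2 and the main table df are defined elsewhere in the baseline, not shown in this article)
    d = pd.read_csv(data_root2 + 'd.csv')
    df = df.merge(d, on='core_cust_id', how='left')
    del d
    gc.collect()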

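6.2 Customer risk table

A customer's risk level is reassessed at intervals, so a single customer may have several risk ratings. We keep only the most recent one: sort by customer and assessment date, then de-duplicate on customer while keeping only the last row. This leaves each customer's latest risk assessment. The code is as follows:

    # Customer risk table
    e = pd.read_csv(data_root2 + 'e.csv')
    # sort by customer and assessment date (e2), then keep the latest row per customer
    e = e.sort_values(['core_cust_id', 'e2'])
    e = e.drop_duplicates(subset=['core_cust_id'], keep='last')
    # e1 is the risk rating
    df = df.merge(e[['core_cust_id', 'e1']], on='core_cust_id', how='left')
    del e
    gc.collect()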
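To see the "sort, then keep the last row" step in isolation, here is a small sketch on made-up data. The column names follow the code above, but the values and the ISO date format are assumptions.

```python
import pandas as pd

# Hypothetical risk records: e1 = rating, e2 = assessment date (ISO strings).
e = pd.DataFrame({
    'core_cust_id': ['A', 'A', 'B', 'A', 'B'],
    'e1': [2, 3, 1, 4, 5],
    'e2': ['2021-03-01', '2021-08-15', '2021-05-20', '2021-01-10', '2021-02-02'],
})

# Sort by customer and date, then keep only the last (most recent) row per customer.
latest = (e.sort_values(['core_cust_id', 'e2'])
           .drop_duplicates(subset=['core_cust_id'], keep='last')
           .reset_index(drop=True))
# Customer A keeps the 2021-08-15 rating (3), customer B the 2021-05-20 rating (1).
```

Sorting the dates as strings only works when the format sorts chronologically, as ISO dates do. If the real e2 column uses another format, converting it with pd.to_datetime before sorting would be the safer choice.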

6.3 Customer asset information table

A customer's assets also change over time. Here we compute the monthly mean of each asset field per customer and use these means as features. The implementation is as follows:

    # Asset information table
    f = pd.read_csv(data_root2 + 'f.csv')
    f.fillna(0, inplace=True)
    # map the sorted unique values of f22 one-to-one onto the sorted unique values of df['a3']
    map_dict = dict(zip(sorted(f['f22'].unique()), sorted(df['a3'].unique())))
    f['f22'] = f['f22'].map(map_dict)
    asset_cols = [f'f{i}' for i in range(2, 22)]  # 'f2' ... 'f21'
    for c in asset_cols:
        # strip thousands separators, then convert to float
        f[c] = f[c].apply(lambda x: str(x).replace(',', '')).astype('float')

    # mean of every asset column per customer and period: f2_mean ... f21_mean
    f_stat = f.groupby(['core_cust_id', 'f22']).agg(
        **{f'{c}_mean': (c, 'mean') for c in asset_cols}
    ).reset_index()
    df = df.merge(f_stat, left_on=['core_cust_id', 'a3'],
                  right_on=['core_cust_id', 'f22'], how='left')
    del f, f_stat
    gc.collect()
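The least obvious step above is map_dict, which pairs the period values of the two tables purely by sorted position. The sketch below reproduces it on toy data. The formats of f22 and a3 are assumptions (the article does not show them), and only one asset column is used.

```python
import pandas as pd

# Hypothetical main table: a3 is assumed to hold one period label per row.
df = pd.DataFrame({
    'core_cust_id': ['A', 'A', 'B'],
    'a3': ['2021-07-01', '2021-08-01', '2021-08-01'],
})
# Hypothetical asset table: f22 holds the same periods in another format,
# f2 is an amount stored as text with thousands separators.
f = pd.DataFrame({
    'core_cust_id': ['A', 'A', 'A', 'B'],
    'f22': ['20210731', '20210731', '20210831', '20210831'],
    'f2': ['1,000', '3,000', '500', '2,500'],
})

f_keys = sorted(f['f22'].unique())
df_keys = sorted(df['a3'].unique())
# Pairing by position is only meaningful if both sides have the same number of periods.
assert len(f_keys) == len(df_keys)
map_dict = dict(zip(f_keys, df_keys))

f['f22'] = f['f22'].map(map_dict)
f['f2'] = f['f2'].str.replace(',', '', regex=False).astype('float')

# Mean per customer and period, then attach to the main table.
f_stat = (f.groupby(['core_cust_id', 'f22'])
            .agg(f2_mean=('f2', 'mean'))
            .reset_index())
out = df.merge(f_stat, left_on=['core_cust_id', 'a3'],
               right_on=['core_cust_id', 'f22'], how='left')
```

If one table has more distinct periods than the other, zip silently truncates and the pairing shifts, so a length check like the one above is a cheap safeguard.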

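6.4 Account transaction table

Transaction history is just as important. The account transaction table contains both debit-side and credit-side information. We aggregate the two sides separately, computing monthly statistics of the amounts involved for each side, and join them onto the main table as historical features. In other words, to predict whether a customer will buy in October we use that customer's September transactions as the history feature, and to predict September we use the August transactions. The implementation is as follows:

    # Table S: account transactions. s3 and s6 are customer IDs
    # that can be joined with core_cust_id in the other tables.
    s = pd.read_csv(data_root2 + 's.csv')
    # s7 is the transaction date; take its month part
    s['month'] = s['s7'].apply(lambda x: x.split('-')[1]).astype('int32')
    # s4 is the amount; strip thousands separators
    s['s4'] = s['s4'].apply(lambda x: str(x).replace(',', '')).astype('float')

    tmp_s3 = s.groupby(['s3', 'month']).agg(
        s3_s4_sum=('s4', 'sum'),
        s3_s4_mean=('s4', 'mean'),
        s3_s4_count=('s4', 'count'),
    ).reset_index()
    tmp_s6 = s.groupby(['s6', 'month']).agg(
        s6_s4_sum=('s4', 'sum'),
        s6_s4_mean=('s4', 'mean'),
        s6_s4_count=('s4', 'count'),
    ).reset_index()

    # shift by one month so that month m statistics become features for month m + 1
    tmp_s3['month'] = tmp_s3['month'] + 1
    tmp_s6['month'] = tmp_s6['month'] + 1
    tmp_s3 = tmp_s3.rename(columns={'s3': 'core_cust_id'})
    tmp_s6 = tmp_s6.rename(columns={'s6': 'core_cust_id'})
    df = df.merge(tmp_s3, on=['core_cust_id', 'month'], how='left')
    df = df.merge(tmp_s6, on=['core_cust_id', 'month'], how='left')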
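The month + 1 shift is what turns these statistics into lag features. A minimal sketch on made-up transactions, using only the s3 side and assuming the main table has a month column as in the merge above:

```python
import pandas as pd

# Hypothetical transactions: s3 = customer ID, s4 = amount (text), s7 = date.
s = pd.DataFrame({
    's3': ['A', 'A', 'A', 'B'],
    's4': ['1,200.50', '300', '40', '75'],
    's7': ['2021-08-03', '2021-08-20', '2021-09-05', '2021-09-11'],
})
s['month'] = s['s7'].str.split('-').str[1].astype('int32')
s['s4'] = s['s4'].str.replace(',', '', regex=False).astype('float')

tmp = (s.groupby(['s3', 'month'])
         .agg(s3_s4_sum=('s4', 'sum'), s3_s4_count=('s4', 'count'))
         .reset_index()
         .rename(columns={'s3': 'core_cust_id'}))
# August statistics are stored under month 9, September statistics under month 10.
tmp['month'] = tmp['month'] + 1

# Hypothetical main table: one row per customer and target month.
df = pd.DataFrame({'core_cust_id': ['A', 'A', 'B', 'B'],
                   'month': [9, 10, 9, 10]})
out = df.merge(tmp, on=['core_cust_id', 'month'], how='left')
```

Customer B has no August transactions, so the row for B in month 9 gets NaN features. Gradient-boosting libraries such as LightGBM can handle missing values directly, so leaving them as NaN is usually acceptable.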

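6.5 App click behavior table

This table records interactions between customers and products. Here we simply compute count statistics over those interactions. The code is as follows:

    # App click behavior table
    r = pd.read_csv(data_root2 + 'r.csv')
    # cross_enc is a helper from the baseline (not shown in this article) that adds count features
    r = cross_enc(r, 'core_cust_id', 'prod_code')
    # keep each customer's most recent click record (r5 is the sort key)
    r = r.sort_values(['core_cust_id', 'r5']).reset_index(drop=True)
    r = r.drop_duplicates(subset=['core_cust_id'], keep='last')
    df = df.merge(
        r[['core_cust_id', 'prod_code',
           'core_cust_id_count', 'prod_code_count',
           'core_cust_id_prod_code_count',
           'cross_core_cust_id_prod_code_count_div_core_cust_id_count',
           'cross_core_cust_id_prod_code_count_div_prod_code_count']],
        on=['core_cust_id', 'prod_code'], how='left')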
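Since cross_enc is not shown in the article, the following is a hypothetical reconstruction inferred only from the column names it is expected to produce: a count per value of each key, a count per pair, and the two ratios. The real helper may differ.

```python
import pandas as pd

def cross_enc(df, f1, f2):
    """Hypothetical reconstruction: count and ratio features for the pair (f1, f2)."""
    df = df.copy()
    # how often each f1 value and each f2 value appears
    df[f'{f1}_count'] = df.groupby(f1)[f1].transform('count')
    df[f'{f2}_count'] = df.groupby(f2)[f2].transform('count')
    # how often each (f1, f2) pair appears
    pair = f'{f1}_{f2}_count'
    df[pair] = df.groupby([f1, f2])[f1].transform('count')
    # share of the pair within all rows of its f1 value / its f2 value
    df[f'cross_{pair}_div_{f1}_count'] = df[pair] / df[f'{f1}_count']
    df[f'cross_{pair}_div_{f2}_count'] = df[pair] / df[f'{f2}_count']
    return df

# Toy click log (made-up data).
r = pd.DataFrame({
    'core_cust_id': ['A', 'A', 'A', 'B'],
    'prod_code': ['P1', 'P1', 'P2', 'P1'],
})
out = cross_enc(r, 'core_cust_id', 'prod_code')
```

In the toy log, customer A clicked P1 in two of three clicks, so the first ratio for those rows is 2/3. It can be read as how concentrated a customer's attention is on one product.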

6.6 Target encoding

Here we apply five-fold target encoding to the customer ID and the product ID. The code is as follows:

    # Five-fold target encoding
    from sklearn.model_selection import StratifiedKFold
    from tqdm import tqdm

    label = 'y'
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
    enc_list = ['core_cust_id', 'prod_code']
    for f in tqdm(enc_list):
        # float columns; train_df and test_df are assumed to have a default 0..n-1 index
        train_df[f + '_target_enc'] = 0.0
        test_df[f + '_target_enc'] = 0.0
        for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df[label])):
            trn_x = train_df[[f, label]].iloc[trn_idx].reset_index(drop=True)
            val_x = train_df[[f]].iloc[val_idx].reset_index(drop=True)
            # mean label per category, computed on the training folds only
            enc_df = trn_x.groupby(f, as_index=False).agg(
                **{f + '_target_enc': (label, 'mean')})
            val_x = val_x.merge(enc_df, on=f, how='left')
            test_x = test_df[[f]].merge(enc_df, on=f, how='left')
            # categories unseen in the training folds fall back to the global mean
            val_x[f + '_target_enc'] = val_x[f + '_target_enc'].fillna(train_df[label].mean())
            test_x[f + '_target_enc'] = test_x[f + '_target_enc'].fillna(train_df[label].mean())
            train_df.loc[val_idx, f + '_target_enc'] = val_x[f + '_target_enc'].values
            # the test encoding is the average over the five folds
            test_df[f + '_target_enc'] += test_x[f + '_target_enc'].values / skf.n_splits

7. Baseline Results

After building the features, we train lgb on the resulting data. The offline threshold search prints the following:

    thre: 0.03 f2 score: 0.3942484810161383
    thre: 0.031 f2 score: 0.39657528516049867
    thre: 0.032 f2 score: 0.39895684622669514
    thre: 0.033 f2 score: 0.40054636211507927
    thre: 0.034 f2 score: 0.4026079162691184
    thre: 0.034999999999999996 f2 score: 0.4044912629302788
    thre: 0.036 f2 score: 0.40612958944151645
    thre: 0.037 f2 score: 0.40753496138593204
    thre: 0.038 f2 score: 0.40927755967690993
    thre: 0.039 f2 score: 0.410503800118327
    thre: 0.04 f2 score: 0.4126804328858547
    thre: 0.040999999999999995 f2 score: 0.41439493397173527
    thre: 0.041999999999999996 f2 score: 0.414933881594318
    thre: 0.043 f2 score: 0.4164246353831421
    thre: 0.044 f2 score: 0.417484643151162
    thre: 0.045 f2 score: 0.4190938253084154
    thre: 0.046 f2 score: 0.4201667149431947
    thre: 0.047 f2 score: 0.4213196324089531
    thre: 0.048 f2 score: 0.4225329108548656
    thre: 0.049 f2 score: 0.42326606470288314
    thre: 0.05 f2 score: 0.4239773348575973
    thre: 0.051000000000000004 f2 score: 0.4246038365304421
    thre: 0.052 f2 score: 0.42623418944115027
    thre: 0.054 f2 score: 0.4272477986135949
    thre: 0.055 f2 score: 0.4281746644036414
    thre: 0.057999999999999996 f2 score: 0.428551417415209
    thre: 0.059 f2 score: 0.4293245411428671
    thre: 0.06 f2 score: 0.4303395127494851
    thre: 0.061 f2 score: 0.43075245365321696
    thre: 0.062 f2 score: 0.43146751165763464
    thre: 0.063 f2 score: 0.43217145548751756
    thre: 0.064 f2 score: 0.4324398249452955
    thre: 0.065 f2 score: 0.433346787509632
    thre: 0.066 f2 score: 0.4336518242520382

The online score is 0.29, against a best offline F2 of about 0.434 at a threshold of 0.066, so the gap between offline and online results in this competition is fairly large.

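8. Outlook

(1) This baseline uses only a few of the tables. Many product-related tables are still unused and could be added.

(2) Try other boosting models, such as xgb and cat.

(3) Try automated hyperparameter tuning tools to tune lgb.

(4) Try model ensembling.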
