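RMB 100,000 total prize pool | Baseline officially released for the Future Cup AI Challenge (未来杯 AI 挑战赛)
Can an algorithm judge that a paper from an obscure corner is a better read for you than one that is all over social media?
This year's Future Cup AI Challenge is built around exactly this question: the task is to predict which scholars a given paper will appeal to. The competition, jointly hosted by Zhipu.AI (智谱・AI) and AI TIME, has officially launched (the competition page is available via "Read the original").
The competition is still in progress, with a total prize pool of RMB 100,000, and uses a dataset provided by AMiner. For a detailed walkthrough of the task, the dataset, and solution ideas, see the 数据实战派 article 《10万元奖金:2021未来杯人工智能科技探索大赛赛题详解》. Building on that article, this post introduces the baseline for the competition.
Baseline address: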
https://fs.163.com/fs/display/?file=uC5xvCqblbWSGR0PpITsZx8lDrLyAiNgbAEpnVKGwqRlugVbRQA-vZO5H4uvsTi23KynHa_UhDxa_aE1_KSbGQ
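I. Data Processing
1. Splitting the data into train/valid/test sets
We split the provided train data into train/valid/test sets for offline training and testing.
Splitting principle: we build paper-expert pairs and use them to learn an embedding for each paper and for each expert. Following the competition requirements, we then perform top-K recall, retrieving 500 relevant experts for every paper. Accordingly, for each paper we take all the experts that interacted with it, randomly pick one as test and another as valid, and keep the rest as train. When a paper has fewer than 3 interacting experts, the paper and all of its experts go into the train data.
2. Data preprocessing
As described above, our training samples are paper-expert pairs. Each side of a pair needs preprocessing, described in turn below.
Paper: in the given dataset, each paper has several attributes, including id, title, abstract, keywords, and so on (in Chinese and English versions). We use three of them, title, abstract, and keywords (Chinese version), as the paper's features. Processing these features requires NLP techniques; here we feed them to oag-bert. Note that oag-bert takes a sequence as input, whereas keywords is a list, so we convert it into a sequence by joining the keywords in order, separated by spaces. title and abstract need no extra processing. Our code for processing paper data is given below, for reference only.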
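The splitting rule can be sketched in a few lines. This is a minimal illustration rather than the baseline's own code: the function name split_pairs, the seed argument, and the toy ids are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_pairs(pairs, seed=0):
    """Split (paper_id, expert_id) pairs into train/valid/test per paper.

    For each paper with at least 3 experts, one random expert goes to test,
    one to valid, and the rest to train. Papers with fewer than 3 experts
    are kept entirely in train.
    """
    rng = random.Random(seed)
    experts_by_paper = defaultdict(list)
    for pid, eid in pairs:
        experts_by_paper[pid].append(eid)

    train, valid, test = [], [], []
    for pid, experts in experts_by_paper.items():
        if len(experts) < 3:
            train.extend((pid, eid) for eid in experts)
            continue
        shuffled = experts[:]
        rng.shuffle(shuffled)
        test.append((pid, shuffled[0]))
        valid.append((pid, shuffled[1]))
        train.extend((pid, eid) for eid in shuffled[2:])
    return train, valid, test


# toy data: p1 has 4 experts, p2 has only 2 (so p2 stays in train)
pairs = [("p1", "e1"), ("p1", "e2"), ("p1", "e3"), ("p1", "e4"),
         ("p2", "e1"), ("p2", "e5")]
train, valid, test = split_pairs(pairs)
```

Holding out one expert per paper keeps every paper present in train, so its embedding is still learned from the remaining pairs.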
def get_papers(self, data_dict):
    paper_infos = {}
    for iid in data_dict:
        paper_id, title, abstract, keywords, year = (
            iid["id"], iid["title"], iid["abstract"], iid["keywords"], iid["year"])
        # process keywords: join the list into one space-separated sequence
        str_keywords = " ".join(keywords) if keywords else ""
        # check data: skip papers that have no title, abstract, or keywords
        if title == "" and abstract == "" and str_keywords == "":
            print("unexisting title and abstract and keywords paper data....")
        else:
            infos = {"title": title, "abstract": abstract, "keywords": str_keywords, "year": year}
            if paper_id in paper_infos:
                print("repeat paper id ")
            else:
                paper_infos[paper_id] = infos
    return paper_infos
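Expert: in the given dataset, each expert has several attributes, including id, interests, tags, pub_info, and so on. We use interests as the expert's research interests; if interests is missing we fall back to tags, and if tags is missing too, interests is left empty. We also use pub_info as a feature: it contains the papers the expert has published, and for each of them we again take title, abstract, and keywords as features. The preprocessing code for expert data: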
def get_experts(self, data_dict):
    expert_infos = {}
    for expert in data_dict:
        expert_id = expert["id"]
        pub_info = expert["pub_info"]
        if expert.get("interests", None) is not None:
            interests = expert["interests"]
        elif expert.get("tags", None) is not None:
            interests = expert["tags"]
        else:
            interests = []
        # process interests: join the 't' field of every interest with spaces
        str_interests = " ".join(interest["t"] for interest in interests)
        # process pub_info
        pub_infos = {}
        for iid in pub_info:
            pid = iid["id"]
            title = iid["title"] if iid.get("title", None) is not None else ""
            abstract = iid["abstract"] if iid.get("abstract", None) is not None else ""
            keywords = iid["keywords"] if iid.get("keywords", None) is not None else []
            infos = {"title": title, "abstract": abstract, "keywords": keywords}
            if pid not in pub_infos:
                if title == "" and abstract == "" and keywords == []:
                    # publication carries no usable text, skip it
                    print("pub_info ", iid)
                else:
                    pub_infos[pid] = infos
        # check expert data: remove experts with neither interests nor pub_info
        if str_interests == "" and pub_infos == {}:
            print("unexisting_id", expert_id)
        elif expert_id not in expert_infos:
            info = {"interests": str_interests, "pub_info": pub_infos}
            expert_infos[expert_id] = info
    return expert_infos
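II. Model Selection
We use oag-bert as the model for learning embeddings.
oag-bert takes each paper's title, keywords, and abstract as input and produces an embedding for that paper.
To do so, we first need the token representation of each paper's title, keywords, abstract, and so on. The details are covered in the section on batch data.
III. Batch Data
We need batch data to finetune oag-bert. Note that the given dataset contains only positive paper-expert interactions and no negative samples. For every positive paper-expert pair we therefore select neg_num negative samples: all experts the paper has not interacted with form the candidate set, from which we randomly sample neg_num. This yields batch data containing both positive and negative samples. The code is below; note that each training sample has the form [one paper, one positive expert, neg_num negative experts]: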
import random

def generate_batch_data(self, paper_infos, experts_infos, batch, neg_num):
    batch_infos_anchor = []
    batch_infos_pos = []
    batch_infos_neg = []
    # generate anchor, pos, neg
    for pid, eid in batch:
        p_infos = paper_infos[pid]
        e_infos = experts_infos[eid]
        # anchor: the paper itself
        batch_infos_anchor.append({pid: p_infos})
        # pos: the expert that interacted with the paper
        batch_infos_pos.append({pid: e_infos})
        # negs: random.sample neg_num experts the paper has not interacted with
        candidates = set(experts_infos.keys()) - set(self.train_data_dict[pid])
        negs_id = random.sample(list(candidates), neg_num)
        negs_infos = {}
        for neg in negs_id:
            if neg in negs_infos:
                print("repeat negs in negs_info ")
                continue
            # an expert with no interests and no pub_info is useless as a negative:
            # sample another one until a usable expert is found
            while experts_infos[neg]["interests"] == "" and experts_infos[neg]["pub_info"] == {}:
                print("experts_empty", experts_infos[neg])
                print("expert_id", neg)
                remaining = list(candidates - set(negs_id) - set(negs_infos))
                print("candidate_negs", len(remaining))
                neg = random.sample(remaining, 1)[0]
                print("neg_another", neg)
            negs_infos[neg] = experts_infos[neg]
        batch_infos_neg.append(negs_infos)
    return batch_infos_anchor, batch_infos_pos, batch_infos_neg
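The negative-sampling step on its own can be sketched as follows. This is a simplified, standalone illustration: the helper name sample_negatives, the rng parameter, and the toy ids are assumptions for the example and are not part of the baseline.

```python
import random


def sample_negatives(all_experts, interacted, neg_num, rng=random):
    """Return neg_num distinct expert ids the paper has not interacted with."""
    # sort so the result is reproducible for a seeded rng
    candidates = sorted(set(all_experts) - set(interacted))
    if len(candidates) < neg_num:
        raise ValueError("not enough candidate negatives")
    return rng.sample(candidates, neg_num)


all_experts = ["e%d" % i for i in range(10)]
interacted = ["e1", "e4"]  # experts that interacted with the paper
negs = sample_negatives(all_experts, interacted, 3, random.Random(0))
```

Because the candidate set excludes every interacting expert, a sampled negative can never collide with a positive for the same paper.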
The batch data obtained above cannot be fed to oag-bert directly: the sequences must first be tokenized, that is, converted to tokens. The code for obtaining the tokens:
def get_batch_tokens(self, infos, flag):
    batch_tokens = []
    for info in infos:
        tokens_dict = {}
        for p_id, p_info in info.items():
            # get tokens
            tokens = self.build_bert_inputs(p_info, flag)
            # print("tokens...", tokens)
            if p_id in tokens_dict:
                print("repeat p_id ")
            else:
                tokens_dict[p_id] = tokens
        batch_tokens.append(tokens_dict)
    return batch_tokens

def build_bert_inputs(self, p_info, flag):
    if flag == "anchor":
        # paper: title & abstract & keywords
        abstract = p_info["abstract"] if p_info.get("abstract", None) is not None else ""
        title = p_info["title"] if p_info.get("title", None) is not None else ""
        keywords = p_info["keywords"] if p_info.get("keywords", None) is not None else ""
        return self.oagbert.build_inputs(title=title, abstract=abstract, concepts=keywords)
    elif flag == "pos" or flag == "neg":
        # experts: interests are passed in through the title field
        interests = p_info["interests"] if p_info.get("interests", None) is not None else ""
        e_tokens = self.oagbert.build_inputs(title=interests, abstract="")
        # process experts' pub infos
        expert_pub_infos = p_info["pub_info"]
        pub_tokens = {}
        for pid, expert_pub_info in expert_pub_infos.items():
            title = expert_pub_info["title"]
            abstract = expert_pub_info["abstract"]
            keywords_list = expert_pub_info["keywords"]
            # join the keyword list into one space-separated sequence
            keywords = " ".join(keywords_list)
            p_tokens = self.oagbert.build_inputs(title=title, abstract=abstract, concepts=keywords)
            if pid in pub_tokens:
                print("repeat pid ")
            else:
                pub_tokens[pid] = p_tokens
        tokens = {"interests": e_tokens, "pub_info": pub_tokens}
        return tokens
    else:
        raise Exception("undefine flag")
IV. Obtaining the Paper Embedding
Once we have a paper's tokens, we can feed them to oag-bert to obtain the paper's embedding. A brief explanation:
Calling get_batch_tokens() gives us the paper's tokens, which consist of input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, and num_spans. oag-bert takes input_ids, input_masks, token_type_ids, position_ids, and position_ids_second as input and outputs the paper's embedding, pooled_output:
input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = token
_, pooled_output = model.bert.forward(
    input_ids=torch.LongTensor(input_ids).unsqueeze(0).cuda(),
    token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0).cuda(),
    attention_mask=torch.LongTensor(input_masks).unsqueeze(0).cuda(),
    output_all_encoded_layers=False,
    checkpoint_activations=False,
    position_ids=torch.LongTensor(position_ids).unsqueeze(0).cuda(),
    position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0).cuda())
V. Obtaining the Expert Embedding
Once we have an expert's tokens, we can feed them to oag-bert to obtain the expert's embedding. Note that the expert embedding has two parts:
1. the embedding of the expert's interests;
2. the embedding of the expert's pub_info.
We start with the interests embedding.
Calling get_batch_tokens() gives us the tokens of the expert's interests, which again consist of input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, and num_spans. oag-bert takes input_ids, input_masks, token_type_ids, position_ids, and position_ids_second as input and outputs the embedding of the expert's interests, pooled_output_expert_interests. Note that if the expert's interests are empty, the interests embedding is None:
# interests process
interests = token["interests"]
# print("interests", interests)
input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = interests
if input_ids != []:
    _, pooled_output_expert_interests = model.bert.forward(
        input_ids=torch.LongTensor(input_ids).unsqueeze(0).cuda(),
        token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0).cuda(),
        attention_mask=torch.LongTensor(input_masks).unsqueeze(0).cuda(),
        output_all_encoded_layers=False,
        checkpoint_activations=False,
        position_ids=torch.LongTensor(position_ids).unsqueeze(0).cuda(),
        position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0).cuda())
else:
    pooled_output_expert_interests = None
Next comes the pub_info embedding. In essence, this means embedding every paper the expert has published, in the same way as the paper embedding is obtained:
# pub_info of experts process
pub_info = token["pub_info"]
pub_info_embed = []
for pid, p_token in pub_info.items():
    input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = p_token
    if input_ids != []:
        _, pooled_output_expert_pub = model.bert.forward(
            input_ids=torch.LongTensor(input_ids).unsqueeze(0).cuda(),
            token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0).cuda(),
            attention_mask=torch.LongTensor(input_masks).unsqueeze(0).cuda(),
            output_all_encoded_layers=False,
            checkpoint_activations=False,
            position_ids=torch.LongTensor(position_ids).unsqueeze(0).cuda(),
            position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0).cuda())
        pub_info_embed.append(pooled_output_expert_pub)
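Note that the number of published papers differs from expert to expert, and we need to use all of an expert's papers.
Finally, we combine the interests embedding with the pub_info embeddings. Here we simply use torch.mean() to merge the two kinds of embedding into the final expert embedding.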
# merge interests_embed and pub_info_embed
# here, we use torch.mean() for merge
# check
if pooled_output_expert_interests is None:
    if len(pub_info_embed) != 0:
        pooled_output_expert_cat = torch.cat(pub_info_embed)
    else:
        pooled_output_expert_cat = None
else:
    if len(pub_info_embed) != 0:
        pooled_output_expert_cat = torch.cat((pooled_output_expert_interests, torch.cat(pub_info_embed)), 0)
    else:
        pooled_output_expert_cat = pooled_output_expert_interests
pooled_output_expert_final = torch.mean(pooled_output_expert_cat, 0).view(1, config["output_dim"])
Participants are welcome to improve this merge step to obtain a better expert embedding.
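One possible direction, offered only as a suggestion: with a plain mean over all rows, the interests vector of an expert with n papers gets weight 1/(n+1), so prolific experts are described almost entirely by their publications. The sketch below averages the publication vectors first and then blends the result with the interests vector. The function names and the alpha weight are hypothetical, and plain Python lists stand in for tensors to keep the example dependency-free.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def merge_expert(interests_vec, pub_vecs, alpha=0.5):
    """Blend the interests vector with the mean publication vector.

    alpha is a hypothetical weight on the interests embedding; when one
    side is missing, the other side is returned unchanged.
    """
    if interests_vec is None and not pub_vecs:
        return None
    if interests_vec is None:
        return mean_vector(pub_vecs)
    if not pub_vecs:
        return list(interests_vec)
    pub_mean = mean_vector(pub_vecs)
    return [alpha * a + (1 - alpha) * b for a, b in zip(interests_vec, pub_mean)]


interests = [1.0, 0.0]
pubs = [[0.0, 1.0], [0.0, 3.0]]
merged = merge_expert(interests, pubs)  # [0.5, 1.0]
```

Whether a fixed blend actually beats the plain mean is an empirical question to check on the offline valid set.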
VI. Loss and Training (finetune)
With the embeddings in hand, we need to define the loss used for training. We use the infonce loss, which takes the following form:
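In its standard form, which is what the code further below computes, the loss over a batch of B pairs with K negatives each can be written as follows. The notation is assumed here for illustration: q_i is the L2-normalized paper embedding, k_i^+ the normalized positive expert embedding, k_{i,j}^- the j-th normalized negative expert embedding, and T the temperature (0.07 in the code).

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\frac{1}{B} \sum_{i=1}^{B}
    \log \frac{\exp\left(q_i \cdot k_i^{+} / T\right)}
              {\exp\left(q_i \cdot k_i^{+} / T\right)
               + \sum_{j=1}^{K} \exp\left(q_i \cdot k_{i,j}^{-} / T\right)}
```

This is a (K+1)-way softmax cross-entropy in which the positive expert is always the correct class, which is why the code uses CrossEntropyLoss with all labels set to 0.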
Participants can study the infonce loss in more detail on their own; we do not go into it further here.
import torch
import torch.nn as nn
import torch.nn.functional as F

# infoNCE loss
class infoNCE(nn.Module):
    def __init__(self):
        super(infoNCE, self).__init__()
        self.T = 0.07
        self.cross_entropy = nn.CrossEntropyLoss().cuda()
    def forward(self, query, pos, neg):
        # batch size and embedding dim are read from the query tensor;
        # neg holds K negatives per query
        batch_size, dim = query.shape[0], query.shape[-1]
        query = F.normalize(query.view(batch_size, 1, dim), p=2, dim=2)
        pos = F.normalize(pos.view(batch_size, 1, dim), p=2, dim=2)
        neg = F.normalize(neg.view(batch_size, -1, dim), p=2, dim=2)
        pos_score = torch.bmm(query, pos.transpose(1, 2))  # B*1*1
        neg_score = torch.bmm(query, neg.transpose(1, 2))  # B*1*K
        # logits: B*(K+1), the positive is always at index 0
        logits = torch.cat([pos_score, neg_score], dim=2).squeeze(1)
        logits /= self.T
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
        info_loss = self.cross_entropy(logits, labels)
        return info_loss
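To get a feel for how the loss behaves, the same computation can be reproduced for a single triple without any deep learning framework. This is a toy sketch: the helper names and the 2-dimensional vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(query, pos, negs, temperature=0.07):
    """InfoNCE loss for one (query, positive, negatives) triple."""
    q = normalize(query)
    candidates = [normalize(pos)] + [normalize(n) for n in negs]
    logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in candidates]
    # cross-entropy with the positive at index 0, via a stable log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# positive close to the query: small loss
good = info_nce([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# a negative is closer to the query than the positive: large loss
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.1], [-1.0, 0.0]])
```

With the small temperature of 0.07 the cosine similarities are scaled up roughly 14 times, so even a modest similarity gap between the positive and the negatives changes the loss sharply.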
Note: we add an MLP that applies a nonlinear transformation to the embeddings produced by oag-bert, which makes finetuning work better.
class MLP(nn.Module):
    def __init__(self, in_dim):
        super(MLP, self).__init__()
        self.projection = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(),
            nn.Linear(in_dim, in_dim))
    def forward(self, x):
        x = self.projection(x)
        return x
The code for the whole training process:
import random
from time import time

# infoNCE loss
criterion = infoNCE().cuda()
optimizer = torch.optim.Adam([{'params': model.parameters()}, {'params': projection.parameters()}], lr=config["learning_rate"])
model.train()
projection.train()
best_mrr = -1
patience = 0
# finetuning
for epoch in range(epochs):
    random.shuffle(train_batches)
    batch_loss = []
    batch_num = 0
    for batch in train_batches:
        batch_num += 1
        # get anchor pos neg
        anchor, pos, neg = data_loader.generate_batch_data(paper_infos, experts_infos, batch, config["neg_num"])
        anchor_tokens = data_loader.get_batch_tokens(anchor, "anchor")
        pos_tokens = data_loader.get_batch_tokens(pos, "pos")
        neg_tokens = data_loader.get_batch_tokens(neg, "neg")
        anchor_emb = get_batch_embed(anchor_tokens, "anchor")
        pos_emb = get_batch_embed(pos_tokens, "pos")
        neg_emb = get_batch_embed(neg_tokens, "neg")
        # add MLP
        # infoNCE loss
        loss = criterion(projection(anchor_emb), projection(pos_emb), projection(neg_emb))
        print("loss...", loss.item())
        # compute gradient and do Adam step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch_num > 1 and batch_num % valid_step == 0:
            print("evaluate ")
            t_valid = time()
            mrr = evaluate(model, valid_batches, data_loader, paper_infos, experts_infos)
            print("time for valid ", time() - t_valid)
            print("Epoch:{} batch:{} loss:{} mrr:{}".format(epoch, batch_num, loss.item(), mrr))
            if mrr > best_mrr:
                best_mrr = mrr
                # save model
                torch.save(model.state_dict(), output_dir + "oagbert")
                print("Best Epoch:{} batch:{} loss:{} mrr:{}".format(epoch, batch_num, loss.item(), mrr))
            else:
                patience += 1
                if patience > config["patience"]:
                    print("Best Epoch:{} batch:{} loss:{} mrr:{}".format(epoch, batch_num, loss.item(), mrr))
            # evaluate() switches to eval mode, so switch back to training mode
            model.train()
            projection.train()
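The listing keeps a patience counter but does not show what happens once it runs out. One common convention, shown here as an assumption rather than as the baseline's behavior, is to reset the counter on every improvement and stop once it exceeds the limit. The class name EarlyStopper and the MRR values are hypothetical.

```python
class EarlyStopper:
    """Track the best validation MRR and signal when to stop."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_rounds = 0

    def update(self, mrr):
        """Return (improved, should_stop) for a new validation score."""
        if mrr > self.best:
            self.best = mrr
            self.bad_rounds = 0  # assumption: reset on improvement
            return True, False
        self.bad_rounds += 1
        return False, self.bad_rounds > self.patience


stopper = EarlyStopper(patience=2)
# made-up validation scores, one per validation round
history = [stopper.update(m) for m in [0.10, 0.12, 0.11, 0.12, 0.09]]
```

In a training loop, the first flag would trigger saving the model and the second would break out of both the batch loop and the epoch loop.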
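VII. Test/valid
Finally, we validate and test on valid/test: valid is used to tune the model's hyperparameters, and test measures the model's generalization. All of this is offline testing. In the end, we save the best-performing model parameters and then run the online valid and test.
We use mrr as the evaluation metric here; participants may of course use the metric specified by the competition.
Note: running a full recall over all experts for every paper at test time is too slow, so we randomly sample 100 negative samples and rank them together with the positive example from test/valid.
To cut test time further, we use two test modes:
1. Each batch shares 100 negs.
2. The whole test shares 100 negs.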
import numpy as np
import torch
import torch.nn.functional as F

def evaluate(model, valid_batches, data_loader, paper_infos, experts_infos):
    model.eval()
    mrr = 0.0
    total_count = 0
    with torch.no_grad():
        # mode 2: the whole test shares the same negs
        negs = data_loader.generate_negs_data(paper_infos, experts_infos, config["Negs"])
        neg_tokens = data_loader.get_batch_tokens(negs, "neg")
        neg_emb_candidates = get_batch_embed(neg_tokens, "neg")
        for batch in valid_batches:
            anchor, pos, _ = data_loader.generate_batch_data_test(paper_infos, experts_infos, batch, config["Negs"])
            # use too much time
            anchor_tokens = data_loader.get_batch_tokens(anchor, "anchor")
            pos_tokens = data_loader.get_batch_tokens(pos, "pos")
            # use too much time
            anchor_emb = get_batch_embed(anchor_tokens, "anchor")
            pos_emb = get_batch_embed(pos_tokens, "pos")
            neg_emb = neg_emb_candidates.repeat(len(batch), 1)
            # mode 1 (commented out): each batch shares its own negs
            # for batch in valid_batches:
            #     anchor, pos, negs = data_loader.generate_batch_data_test(paper_infos, experts_infos, batch, config["Negs"])
            #     # use too much time
            #     tt = time()
            #     anchor_tokens = data_loader.get_batch_tokens(anchor, "anchor")
            #     print("time...", time() - tt)
            #     tt = time()
            #     pos_tokens = data_loader.get_batch_tokens(pos, "pos")
            #     print("time...", time() - tt)
            #     tt = time()
            #     neg_tokens = data_loader.get_batch_tokens(negs, "neg")
            #     print("time...", time() - tt)
            #     # use too much time
            #     tt = time()
            #     anchor_emb = get_batch_embed(anchor_tokens, "anchor")
            #     print("time...", time() - tt)
            #     tt = time()
            #     pos_emb = get_batch_embed(pos_tokens, "pos")
            #     print("time...", time() - tt)
            #     tt = time()
            #     neg_emb_candidates = get_batch_embed(neg_tokens, "neg")
            #     print("time...", time() - tt)
            #     neg_emb = neg_emb_candidates.repeat(len(batch), 1)
            # anchor & pos_embed
            dim = anchor_emb.shape[-1]
            anchor_emb = F.normalize(anchor_emb.view(-1, 1, dim), p=2, dim=2)
            pos_emb = F.normalize(pos_emb.view(-1, 1, dim), p=2, dim=2)
            neg_emb = F.normalize(neg_emb.view(-1, config["Negs"], dim), p=2, dim=2)
            pos_score = torch.bmm(anchor_emb, pos_emb.transpose(1, 2))  # B*1*1
            neg_score = torch.bmm(anchor_emb, neg_emb.transpose(1, 2))  # B*1*Negs
            # logits: B*(1+Negs), the positive is always at index 0
            logits = torch.cat([pos_score, neg_score], dim=2).squeeze(1)
            logits = logits.cpu().numpy()
            # iterate over the actual batch size (the last batch may be smaller)
            for i in range(logits.shape[0]):
                total_count += 1
                logits_single = logits[i]
                rank = np.argsort(-logits_single)
                # position of the positive (index 0) in the descending ranking
                true_index = np.where(rank == 0)[0][0]
                mrr += np.divide(1.0, true_index + 1)
    mrr /= total_count
    return mrr
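The metric itself is easy to check by hand. The sketch below computes the mean reciprocal rank for a few made-up score lists in which, as in the code above, the positive always sits at index 0. The function names are illustrative, and ties are resolved in favor of the positive here, whereas np.argsort breaks them arbitrarily.

```python
def reciprocal_rank(scores):
    """Reciprocal rank of the positive, which is stored at index 0."""
    pos_score = scores[0]
    # rank = 1 + number of candidates scoring strictly higher than the positive
    rank = 1 + sum(1 for s in scores[1:] if s > pos_score)
    return 1.0 / rank


def mean_reciprocal_rank(batch_scores):
    """Average the reciprocal rank over all score lists."""
    return sum(reciprocal_rank(s) for s in batch_scores) / len(batch_scores)


batch_scores = [
    [0.9, 0.1, 0.3],       # positive ranked 1st -> 1/1
    [0.2, 0.8, 0.1],       # positive ranked 2nd -> 1/2
    [0.1, 0.5, 0.4, 0.3],  # positive ranked 4th -> 1/4
]
mrr = mean_reciprocal_rank(batch_scores)  # (1 + 0.5 + 0.25) / 3
```

Since the offline ranking uses only 100 sampled negatives instead of all experts, the offline mrr will generally be higher than what a full top-500 recall would give, so it is best used for comparing models against each other.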
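We recommend joining the 【2021未来杯人工智能赛交流群】 group chat, where you can get answers to competition questions, find teammates, and talk with other participants. (If the group has reached 200 members, add the assistant on WeChat, ID: hhming98, and he will invite you into the group.)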