Do Pre-trained Convolutions Outperform Pre-trained Transformers?
机器学习算法工程师
2021-06-24 11:24
Source | Zhihu    Author | DengBoCong
Link | https://zhuanlan.zhihu.com/p/380195756
We implement a Seq2Seq (Sutskever et al., 2014) architecture similar to (Wu et al., 2019). The key difference when compared with Transformer architectures is that we replace the multi-headed self-attention with convolutional blocks. Instead of query-key-value transforms, we use gated linear unit projections following (Wu et al., 2019).
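The quoted passage describes swapping the Transformer's multi-headed self-attention sublayer for a convolutional block, and replacing the query-key-value projections with gated linear unit (GLU) projections in the style of Wu et al. (2019). Below is a minimal PyTorch sketch of that idea only; the module name `ConvGLUBlock`, the depthwise-convolution choice, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): a convolutional sublayer that
# could stand in for self-attention, using a GLU projection followed by a
# depthwise 1-D convolution, loosely in the spirit of Wu et al. (2019).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGLUBlock(nn.Module):
    """Same input/output shape as a self-attention sublayer: (batch, seq_len, d_model)."""

    def __init__(self, d_model: int, kernel_size: int = 7):
        super().__init__()
        # GLU projection: project to 2*d_model, then gate one half with a sigmoid.
        self.glu_proj = nn.Linear(d_model, 2 * d_model)
        # Depthwise 1-D convolution mixes information along the sequence dimension;
        # padding keeps the sequence length unchanged for odd kernel sizes.
        self.depthwise_conv = nn.Conv1d(
            d_model, d_model,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=d_model,
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = F.glu(self.glu_proj(x), dim=-1)          # gated linear unit projection
        h = h.transpose(1, 2)                        # (batch, d_model, seq_len) for Conv1d
        h = self.depthwise_conv(h).transpose(1, 2)   # back to (batch, seq_len, d_model)
        return self.out_proj(h)


# Usage: drop-in replacement keeps the sublayer interface of self-attention.
x = torch.randn(2, 16, 512)
block = ConvGLUBlock(d_model=512)
print(block(x).shape)  # torch.Size([2, 16, 512])
```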