A Big Roundup of Modified Attention Mechanisms

Editor's note (极市)
How can attention be made more efficient? This article surveys the relevant papers and organizes their citation counts, code implementations, algorithmic complexity, and key ideas for easy side-by-side comparison. A minimal code sketch of one representative approach, linear attention, follows the table.
Efficient Attention
| Paper (citations) | Code | Complexity | Autoregressive | Main Idea |
|---|---|---|---|---|
| Generating Wikipedia by Summarizing Long Sequences[1] (208) | memory-compressed-attention[2] | - | - | - |
| CBAM: Convolutional Block Attention Module[3] (677) | attention-module[4] | - | - | - |
| CCNet: Criss-Cross Attention for Semantic Segmentation[5] (149) | CCNet[6] | - | - | - |
| Efficient Attention: Attention with Linear Complexities[7] (2) | efficient-attention[8] | - | - | - |
| Star-Transformer[9] (24) | fastNLP[10] | - | - | - |
| Generating Long Sequences with Sparse Transformers[11] (139) | torch-blocksparse[12] | - | - | - |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond[13] (96) | GCNet[14] | - | - | - |
| SCRAM: Spatially Coherent Randomized Attention Maps[15] (1) | - | - | - | - |
| Interlaced Sparse Self-Attention for Semantic Segmentation[16] (13) | IN_PAPER | - | - | - |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks[17] (2) | Permutohedral_attention_module[18] | - | - | - |
| Large Memory Layers with Product Keys[19] (28) | XLM[20] | - | - | - |
| Expectation-Maximization Attention Networks for Semantic Segmentation[21] (38) | EMANet[22] | - | - | - |
| Compressive Transformers for Long-Range Sequence Modelling[23] (20) | compressive-transformer-pytorch[24] | - | - | - |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning[25] (8) | BPT[26] | - | - | - |
| Axial Attention in Multidimensional Transformers[27] (5) | axial-attention[28] | - | - | - |
| Reformer: The Efficient Transformer[29] (69) | trax[30] | - | - | - |
| Transformer on a Diet[31] (2) | transformer-on-diet[32] | - | - | - |
| Sparse Sinkhorn Attention[33] (4) | sinkhorn-transformer[34] | - | - | - |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection[35] (1) | - | - | - | - |
| Efficient Content-Based Sparse Attention with Routing Transformers[36] (11) | routing-transformer[37] | - | - | - |
| Longformer: The Long-Document Transformer[38] (15) | longformer[39] | - | - | - |
| Neural Architecture Search for Lightweight Non-Local Networks[40] (2) | AutoNL[41] | - | - | - |
| ETC: Encoding Long and Structured Data in Transformers[42] (2) | - | - | - | - |
| Multi-scale Transformer Language Models[43] (1) | IN_PAPER | - | - | - |
| Synthesizer: Rethinking Self-Attention in Transformer Models[44] (5) | - | - | - | - |
| Jukebox: A Generative Model for Music[45] (9) | jukebox[46] | - | - | - |
| GMAT: Global Memory Augmentation for Transformers[47] (0) | gmat[48] | - | - | - |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers[49] (0) | google-research[50] | - | - | - |
| Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer[51] (0) | - | - | - | - |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention[52] (1) | fast-transformers[53] | - | - | - |
| Linformer: Self-Attention with Linear Complexity[54] (3) | linformer-pytorch[55] | - | - | - |
| Real-time Semantic Segmentation with Fast Attention[56] (0) | - | - | - | - |
| Fast Transformers with Clustered Attention[57] (0) | fast-transformers[58] | - | - | - |
| Big Bird: Transformers for Longer Sequences[59] (0) | - | - | - | - |
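
Most entries in the table fall into one of two families: sparsifying the attention pattern (Sparse Transformers, Longformer, Big Bird) or replacing the softmax with a kernelized/low-rank form so the n×n attention matrix is never materialized (Efficient Attention, Linformer, Transformers are RNNs). The snippet below is a minimal sketch, not code from any of the listed repositories, of the second idea as described in "Transformers are RNNs"[52]: softmax(QKᵀ)V at O(n²·d) cost is replaced by φ(Q)(φ(K)ᵀV) at O(n·d²) cost, with the feature map φ(x) = elu(x) + 1 taken from that paper; shapes, names, and the NumPy implementation are illustrative assumptions.

```python
# Minimal sketch of softmax attention vs. linear (kernelized) attention.
# Not the authors' implementation; NumPy is used only for illustration.
import numpy as np

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: builds an (n, n) matrix, O(n^2 d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                        # (n, d_v)

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized attention: phi(Q) @ (phi(K)^T V), linear in sequence length.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))       # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)                                   # (n, d)
    kv = Kp.T @ V                                             # (d, d_v), computed once
    normalizer = Qp @ Kp.sum(axis=0) + eps                    # (n,)
    return (Qp @ kv) / normalizer[:, None]                    # (n, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 128, 16
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Both paths return an (n, d_v) output; the point of the linear path is that it only ever forms d×d_v and n-sized intermediates, which is why this family of methods scales to long sequences.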