A Big Collection of Modified Attention Mechanisms

Editor's Note (极市导读)
How can attention be made more efficient? This article rounds up the relevant papers and collects their citation counts and open-source implementations for side-by-side comparison.
Efficient Attention
| Paper (citations) | Implementation |
|---|---|
| Generating Wikipedia by Summarizing Long Sequences[1] (208) | memory-compressed-attention[2] |
| CBAM: Convolutional Block Attention Module[3] (677) | attention-module[4] |
| CCNet: Criss-Cross Attention for Semantic Segmentation[5] (149) | CCNet[6] |
| Efficient Attention: Attention with Linear Complexities[7] (2) | efficient-attention[8] |
| Star-Transformer[9] (24) | fastNLP[10] |
| Generating Long Sequences with Sparse Transformers[11] (139) | torch-blocksparse[12] |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond[13] (96) | GCNet[14] |
| SCRAM: Spatially Coherent Randomized Attention Maps[15] (1) | - |
| Interlaced Sparse Self-Attention for Semantic Segmentation[16] (13) | IN_PAPER |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks[17] (2) | Permutohedral_attention_module[18] |
| Large Memory Layers with Product Keys[19] (28) | XLM[20] |
| Expectation-Maximization Attention Networks for Semantic Segmentation[21] (38) | EMANet[22] |
| Compressive Transformers for Long-Range Sequence Modelling[23] (20) | compressive-transformer-pytorch[24] |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning[25] (8) | BPT[26] |
| Axial Attention in Multidimensional Transformers[27] (5) | axial-attention[28] |
| Reformer: The Efficient Transformer[29] (69) | trax[30] |
| Transformer on a Diet[31] (2) | transformer-on-diet[32] |
| Sparse Sinkhorn Attention[33] (4) | sinkhorn-transformer[34] |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection[35] (1) | - |
| Efficient Content-Based Sparse Attention with Routing Transformers[36] (11) | routing-transformer[37] |
| Longformer: The Long-Document Transformer[38] (15) | longformer[39] |
| Neural Architecture Search for Lightweight Non-Local Networks[40] (2) | AutoNL[41] |
| ETC: Encoding Long and Structured Data in Transformers[42] (2) | - |
| Multi-scale Transformer Language Models[43] (1) | IN_PAPER |
| Synthesizer: Rethinking Self-Attention in Transformer Models[44] (5) | - |
| Jukebox: A Generative Model for Music[45] (9) | jukebox[46] |
| GMAT: Global Memory Augmentation for Transformers[47] (0) | gmat[48] |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers[49] (0) | google-research[50] |
| Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer[51] (0) | - |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention[52] (1) | fast-transformers[53] |
| Linformer: Self-Attention with Linear Complexity[54] (3) | linformer-pytorch[55] |
| Real-time Semantic Segmentation with Fast Attention[56] (0) | - |
| Fast Transformers with Clustered Attention[57] (0) | fast-transformers[58] |
| Big Bird: Transformers for Longer Sequences[59] (0) | - |
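Several entries above (Efficient Attention[7], Transformers are RNNs[52], and the Performer-style protein paper[49]) share one idea: replace the softmax attention map with a kernel feature map φ so the (QKᵀ)V product can be reassociated as Q(KᵀV), turning O(N²) attention into O(N). Below is a minimal non-causal NumPy sketch using the elu(x)+1 feature map from "Transformers are RNNs"; the function name, shapes, and `eps` normalizer are illustrative choices, not code from any of the linked repositories.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention: O(N * d^2) instead of O(N^2 * d)."""
    # Feature map phi(x) = elu(x) + 1: x+1 for x > 0, exp(x) otherwise.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    # Reassociate (Qp @ Kp.T) @ V into Qp @ (Kp.T @ V): the N x N
    # attention matrix is never materialized.
    KV = Kp.T @ V                       # (d, d_v)
    Z = Qp @ Kp.sum(axis=0) + eps       # (N,) per-row normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (128, 16)
```

Because matrix multiplication is associative, this returns exactly what you would get by building the N×N map `phi(Q) @ phi(K).T`, normalizing its rows, and multiplying by V; only the cost changes.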
Generating Wikipedia by Summarizing Long Sequences: https://arxiv.org/abs/1801.10198
memory-compressed-attention: https://github.com/lucidrains/memory-compressed-attention
CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521v2
attention-module: https://github.com/Jongchan/attention-module
CCNet: Criss-Cross Attention for Semantic Segmentation: https://arxiv.org/abs/1811.11721v2
CCNet: https://github.com/speedinghzl/CCNet
Efficient Attention: Attention with Linear Complexities: https://arxiv.org/abs/1812.01243v8
Star-Transformer: https://arxiv.org/abs/1902.09113v2
fastNLP: https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py
Generating Long Sequences with Sparse Transformers: https://arxiv.org/abs/1904.10509v1
torch-blocksparse: https://github.com/ptillet/torch-blocksparse
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond: https://arxiv.org/abs/1904.11492v1
GCNet: https://github.com/xvjiarui/GCNet
SCRAM: Spatially Coherent Randomized Attention Maps: https://arxiv.org/abs/1905.10308v1
Interlaced Sparse Self-Attention for Semantic Segmentation: https://arxiv.org/abs/1907.12273v2
Permutohedral Attention Module for Efficient Non-Local Neural Networks: https://arxiv.org/abs/1907.00641v2
Permutohedral_attention_module: https://github.com/SamuelJoutard/Permutohedral_attention_module
Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v2
XLM: https://github.com/facebookresearch/XLM
Expectation-Maximization Attention Networks for Semantic Segmentation: https://arxiv.org/abs/1907.13426v2
EMANet: https://github.com/XiaLiPKU/EMANet
Compressive Transformers for Long-Range Sequence Modelling: https://arxiv.org/abs/1911.05507v1
compressive-transformer-pytorch: https://github.com/lucidrains/compressive-transformer-pytorch
BP-Transformer: Modelling Long-Range Context via Binary Partitioning: https://arxiv.org/abs/1911.04070v1
BPT: https://github.com/yzh119/BPT
Axial Attention in Multidimensional Transformers: https://arxiv.org/abs/1912.12180v1
axial-attention: https://github.com/lucidrains/axial-attention
Reformer: The Efficient Transformer: https://arxiv.org/abs/2001.04451v2
trax: https://github.com/google/trax/tree/master/trax/models/reformer
Transformer on a Diet: https://arxiv.org/abs/2002.06170v1
transformer-on-diet: https://github.com/cgraywang/transformer-on-diet
Sparse Sinkhorn Attention: https://arxiv.org/abs/2002.11296v1
sinkhorn-transformer: https://github.com/lucidrains/sinkhorn-transformer
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection: https://arxiv.org/abs/2003.09833v2
Efficient Content-Based Sparse Attention with Routing Transformers: https://arxiv.org/abs/2003.05997v1
routing-transformer: https://github.com/lucidrains/routing-transformer
Longformer: The Long-Document Transformer: https://arxiv.org/abs/2004.05150v1
longformer: https://github.com/allenai/longformer
Neural Architecture Search for Lightweight Non-Local Networks: https://arxiv.org/abs/2004.01961v1
AutoNL: https://github.com/LiYingwei/AutoNL
ETC: Encoding Long and Structured Data in Transformers: https://arxiv.org/abs/2004.08483v2
Multi-scale Transformer Language Models: https://arxiv.org/abs/2005.00581v1
Synthesizer: Rethinking Self-Attention in Transformer Models: https://arxiv.org/abs/2005.00743v1
Jukebox: A Generative Model for Music: https://arxiv.org/abs/2005.00341v1
jukebox: https://github.com/openai/jukebox
GMAT: Global Memory Augmentation for Transformers: https://arxiv.org/abs/2006.03274v1
gmat: https://github.com/ag1988/gmat
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers: https://arxiv.org/abs/2006.03555v1
google-research: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention
Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer: https://arxiv.org/abs/2006.05174v1
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: https://arxiv.org/abs/2006.16236v2
fast-transformers: https://github.com/idiap/fast-transformers
Linformer: Self-Attention with Linear Complexity: https://arxiv.org/abs/2006.04768v3
linformer-pytorch: https://github.com/tatp22/linformer-pytorch
Real-time Semantic Segmentation with Fast Attention: https://arxiv.org/abs/2007.03815v2
Fast Transformers with Clustered Attention: https://arxiv.org/abs/2007.04825v1
fast-transformers: https://github.com/idiap/fast-transformers
Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062v1
A Survey of Long-Term Context in Transformers: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/
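For contrast with the kernel-trick approach, Linformer[54] keeps the softmax but projects the length-N key and value sequences down to a fixed length k with learned matrices E and F, so the attention map is N×k rather than N×N. A rough NumPy sketch, with random projections standing in for the learned E and F (all names and shapes here are illustrative, not taken from linformer-pytorch):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Softmax attention over length-k projected keys/values (k << N)."""
    Kp = E @ K                                 # (k, d) projected keys
    Vp = F @ V                                 # (k, d) projected values
    scores = Q @ Kp.T / np.sqrt(Q.shape[-1])   # (N, k), not (N, N)
    return softmax(scores) @ Vp

rng = np.random.default_rng(1)
N, k, d = 256, 32, 16
Q, K, V = rng.normal(size=(3, N, d))
E = rng.normal(size=(k, N)) / np.sqrt(N)       # learned in the real model
F = rng.normal(size=(k, N)) / np.sqrt(N)
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (256, 16)
```

The paper's argument is that the softmax attention matrix is approximately low-rank, so a small fixed k loses little; memory and time then scale linearly in sequence length.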