LLM与多模态必读论文100篇-技术圈

为了写本ChatGPT笔记，过去两个月翻了大量中英文资料/paper(中间一度花了大量时间去深入RL)，大部分时间读的更多是中文资料。

2月最后几天读的更多是英文paper，正是2月底这最后几天对ChatGPT背后技术原理的研究才真正进入状态(后还组建了一个“ChatGPT之100篇论文阅读组”，我和10来位博士、业界大佬从23年2.27日起读完ChatGPT相关技术的100篇论文，如果你想加入100篇论文阅读组，可以下方扫码加入

↓↓↓扫码抢购↓↓↓

读的论文越多，你会发现大部分人对ChatGPT的技术解读都是不够准确或全面的，毕竟很多人没有那个工作需要或研究需要，去深入了解各种细节。

因为半年内100篇这个任务，让自己有史以来一篇一篇一行一行读100篇，之前看的比较散不系统抠的也不细，比如回顾“Attention is all you need”这篇后，对优化博客内的Transformer笔记便有了很多心得。总之，读的论文越多，博客内相关笔记的质量将飞速提升自己的技术研究能力也能有巨大飞跃。考虑到为避免上篇文章篇幅太长而影响完读率，故把这100篇(后增至150篇)论文的清单抽取出来独立成本文： 第一部分 OpenAI/Google的基础语言大模型(11篇，总11篇)

Improving Language Understanding by Generative Pre-Training

GPT原始论文

Language Models are Unsupervised Multitask Learners

GPT2原始论文

Language Models are Few-Shot Learners GPT3原始论文

Training language models to follow instructions with human feedback InstructGPT原始论文

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer 19年10月，Google发布T5模型(transfer text to text transformer)，虽也基于transformer，但区别于BERT的编码器架构与GPT的解码器架构，T5是transformer的encoder-decoder架构，这是解读之一的用的750G的训练数据，其训练方法则为：BERT-style的MASK法/replace span(小段替换)/Drop法，以及类似BERT对文本的15%做破坏、且replace span时对3的小段破坏

LaMDA: Language Models for Dialog Applications 论文发布于22年1月，显示LaMDA的参数高达137B，用的transformer decoder架构，这是简要解读之一 21年5月，Google对外宣布内部正在研发对话模型LaMDA，基于transformer decoder架构，在微调阶段使用58K的对话数据，过程类似真人的对话过程，给定一个Query，比如 How old is Rafael Nadal? ，如果人知道答案，那么直接回答35岁即可，如果不知道，则需要去 Research 一下，借助搜索引擎找到答案，然后再回答35岁

《Finetuned Language Models Are Zero-Shot Learners》 21年9月，Google提出FLAN大模型，其基于LaMDA-PT做Instruction Fine-Tuning FLAN is the instruction-tuned version of LaMDA-PT

PaLM: Scaling Language Modeling with Pathways 22年3月，Google的Barham等人发布了Pathways系统，用于更高效地训练大型模型 Pathways 的愿景 —— 一个很接近人脑的框架：一个模型，可以做多任务，多模态且在做任务时，只是 sparsely activated，只使用一部分的参数 22年4月，Google发布PaLM模型，基于Transformer decoder架构，参数规模最大的版本达到惊人的5400亿参数(8B 62B 540B)，使用multi-query注意力、SwiGLU激活函数以及RoPE位置嵌入，这是翻译之一且在每个Transformer块中使用 "平行 "表述(Wang & Komatsuzaki,2021) 是Google的Pathways架构或OpenAI GPT2/3提出的小样本学习的进一步扩展 PaLM首次展示了Pathways的大规模使用——能够以高效的方式在数千或数万个加速器芯片上训练一个模型具体来说，通过Pathways，PaLM 540B在两个通过数据中心网络连接的TPU v4 Pod上训练，使用模型和数据并行的组合，在每个Pod中使用3072个TPU v4芯片，连接到768台主机，能够有效地将训练扩展到6144个芯片，而不需要使用任何pipeline并行，其效率水平是以前这种规模的模型所不能达到的以前的大多数大型语言模型要么是在单个TPU系统上训练的(比如GLaM by Du等人2021年，LaMDA by Thopilan等人) 要么是使用由Huang等人在2019年提出的pipeline并行，从而在GPU集群(Megatron-Turing NLG 530B by Smith等人2022年)，或多个TPU v3 pod(Gopher by Rae等人2021年)上扩展，最大规模为4096个TPU v3芯片另，在自然语言、代码和数学推理等任务中表现的都很不错此外，预训练数据集由一个7800亿个token组成的语料库，该数据集是由过滤过的网页(占比27%)、书籍(占比13%)、Wikipedia(占比4%)、新闻文章(占比1%)、Github源代码(占比5%，包括Java、HTML、Javascript、Python、PHP、C#、XML、C++和C，总计196GB的源代码)，和社交媒体对话(占比50%)组成的，这个数据集是也用于训练LaMDA和GLaM

Constitutional AI: Harmlessness from AI Feedback OpenAI之前一副总裁离职搞了个ChatGPT的竞品，ChatGPT用人类偏好训练RM再RL(即RLHF)，Claude则基于AI偏好模型训练RM再RL(即RLAIF)

Improving alignment of dialogue agents via targeted human judgements DeepMind的Sparrow，这个工作发表时间稍晚于instructGPT，其大致的技术思路和框架与 instructGPT 的三阶段基本类似，但Sparrow 中把奖励模型分为两个不同 RM 的思路

GPT-4 Technical Report

增加了多模态能力的GPT4的技术报告

第二部分 LLM的关键技术：ICL/CoT/RLHF/词嵌入/位置编码/加速/与KG结合等(38篇，总49篇)

Attention Is All You Need
Transformer原始论文

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
这篇文章则将ICL看作是一种隐式的Fine-tuning

A Survey on In-context Learning

Noisy Channel Language Model Prompting for Few-Shot Text Classification

MetaICL: Learning to Learn In Context

https://github.com/dqxiu/ICL_PaperList in-context learning
研究梳理In-Context Learning到底有没有Learning？

Meta-learning via Language Model In-context Tuning

Evaluating Large Language Models Trained on Code
Codex原始论文

Chain-of-Thought Prompting Elicits Reasoning in Large Language
CoT原始论文，也从侧面印证，instructGPT从22年1月份之前就开始迭代了

Large Language Models are Zero-Shot Reasoners
来自东京大学和谷歌的工作，关于预训练大型语言模型的推理能力的探究，“Let's think step by step”的梗即来源于此篇论文

Emergent Abilities of Large Language Models
Google 22年8月份发的，探讨大语言模型的涌现能力

Multimodal Chain-of-Thought Reasoning in Language Models
23年2月，亚马逊的研究者则在这篇论文里提出了基于多模态思维链技术改进语言模型复杂推理能力的思想

TRPO论文

Proximal Policy Optimization Algorithms
2017年，OpenAI发布的PPO原始论文

RLHF原始论文

Scaling Instruction-Finetuned Language Models
微调PaLM-540B(2022年10月)
从三个方面改变指令微调，一是改变模型参数，提升到了540B，二是增加到了1836个微调任务，三是加上Chain of thought微调的数据

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Fine-Tuning Language Models from Human Preferences

LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
LoRA论文

Prefix-Tuning: Optimizing Continuous Prompts for Generation
新增Prefix Tuning论文

P-Tuning微调论文

Distributed Representations of Sentences and Documents
Mikolov首次提出 Word2vec

Efficient estimation of word representations in vector space
Mikolov专门讲训练 Word2vec 中的两个trick：hierarchical softmax 和 negative sampling

word2vec Explained- Deriving Mikolov et al.’s Negative-Sampling
Word-Embedding Method
Yoav Goldberg关于word2vec的论文，对 negative-sampling 的公式推导非常完备

word2vec Parameter Learning Explained
Xin Rong关于word2vec的论文，非常不错

ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
旋转位置嵌入(RoPE)论文

Linearized Relative Positional Encoding
统一了适用于linear transformer的相对位置编码

SEARCHING FOR ACTIVATION FUNCTIONS
SwiGLU的原始论文

《The Natural Language Decathlon:Multitask Learning as Question Answering》
GPT-1、GPT-2论文的引用文献，Salesforce发表的一篇文章，写出了多任务单模型的根本思想

Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ZeRO是微软deepspeed的核心

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Megatron-LM 论文原始论文

Efficient sequence modeling综述
包含sparse transformer、linear transformer(cosformer，transnormer）RNN(RWKV、S4)，Long Conv(TNN、H3）

Vicuna tackle the memory pressure by utilizing gradient checkpointing and flash attention
Training Deep Nets with Sublinear Memory Cost

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Unifying Large Language Models and Knowledge Graphs: A Roadmap
LLM与知识图谱的结合实战

Fast Transformer Decoding: One Write-Head is All You Need
Muti Query Attention论文，MQA 是 19 年提出的一种新的 Attention 机制，其能够在保证模型效果的同时加快 decoder 生成 token 的速度

GQA: Training Generalized Multi-Query Transformer Models fromMulti-Head Checkpoints
Grouped-Query Attention论文

Flashattention: Fast and memory-efficient exact attention with io-awareness
Flash Attention论文

第三部分 Meta等公司发布的类ChatGPT开源模型和各种微调(7篇，总56篇)

LLaMA: Open and Efficient Foundation Language Models
2023年2月24日Meta发布了全新的65B参数大语言模型LLaMA，开源，大部分任务的效果好于2020年的GPT-3

SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions
23年3月中旬，斯坦福发布Alpaca：只花100美元，人人都可微调Meta家70亿参数的LLaMA大模型，而斯坦福团队微调LLaMA的方法，便是来自华盛顿大学Yizhong Wang等去年底提出的这个Self-Instruct

Alpaca: A Strong Open-Source Instruction-Following Model

Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

GLM: General Language Model Pretraining with Autoregressive Blank Infilling
2022年5月，正式提出了GLM框架

GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL
GLM-130B便是基于的GLM框架的大语言模型

第四部分具备多模态能力的大语言模型(11篇，总67篇)

BEiT: BERT Pre-Training of Image Transformers

BEiT-2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks
2022年8月，微软提出的多模态预训练模型BEiT-3

Language Is Not All You Need: Aligning Perception with Language Models
微软23年3月1日发布的多模态大语言模型Kosmos-1的论文

PaLM-E: An Embodied Multimodal Language Model(论文地址)
Google于23年3月6日发布的关于多模态LLM：PaLM-E，可让能听懂人类指令且具备视觉能力的机器人干活

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
微软于23年3月8日推出visual ChatGPT

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Flamingo: a visual language model for few-shot learning

Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022

Language models are unsupervised multitask learners. 2019

Improving language understanding by generative pre-training. 2018

第五部分 AI绘画与多模态能力背后的核心技术(21篇，总88篇)

End-to-End Object Detection with Transformers
DETR by 2020年5月

AN IMAGE IS WORTH 16X16 WORDS:TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
发表于2020年10月的Vision Transformer原始论文，代表Transformer正式杀入CV界

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
发表于21年3月

Swin Transformer V2: Scaling Up Capacity and Resolution

Auto-Encoding Variational Bayes

Denoising Diffusion Probabilistic Models
2020年6月提出DDPM，即众人口中常说的diffusion model

Diffusion Models Beat GANs on Image Synthesis
使用classifier guidance的方法，引导模型进行采样和生成

High-Resolution Image Synthesis with Latent Diffusion Models
2022年8月发布的Stable Diffusion基于Latent Diffusion Models，专门用于文图生成任务

Aligning Text-to-Image Models using Human Feedback
ChatGPT的主要成功要归结于采用RLHF来精调LLM，近日谷歌AI团队将类似的思路用于文生图大模型：基于人类反馈（Human Feedback）来精调Stable Diffusion模型来提升生成效果

CLIP: Connecting Text and Images - OpenAI
这是针对CLIP论文的解读之一 CLIP由OpenAI在2021年1月发布，超大规模模型预训练提取视觉特征，图片和文本之间的对比学习

Zero-Shot Text-to-Image Generation
DALL·E原始论文

Hierarchical Text-Conditional Image Generation with CLIP Latents
DALL·E 2论文2022年4月发布(至于第一代发布于2021年初)，通过CLIP + Diffusion models，达到文本生成图像新高度

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
23年5月发布的InstructBLIP论文

LAVIS: A Library for Language-Vision Intelligence
Salesforce开源一站式视觉语言学习框架LAVIS，这是其GitHub地址：https://github.com/salesforce/LAVIS

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
对各种多模态模型的评测

Segment Anything
23年4.6日，Meta发布史上首个图像分割基础模型SAM，将NLP领域的prompt范式引进CV，让模型可以通过prompt一键抠图。网友直呼：CV不存在了!

A Comprehensive Survey on Segment Anything Model for Vision and Beyond
对分割一切模型SAM的首篇全面综述：28页、200+篇参考文献

Fast Segment Anything
中科院版的分割一切

MobileSAM
比SAM小60倍，比FastSAM快4倍，速度和效果双赢

第六部分预训练模型的发展演变史(3篇，总91篇)

A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
预训练基础模型的演变史

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

第七部分垂域版类ChatGPT(比如医疗GPT)和其它(10篇，总100篇)

Large Language Models Encode Clinical Knowledge

Towards Expert-Level Medical Question Answering with Large Language Models
继上篇论文提出medpalm之后，5月16日，Google Research和DeepMind发布了Med-PaLM 2，相比第一代最显著的改进是基座模型换成了Google的最新大模型PaLM2(据说有着340b参数，用于训练的token数达3.6万亿)