[LLaVA Series] Notes on CLIP / LLaVA / LLaVA 1.5 / VILA

Author | DefTruth

Source | https://zhuanlan.zhihu.com/p/683137074

Editor | GiantPandaCV

0x00 Preface

This post records the key points of CLIP and the LLaVA family of models so that I can look them up easily later.

0x01 CLIP model architecture

paper: https://arxiv.org/pdf/2103.00020.pdf

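CLIP is a two-tower model made up of a Text Encoder and an Image Encoder. The training data comes as (image, text) pairs; in each correctly matched pair, the text is an accurate one-sentence description of the image. CLIP is trained to predict which (image, text) pairs belong together: a matching pair is labeled 1 and a non-matching pair 0.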

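Text Encoder: encodes each sentence into a latent vector of shape (1, 512); N sentences therefore give N such vectors, i.e. a tensor of shape [N, 512].
Image Encoder: encodes each image into a latent vector of shape (1, 512); N images therefore give N such vectors, i.e. a tensor of shape [N, 512].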

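Because the Text Encoder and the Image Encoder both output an [N, 512] tensor, it is easy to compute the similarity between every image and every text. CLIP can use either ResNet or ViT as its backbone; the experiments show that ViT works better than ResNet.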
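The pairwise similarity step can be sketched in a few lines of NumPy. This is only an illustration: the random arrays stand in for real encoder outputs, and the names `image_emb`, `text_emb` and `l2_normalize` are made up for this sketch.

```python
import numpy as np

def l2_normalize(x, axis=1):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 512  # N pairs, 512-dim embeddings as in the text

# Random stand-ins for the Image Encoder and Text Encoder outputs
image_emb = l2_normalize(rng.normal(size=(n, d)))
text_emb = l2_normalize(rng.normal(size=(n, d)))

# similarity[i, j] = cosine similarity between image i and text j
similarity = image_emb @ text_emb.T
print(similarity.shape)  # (4, 4)
```

For a trained model, the diagonal entries (the matched pairs) should be the largest in their row and column; with random inputs, as here, they are not.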

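0x02 CLIP loss function

CLIP uses a symmetric loss. Put simply, it computes a cross-entropy loss over the similarity matrix along the rows and along the columns, then averages the two.

The pseudocode is as follows: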

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
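The pseudocode above leaves `l2_normalize` and `cross_entropy_loss` undefined. Below is a minimal runnable sketch in NumPy, assuming the encoder outputs are already available as arrays. The helper names and the toy inputs are my own, not from the paper.

```python
import numpy as np

def l2_normalize(x, axis=1):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy with a softmax over each row of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clip_symmetric_loss(image_feat, text_feat, w_i, w_t, t):
    """Symmetric contrastive loss over an [n, n] similarity matrix."""
    i_e = l2_normalize(image_feat @ w_i)      # [n, d_e]
    t_e = l2_normalize(text_feat @ w_t)       # [n, d_e]
    logits = (i_e @ t_e.T) * np.exp(t)        # scaled cosine similarities
    labels = np.arange(logits.shape[0])       # pair i matches pair i
    loss_rows = cross_entropy(logits, labels)     # softmax over texts, per image
    loss_cols = cross_entropy(logits.T, labels)   # softmax over images, per text
    return (loss_rows + loss_cols) / 2

eye = np.eye(4)
# Toy case 1: perfectly aligned pairs, so the loss is close to 0
aligned_loss = clip_symmetric_loss(eye, eye, eye, eye, np.log(100.0))
# Toy case 2: all texts identical, so every logit is equal and the loss is log(4)
uniform_loss = clip_symmetric_loss(eye, np.ones((4, 4)), eye, eye, np.log(100.0))
print(aligned_loss, uniform_loss)
```

The two toy cases bracket the behavior: the loss goes to zero as the diagonal dominates, and sits at log(n) when the model cannot tell the pairs apart.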

0x03 CLIP in practice

Let's check this understanding with code. First install CLIP, following the official CLIP documentation.

$ conda install --yes -c pytorch pytorch torchvision cudatoolkit
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Test script:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    print("image_features shape:", image_features.shape) # [1, 512]
    text_features = model.encode_text(text)
    print("text_features shape:", text_features.shape) # [3, 512]

    logits_per_image, logits_per_text = model(image, text)
    print("logits_per_image shape:", logits_per_image.shape) # [1, 3]
    print("logits_per_text shape:", logits_per_text.shape) # [3, 1]
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]
print("      Label: {}".format(["a diagram", "a dog", "a cat"]))
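To see where the [1, 3] and [3, 1] logits come from, here is a NumPy-only sketch of the same computation. Random arrays stand in for the encoder outputs, and the scale factor of 100 is an assumed value for illustration (in the real model it is a learned parameter), so the probabilities here mean nothing.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
image_features = rng.normal(size=(1, 512))  # stand-in for model.encode_image(image)
text_features = rng.normal(size=(3, 512))   # stand-in for model.encode_text(text)

# Normalize so the dot product is a cosine similarity
image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

logit_scale = 100.0  # assumption: a fixed stand-in for the learned temperature
logits_per_image = logit_scale * image_features @ text_features.T  # [1, 3]
logits_per_text = logits_per_image.T                               # [3, 1]

probs = softmax(logits_per_image, axis=-1)  # one probability per candidate text
print(logits_per_image.shape, logits_per_text.shape, probs.sum())
```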

0x04 LLaVA model architecture

paper: https://arxiv.org/pdf/2304.08485.pdf

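LLaVA's architecture is very simple: it is just CLIP plus an LLM (Vicuna, which follows the LLaMA architecture). The Vision Encoder turns an image into a feature map of shape [N=1, grid_H x grid_W, hidden_dim], which then goes through a projection layer, Projection W, that aligns the dimension of the image features with that of the text features. After the projection the shape is [N=1, grid_H x grid_W = image_seqlen, emb_dim]. The image token embeddings and the text token embeddings are then merged and fed to the language model, which generates the output text.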
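The shape bookkeeping is easy to follow in a small NumPy sketch. All sizes below are hypothetical and kept tiny for illustration, and placing the image tokens in front of the text tokens is an assumption made for simplicity, not something stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen small for illustration only
grid_h, grid_w = 4, 4   # patch grid produced by the vision encoder
hidden_dim = 32         # width of the vision features
emb_dim = 64            # width of the LLM word embeddings
text_len = 5            # number of text tokens in the prompt

# Vision Encoder output: [N=1, grid_H x grid_W, hidden_dim]
image_features = rng.normal(size=(1, grid_h * grid_w, hidden_dim))

# Projection W maps hidden_dim -> emb_dim
W = rng.normal(size=(hidden_dim, emb_dim)) * 0.02
image_tokens = image_features @ W  # [1, image_seqlen, emb_dim]

# Stand-in for the embedded text prompt: [1, text_len, emb_dim]
text_tokens = rng.normal(size=(1, text_len, emb_dim))

# Merge along the sequence axis to form the LLM input
llm_inputs = np.concatenate([image_tokens, text_tokens], axis=1)
print(llm_inputs.shape)  # (1, 21, 64)
```

Once projected, an image is just a run of extra tokens in the sequence, which is why no change to the language model itself is needed.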

0x05 How LLaVA uses CLIP

In LLaVA the Vision Encoder is CLIP-ViT-L/14. Note that LLaVA takes the grid features from before or after the last Transformer layer as the image representation, not CLIP's final output layer.
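A rough sketch of what "grid features before or after the last Transformer layer" means in terms of tensors. It assumes the encoder exposes one hidden state per layer and that the first token is a class token that gets dropped. Both are assumptions of this sketch, and the hidden size is shrunk to keep the example light.

```python
import numpy as np

def num_patches(image_size, patch_size):
    """Number of patch tokens in a square image."""
    return (image_size // patch_size) ** 2

rng = np.random.default_rng(0)

n_patches = num_patches(224, 14)  # ViT-L/14 at 224px: a 16 x 16 grid
num_layers = 24                   # ViT-L depth
hidden_dim = 8                    # hypothetical; the real ViT-L width is 1024

# Stand-in hidden states: embeddings plus one entry per Transformer layer,
# each of shape [N=1, 1 class token + n_patches, hidden_dim]
hidden_states = [
    rng.normal(size=(1, 1 + n_patches, hidden_dim))
    for _ in range(num_layers + 1)
]

select_layer = -2  # -2: before the last layer; -1: after the last layer
grid_features = hidden_states[select_layer][:, 1:, :]  # drop the class token
print(grid_features.shape)  # (1, 256, 8)
```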

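0x06 LLaVA two-stage training

Stage 1: feature-alignment pre-training. The features extracted by CLIP do not live in the same semantic space as the word embeddings, so pre-training is needed to align the image token embeddings with the semantic space of the text word embeddings. In this stage the weights of the Vision Encoder and the LLM are frozen, and only the weights of the projection layer, Projection W, are trained.

Stage 2: end-to-end training. The Vision Encoder's weights stay frozen, while the projection layer Projection W and the LLM are both updated. Training targets two typical tasks: Multimodal Chatbot and Science QA.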
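The freeze/train split of the two stages can be written down as a small config. This is a framework-free sketch: `StageConfig` and `trainable_modules` are hypothetical names, and in a real training script the flags would drive something like per-parameter gradient switches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Which parts of the model receive gradient updates in a stage."""
    name: str
    train_vision_encoder: bool
    train_projection: bool
    train_llm: bool

STAGES = [
    # Stage 1: only Projection W is trained
    StageConfig("feature-alignment pre-training", False, True, False),
    # Stage 2: Projection W and the LLM are trained; the vision encoder stays frozen
    StageConfig("end-to-end training", False, True, True),
]

def trainable_modules(stage):
    """Return the names of the modules that are updated in this stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "projection": stage.train_projection,
        "llm": stage.train_llm,
    }
    return [name for name, enabled in flags.items() if enabled]

for stage in STAGES:
    print(stage.name, "->", trainable_modules(stage))
```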

Conclusion from the experiments: the results show LLaVA outperforming BLIP-2 on conversation, detailed description, and complex reasoning.

0x07 LLaVA 1.5

paper: https://arxiv.org/pdf/2310.03744.pdf

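LLaVA 1.5 keeps essentially the same architecture as LLaVA and changes the LLM and the projection layer, yet this is where the results start to get seriously impressive.

LLM: the language model is upgraded to Vicuna v1.5 13B; more parameters, better results.
Connector: i.e. the projection layer, changed from a single linear layer to an MLP (several stacked linear layers).
Vision Encoder: the input resolution goes up from 224 to 336, using CLIP ViT-L/336px, which improves understanding of image detail.
Higher-quality data: truly, data is all you need!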
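The connector change and the resolution change can both be seen in a short NumPy sketch. It assumes a two-layer MLP with a GELU in between and a patch size of 14. The layer count, activation and all sizes except the 336px resolution are assumptions of this sketch.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_connector(x, W):
    """LLaVA-style connector: a single linear projection."""
    return x @ W

def mlp_connector(x, W1, b1, W2, b2):
    """LLaVA 1.5-style connector, sketched as two linear layers with GELU."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_tokens = (336 // 14) ** 2   # 24 x 24 patch tokens at 336px with patch size 14
hidden_dim, emb_dim = 16, 32  # hypothetical small widths

x = rng.normal(size=(1, n_tokens, hidden_dim))  # stand-in vision features
W = rng.normal(size=(hidden_dim, emb_dim)) * 0.02
W1 = rng.normal(size=(hidden_dim, emb_dim)) * 0.02
b1 = np.zeros(emb_dim)
W2 = rng.normal(size=(emb_dim, emb_dim)) * 0.02
b2 = np.zeros(emb_dim)

linear_out = linear_connector(x, W)
mlp_out = mlp_connector(x, W1, b1, W2, b2)
print(n_tokens, linear_out.shape, mlp_out.shape)  # 576 (1, 576, 32) (1, 576, 32)
```

Both connectors produce the same output shape, so the swap is invisible to the LLM. The higher resolution, on the other hand, raises the image token count from 256 to 576 per image.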

Worth noting here is the radar chart in the LLaVA 1.5 paper: later models in the LLaVA family mostly use it as their baseline, and the race was on...

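0x08 Impact on OCR tasks

LLaVA has in-context learning and zero-shot multilingual capability. Take OCR: unlike earlier deep-learning OCR algorithms, which require a model trained specifically for OCR, LLaVA can be applied to OCR directly. Given a suitable prompt, it extracts the text from an image (more generally, any general-purpose multimodal image-to-text model has this ability).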