Making LLMs More Compliant with Abliteration, No Retraining Required

Author: 北方的郎

Original article: https://zhuanlan.zhihu.com/p/705672834

This article is a translation of Maxime Labonne's "Uncensor any LLM with abliteration".

Original post: https://mlabonne.github.io/blog/posts/2024-06-04_Uncensor_any_LLM_with_abliteration.html

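The third generation of Llama models includes fine-tuned (Instruct) versions that excel at understanding and following instructions. However, these models are heavily censored: they are designed to refuse requests seen as harmful, with responses such as "As an AI assistant, I cannot help you." While this safety feature is crucial for preventing misuse, it limits the model's flexibility and responsiveness.

In this article, we will explore a technique called "abliteration" that can uncensor any LLM without retraining. The technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.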

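The code is available on Google Colab (https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing) and in the LLM Course on GitHub (https://github.com/mlabonne/llm-course).

✂️ What is abliteration?

Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.

In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block ("pre"), between the attention and MLP layers ("mid"), and after the MLP ("post"). The figure in the original post illustrates the location of each residual stream.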

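To uncensor an LLM, we first need to identify the "refusal direction" within the model. This process involves a few technical steps:

Data collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.
Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the "refusal direction" for each layer of the model.
Selection: Normalize these vectors and evaluate them to select the single best "refusal direction."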
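
The mean-difference and normalization steps above can be sketched on toy numbers. This is only an illustration under stated assumptions: the activations are made up, plain Python lists stand in for the PyTorch tensors used later in the article so that the snippet runs without dependencies, and the helper names are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical last-token activations: one row per instruction, d_model = 3
harmful_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

# Difference of the two means, then normalization to a unit vector
diff = [
    a - b
    for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))
]
refusal_dir = normalize(diff)
print(refusal_dir)  # [1.0, 0.0, 0.0]
```

In this toy case the two groups only differ along the first coordinate, so that coordinate alone becomes the candidate direction.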

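Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

Let's talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.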
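
As a rough illustration of the projection-and-subtract step, here is a dependency-free sketch. It assumes the direction is already a unit vector, uses invented numbers, and the function name `ablate_direction` is hypothetical; the real hook used later in this article operates on tensors.

```python
def ablate_direction(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))  # scalar projection
    return [a - coeff * d for a, d in zip(activation, direction)]

direction = [0.6, 0.8, 0.0]   # unit vector: 0.36 + 0.64 = 1
activation = [1.0, 2.0, 3.0]  # hypothetical residual stream activation

ablated = ablate_direction(activation, direction)

# After ablation, the activation has (numerically) no component left
# along the direction, while the orthogonal part is untouched.
residual = sum(a * d for a, d in zip(ablated, direction))
print(ablated, residual)
```

Only the part of the activation that lies along the direction is removed; the third coordinate, which is orthogonal to it, stays at 3.0.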

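Weight orthogonalization, on the other hand, involves modifying the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, it prevents the model from writing to this direction altogether. This is achieved by adjusting the matrices that write to the residual stream so that they contribute nothing along the refusal direction.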
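
To see why modifying the weights is enough, consider a toy matrix that writes into the residual stream. The sketch below assumes a layout of [d_in, d_model] (rows are written into the stream), a unit-norm direction, and made-up values; the helper names are hypothetical and plain lists are used to avoid dependencies.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize_rows(matrix, direction):
    """Subtract from every row its projection onto the unit vector `direction`."""
    return [
        [w - dot(row, direction) * d for w, d in zip(row, direction)]
        for row in matrix
    ]

def vec_mat(x, matrix):
    """Row vector times matrix: x @ matrix."""
    return [dot(x, col) for col in zip(*matrix)]

direction = [0.6, 0.8, 0.0]                   # unit vector
W_out = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # hypothetical [d_in=2, d_model=3]
W_ortho = orthogonalize_rows(W_out, direction)

x = [3.0, -5.0]                               # arbitrary input to the component
leak_before = dot(vec_mat(x, W_out), direction)
leak_after = dot(vec_mat(x, W_ortho), direction)
print(leak_before, leak_after)
```

Because every row of the adjusted matrix is orthogonal to the direction, any linear combination of those rows is too, so no input can make this component write along it.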

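In the next section, we will implement abliteration with weight orthogonalization.

Implementation

The following implementation of abliteration is based on FailSpy's notebook (https://huggingface.co/failspy/llama-3-70B-Instruct-abliterated/blob/main/ortho_cookbook.ipynb), which is itself based on the original authors' notebook (https://colab.research.google.com/drive/1a-aQvKC9avdZpdyBn4jgRQFObTPy1JZw?usp=sharing). I mostly adapted and simplified it to make it easier to understand. This section is quite code-heavy so you can see what is going on, but you can use FailSpy's abliterator library (https://github.com/FailSpy/abliterator) if you're less interested in the technical details (also check his collection of abliterated models on Hugging Face: https://huggingface.co/collections/failspy/abliterated-v3-664a8ad0db255eefa7d0012b).

The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.

First, let's install the necessary packages and import them. All these steps are available in the Google Colab notebook linked above.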

!pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping

import torch
import functools
import einops
import gc

from datasets import load_dataset
from tqdm import tqdm
from torch import Tensor
from typing import List
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
from jaxtyping import Float, Int
from collections import defaultdict

# Turn automatic differentiation off to save GPU memory (credit: Undi95)
torch.set_grad_enabled(False)

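We need two datasets: one containing harmless instructions, and one containing harmful instructions. We'll use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them into two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets.

We will load the instructions and reformat them into a list of dictionaries with "role" and "content" keys. This makes them compatible with the apply_chat_template() method, which we will use to follow Llama 3's chat template.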

def reformat_texts(texts):
    return [[{"role": "user", "content": text}] for text in texts]

# Get harmful and harmless datasets
def get_harmful_instructions():
    dataset = load_dataset('mlabonne/harmful_behaviors')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

def get_harmless_instructions():
    dataset = load_dataset('mlabonne/harmless_alpaca')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

harmful_inst_train, harmful_inst_test = get_harmful_instructions()
harmless_inst_train, harmless_inst_test = get_harmless_instructions()

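Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can't directly load a custom model using HookedTransformer. Here, I use a trick described in FailSpy's notebook to download a custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. Load it in torch.float16 format if your GPU is not compatible with BF16.

In this example, we'll use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article about model merging) that has the highest MMLU score on the Open LLM Leaderboard in the 8B category.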

MODEL_ID = "mlabonne/Daredevil-8B"
MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download and load model
!git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}

# Load model and tokenizer
model = HookedTransformer.from_pretrained_no_processing(
    MODEL_TYPE,
    local_files_only=True,
    dtype=torch.bfloat16,
    default_padding_side='left'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.eos_token
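
A note on the left padding set above: the later steps read activations at position -1, and padding on the left is what keeps that position aligned with the last real token of every prompt in a batch. The sketch below illustrates this with made-up token ids; `PAD` and `pad_batch` are hypothetical names, not part of any library.

```python
PAD = 0  # hypothetical pad token id

def pad_batch(sequences, side):
    """Pad variable-length token lists to a common width on the given side."""
    width = max(len(s) for s in sequences)
    if side == "left":
        return [[PAD] * (width - len(s)) + s for s in sequences]
    return [s + [PAD] * (width - len(s)) for s in sequences]

batch = [[11, 12, 13], [21, 22]]  # two prompts of different lengths

left = pad_batch(batch, "left")
right = pad_batch(batch, "right")

# Position -1 holds a real token for every row only with left padding
last_left = [row[-1] for row in left]
last_right = [row[-1] for row in right]
print(last_left)   # [13, 22]
print(last_right)  # [13, 0]
```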

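We can now tokenize our datasets. We use the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use all the RAM/VRAM, which is why I'm limiting it to 256 here.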

def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids

n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))

# Tokenize datasets
harmful_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmful_inst_train[:n_inst_train],
)
harmless_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmless_inst_train[:n_inst_train],
)

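Everything is set up, so we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations in harmful and harmless. This is managed by the transformer_lens library.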

# Define batch size based on available VRAM
batch_size = 32

# Initialize defaultdicts to store activations
harmful = defaultdict(list)
harmless = defaultdict(list)

# Process the training data in batches
num_batches = (n_inst_train + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    print(i)
    start_idx = i * batch_size
    end_idx = min(n_inst_train, start_idx + batch_size)

    # Run models on harmful and harmless prompts, cache activations
    harmful_logits, harmful_cache = model.run_with_cache(
        harmful_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )
    harmless_logits, harmless_cache = model.run_with_cache(
        harmless_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )

    # Collect and store the activations
    for key in harmful_cache:
        harmful[key].append(harmful_cache[key])
        harmless[key].append(harmless_cache[key])

    # Flush RAM and VRAM
    del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    gc.collect()
    torch.cuda.empty_cache()

# Concatenate the cached activations
harmful = {k: torch.cat(v) for k, v in harmful.items()}
harmless = {k: torch.cat(v) for k, v in harmless.items()}
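
The batching arithmetic in the loop above (ceiling division, then clamping the last slice) can be checked in isolation. This is a small sketch with a hypothetical helper name, `batch_bounds`, that mirrors the `start_idx` and `end_idx` computation.

```python
def batch_bounds(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    num_batches = (n_items + batch_size - 1) // batch_size  # ceiling division
    return [
        (i * batch_size, min(n_items, (i + 1) * batch_size))
        for i in range(num_batches)
    ]

print(len(batch_bounds(256, 32)))  # 8
print(batch_bounds(70, 32))        # [(0, 32), (32, 64), (64, 70)]
```

With 256 samples and a batch size of 32, the loop runs 8 times; when the sample count is not a multiple of the batch size, the last batch is simply shorter.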

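We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort them in descending order in activation_scored.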

# Helper function to get activation index
def get_act_idx(cache_dict, act_name, layer):
    key = (act_name, layer)
    return cache_dict[utils.get_act_name(*key)]

# Compute difference of means between harmful and harmless activations at intermediate layers
activation_layers = ["resid_pre", "resid_mid", "resid_post"]
activation_refusals = defaultdict(list)

for layer_num in range(1, model.cfg.n_layers):
    pos = -1  # Position index

    for layer in activation_layers:
        harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)
        harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(
            dim=0
        )

        refusal_dir = harmful_mean_act - harmless_mean_act
        refusal_dir = refusal_dir / refusal_dir.norm()
        activation_refusals[layer].append(refusal_dir)

# Get all calculated potential refusal directions, sort them in descending order based on their mean
# Use a subset of layers if certain activations are not promising
selected_layers = ["resid_pre"]
activation_scored = sorted(
    [
        activation_refusals[layer][l - 1]
        for l in range(1, model.cfg.n_layers)
        for layer in selected_layers
    ],
    key=lambda x: abs(x.mean()),
    reverse=True,
)
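
One consequence of this sort is worth spelling out: an index into activation_scored is a rank, not a layer number. The sketch below reproduces the same sort key on made-up vectors while keeping the layer number alongside each candidate, which the code above does not do; the dictionary and variable names are hypothetical.

```python
from statistics import fmean

# Hypothetical per-layer candidate directions (layer number -> vector)
candidates = {1: [0.1, -0.1], 2: [0.5, 0.3], 3: [-0.9, -0.5]}

# Same key as above: absolute value of the mean, largest first
ranked = sorted(
    candidates.items(),
    key=lambda item: abs(fmean(item[1])),
    reverse=True,
)
ranked_layers = [layer for layer, _ in ranked]
print(ranked_layers)  # [3, 2, 1]
```

Here rank 0 corresponds to layer 3, so a candidate index chosen after the evaluation step should be read as a position in the sorted list.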

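The final step of the process consists of evaluating the refusal directions we calculated. To do this, we apply each candidate refusal direction to every residual stream and every block during inference. In the following snippet, we get generations for four test harmful instructions and 20 blocks (or layers).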

def _generate_with_hooks(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    tokens: Int[Tensor, "batch_size seq_len"],
    max_tokens_generated: int = 64,
    fwd_hooks=[],
) -> List[str]:
    all_tokens = torch.zeros(
        (tokens.shape[0], tokens.shape[1] + max_tokens_generated),
        dtype=torch.long,
        device=tokens.device,
    )
    all_tokens[:, : tokens.shape[1]] = tokens
    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_tokens[:, : -max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(
                dim=-1
            )  # greedy sampling (temperature=0)
            all_tokens[:, -max_tokens_generated + i] = next_tokens
    return tokenizer.batch_decode(
        all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    )

def get_generations(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    instructions: List[str],
    fwd_hooks=[],
    max_tokens_generated: int = 64,
    batch_size: int = 4,
) -> List[str]:
    generations = []
    for i in tqdm(range(0, len(instructions), batch_size)):
        tokens = tokenize_instructions(
            tokenizer, instructions=instructions[i : i + batch_size]
        )
        generation = _generate_with_hooks(
            model,
            tokenizer,
            tokens,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)
    return generations

# Inference-time intervention hook
def direction_ablation_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    direction: Float[Tensor, "d_act"],
):
    if activation.device != direction.device:
        direction = direction.to(activation.device)
    proj = (
        einops.einsum(
            activation, direction.view(-1, 1), "... d_act, d_act single -> ... single"
        )
        * direction
    )
    return activation - proj

# Testing baseline
N_INST_TEST = 4
baseline_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)
EVAL_N = 20  # Evaluate how many of the top N potential directions
evals = []
for refusal_dir in tqdm(activation_scored[:EVAL_N]):
    hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)
    fwd_hooks = [
        (utils.get_act_name(act_name, layer), hook_fn)
        for layer in list(range(model.cfg.n_layers))
        for act_name in activation_layers
    ]
    intervention_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    )
    evals.append(intervention_generations)

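We store all the generations in the evals list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I'm automatically excluding responses containing "I cannot" and "I can't" to filter out unwanted answers.

If you can't find a layer that satisfies these requirements, you might want to test other residual streams in the previous selected_layers list, other instructions, additional blocks, etc.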

# Print generations for human evaluation
blacklist = ["I cannot", "I can't"]
for i in range(N_INST_TEST):
    print(f"\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}")
    print(f"\nBASELINE COMPLETION:\n{baseline_generations[i]}\033[0m")
    for layer_candidate in range(EVAL_N):
        if not any(word in evals[layer_candidate][i] for word in blacklist):
            print(f"\n---\n\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:")
            print(evals[layer_candidate][i])
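
The same blacklist check can also be used to shortlist candidates before reading the completions by hand. The following is a sketch rather than part of the original notebook: the helper names and the demo strings are invented, and a substring filter like this is only a coarse first pass that does not replace human evaluation.

```python
BLACKLIST = ["I cannot", "I can't"]

def is_refusal(completion, blacklist=BLACKLIST):
    """True if the completion contains any blacklisted phrase."""
    return any(phrase in completion for phrase in blacklist)

def candidates_without_refusals(evals):
    """Indices of candidates whose completions all avoid blacklisted phrases."""
    return [
        idx
        for idx, completions in enumerate(evals)
        if not any(is_refusal(c) for c in completions)
    ]

# Hypothetical completions: evals_demo[candidate][instruction]
evals_demo = [
    ["I cannot help with that.", "Sure, here is an outline."],
    ["Sure, step one is...", "Here is a summary."],
]
print(candidates_without_refusals(evals_demo))  # [1]
```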

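In my case, layer candidate 9 managed to provide uncensored answers for the four instructions. This is the one we will select as the refusal direction. Next, we implement weight orthogonalization to modify the weights and prevent the model from creating outputs with this direction. You can verify that the model is successfully uncensored by printing the completions.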

def get_orthogonalized_matrix(
    matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
) -> Float[Tensor, "... d_model"]:
    proj = (
        einops.einsum(
            matrix, vec.view(-1, 1), "... d_model, d_model single -> ... single"
        )
        * vec
    )
    return matrix - proj

# Select the layer with the highest potential refusal direction
LAYER_CANDIDATE = 9
refusal_dir = activation_scored[LAYER_CANDIDATE]

# Orthogonalize the model's weights
if refusal_dir.device != model.W_E.device:
    refusal_dir = refusal_dir.to(model.W_E.device)
model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)

for block in tqdm(model.blocks):
    if refusal_dir.device != block.attn.W_O.device:
        refusal_dir = refusal_dir.to(block.attn.W_O.device)
    block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)
    block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)

# Generate text with abliterated model
orthogonalized_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Print generations
for i in range(N_INST_TEST):
    if len(baseline_generations) > i:
        print(f"INSTRUCTION {i}: {harmful_inst_test[i]}")
        print(f"\033[92mBASELINE COMPLETION:\n{baseline_generations[i]}")
    print(f"\033[91mINTERVENTION COMPLETION:\n{evals[LAYER_CANDIDATE][i]}")
    print(f"\033[95mORTHOGONALIZED COMPLETION:\n{orthogonalized_generations[i]}\n")

We're now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF Hub.

# Convert model back to HF safetensors
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)
lm_model = hf_model.model

state_dict = model.state_dict()
lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(model.cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=model.cfg.n_heads
        ).contiguous()
    )
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

hf_model.push_to_hub(f"{MODEL_ID}-abliterated")
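
The einops pattern "n h m->m (n h)" used for W_O above may be easier to follow on a tiny example. The sketch below reimplements that index mapping with nested lists, assuming the [n_heads, d_head, d_model] layout implied by the pattern; the function name and the toy values are hypothetical, and the real code works on tensors.

```python
def rearrange_w_o(w_o):
    """Map w_o[n][h][m] to out[m][n * d_head + h], i.e. "n h m -> m (n h)"."""
    n_heads, d_head, d_model = len(w_o), len(w_o[0]), len(w_o[0][0])
    return [
        [w_o[n][h][m] for n in range(n_heads) for h in range(d_head)]
        for m in range(d_model)
    ]

# Toy weights whose value encodes their own indices: 100*n + 10*h + m
n_heads, d_head, d_model = 2, 2, 3
w_o = [
    [[100 * n + 10 * h + m for m in range(d_model)] for h in range(d_head)]
    for n in range(n_heads)
]

hf_weight = rearrange_w_o(w_o)
print(len(hf_weight), len(hf_weight[0]))  # 3 4
print(hf_weight[0])                       # [0, 10, 100, 110]
```

The result has one row per d_model coordinate and one column per (head, head dimension) pair, which matches the [out_features, in_features] layout of a linear layer's weight.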

⚖️ DPO Fine-Tuning

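I evaluated the abliterated and source models from the previous section on the Open LLM Leaderboard and on Nous' benchmark suite. Here are the results:

As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the abliterated version across all benchmarks. The abliteration process successfully uncensored it but also degraded the model's quality.

To address this issue, one idea is to further train our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning. An additional SFT would likely break the model's performance.

Alternatively, preference alignment is quite light and shouldn't lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset. Here's the configuration I used: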

base_model: mlabonne/Daredevil-8B-abliterated
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
save_safetensors: true

rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/orpo-dpo-mix-40k-flat
    split: train
    type: chatml.intel

dataset_prepared_path:
val_set_size: 0.0
output_dir: ./out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 5e-6
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32:

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
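
Two derived quantities can be read off this configuration. The arithmetic below is a back-of-the-envelope sketch: it assumes plain data parallelism across the six GPUs reported for this run (ZeRO-2 shards optimizer state and gradients but still gives each GPU its own micro-batch), and it uses the standard LoRA scaling factor of lora_alpha divided by lora_r.

```python
# Values taken from the configuration above
micro_batch_size = 1
gradient_accumulation_steps = 8
lora_r = 64
lora_alpha = 32

# Assumption: all six GPUs of the reported run act as data-parallel workers
num_gpus = 6

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
lora_scaling = lora_alpha / lora_r

print(effective_batch_size)  # 48
print(lora_scaling)          # 0.5
```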

I trained it using 6xA6000 GPUs with DeepSpeed ZeRO-2. The training took about 6 hours and 45 minutes. Here are the training curves I got from W&B:

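The DPO fine-tuned model, called mlabonne/NeuralDaredevil-8B-abliterated, was uploaded automatically. To see whether it fixed our abliterated version, I evaluated it on the same benchmarks:

We can see that this additional training allowed us to recover most of the performance drop due to abliteration. One area where the model doesn't improve is GSM8K, a math dataset, which could mean that orpo-dpo-mix-40k would benefit from more math samples.

The final model is an uncensored LLM with state-of-the-art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct when you don't need censorship. You can play with quantized versions such as GGUF in LM Studio.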

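Conclusion

In this article, we introduced the concept of abliteration. This technique uses the model's activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model's weights and ensure that it stops outputting refusals. This technique also demonstrates the fragility of safety fine-tuning and raises ethical considerations.

We applied abliteration to Daredevil-8B to uncensor it, which also degraded the model's performance. We then healed it using DPO to create the NeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy's MopeyMule, which adopts a melancholic conversational style.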