"Revisiting Few-sample BERT Fine-tuning"
Layer-wise learning rate decay (LLRD): apply a higher learning rate to the top layers and a lower learning rate to the bottom layers.
"Universal Language Model Fine-tuning for Text Classification"
Discriminative fine-tuning: tune each layer with its own learning rate instead of using the same learning rate for all layers of the model.
---
Different layers in a Transformer model typically capture different kinds of information: the bottom layers usually encode more common, general, broadly applicable information, while the top layers close to the output encode more localized, task-specific information.
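Before the full implementation below, here is a minimal sketch of what this decay schedule works out to numerically, using the same starting rate (3.5e-6) and multiplicative factor (0.9) as the code that follows:

# Sketch: per-layer learning rates under LLRD for a 12-layer encoder.
top_lr, decay = 3.5e-6, 0.9
for depth, layer in enumerate(range(11, -1, -1)):
    print(f"layer {layer:2d}: lr = {top_lr * decay ** depth:.2e}")
print(f"embeddings: lr = {top_lr * decay ** 12:.2e}")  # roughly 1e-6, as noted in the comment below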
"""
# 顶层选择3.5e-6的学习率,并使用0.9的乘法衰减率从上到下逐层降低学习率(底层embedding和layer0的学习率大致接近1e-6)。
# 隐藏层学习率设置如上,而pooler和regressor的学习率设置为3.6e-6,一个比顶层略高的学习率。
def roberta_base_AdamW_LLRD(model):
opt_parameters = []
named_parameters = list(model.named_parameters())
# 定义不需要正则化的层
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
init_lr = 3.5e-6
head_lr = 3.6e-6
lr = init_lr
# params_0实际上只包含 "pooler.dense.bias"
params_0 = [p for n, p in named_parameters if ("pooler" in n or "regressor" in n) and any(nd in n for nd in no_decay)]
# params_1实际上只包含 "pooler.dense.weight"
params_1 = [p for n, p in named_parameters if ("pooler" in n or "regressor" in n) and not any(nd in n for nd in no_decay)]
head_params = {"params": params_0, "lr": head_lr, "weight_decay": 0.0}
opt_parameters.append(head_params)
head_params = {"params": params_1, "lr": head_lr, "weight_decay": 0.01}
opt_parameters.append(head_params)
# 所以opt_parameters增加pooler的参数,学习率设置为3.6e-6,且区分是否需要正则化
# 通过for循环实现学习率衰减
for layer in range(11, -1, -1):
"""
每一轮的params_0包含如下:
1. encoder.layer.{layer}.attention.self.query.bias
2. encoder.layer.{layer}.attention.self.key.bias
3. encoder.layer.{layer}.attention.self.value.bias
4. encoder.layer.{layer}.attention.output.dense.bias
5. encoder.layer.{layer}.attention.output.LayerNorm.weight
6. encoder.layer.{layer}.attention.output.LayerNorm.bias
7. encoder.layer.{layer}.intermediate.dense.bias
8. encoder.layer.{layer}.output.dense.bias
9. encoder.layer.{layer}.output.LayerNorm.weight
10.encoder.layer.{layer}.output.LayerNorm.bias
"""
params_0 = [p for n, p in named_parameters if f"encoder.layer.{layer}." in n and any(nd in n for nd in no_decay)]
"""
每一轮的params_1包含如下:
1. encoder.layer.{layer}.attention.self.query.weight
2. encoder.layer.{layer}.attention.self.key.weight
3. encoder.layer.{layer}.attention.self.value.weight
4. encoder.layer.{layer}.attention.output.dense.weight
5. encoder.layer.{layer}.intermediate.dense.weight
6. encoder.layer.{layer}.output.dense.weight
"""
params_1 = [p for n, p in named_parameters if f"encoder.layer.{layer}." in n and not any(nd in n for nd in no_decay)]
layer_params = {"params": params_0, "lr": lr, "weight_decay": 0.0}
opt_parameters.append(layer_params)
layer_params = {"params": params_1, "lr": lr, "weight_decay": 0.01}
opt_parameters.append(layer_params)
lr *= 0.9
# params_0包含 "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias"
params_0 = [p for n, p in named_parameters if "embeddings" in n and any(nd in n for nd in no_decay)]
"""
params_1包含
1. "embeddings.word_embeddings.weight"
2. "embeddings.position_embeddings.weight"
3. "embeddings.token_type_embeddings.weight"
"""
params_1 = [p for n, p in named_parameters if "embeddings" in n and not any(nd in n for nd in no_decay)]
embed_params = {"params": params_0, "lr": lr, "weight_decay": 0.0}
opt_parameters.append(embed_params)
embed_params = {"params": params_1, "lr": lr, "weight_decay": 0.01}
opt_parameters.append(embed_params)
return transformers.AdamW(opt_parameters, lr=init_lr)
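A usage sketch (an assumption: it loads the bare roberta-base encoder, whose parameter names such as "embeddings.*", "encoder.layer.N.*" and "pooler.dense.*" match the substring checks above; a task-specific wrapper may add a "roberta." prefix or drop the pooler, in which case the matching needs a small adjustment):

from transformers import AutoModel

model = AutoModel.from_pretrained("roberta-base")
optimizer = roberta_base_AdamW_LLRD(model)
for group in optimizer.param_groups:
    print(f"{len(group['params']):3d} tensors  lr={group['lr']:.2e}  weight_decay={group['weight_decay']}")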
# Grouped LLRD: group the layers into sets and apply a different learning rate to each set.
# The 12 hidden layers of roberta-base are split into 3 sets, with the embeddings attached to the first set.
def roberta_base_AdamW_grouped_LLRD(model):
    opt_parameters = []
    named_parameters = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    # Set 1 (embeddings plus layers 0-3) keeps the base learning rate, so only sets 2 and 3 need explicit lists.
    set_2 = ["layer.4", "layer.5", "layer.6", "layer.7"]
    set_3 = ["layer.8", "layer.9", "layer.10", "layer.11"]
    init_lr = 1e-6
    for i, (name, params) in enumerate(named_parameters):
        weight_decay = 0.0 if any(p in name for p in no_decay) else 0.01
        if name.startswith("embeddings") or name.startswith("encoder"):
            lr = init_lr                                                    # set 1: embeddings and layers 0-3
            lr = init_lr * 1.75 if any(p in name for p in set_2) else lr    # set 2: layers 4-7
            lr = init_lr * 3.5 if any(p in name for p in set_3) else lr     # set 3: layers 8-11
            opt_parameters.append({"params": params, "weight_decay": weight_decay, "lr": lr})
        if name.startswith("regressor") or name.startswith("pooler"):
            lr = init_lr * 3.6
            opt_parameters.append({"params": params, "weight_decay": weight_decay, "lr": lr})
    return AdamW(opt_parameters, lr=init_lr)
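A quick check of the grouping (again a sketch against the bare roberta-base encoder): counting how many parameter tensors land in each learning-rate bucket should give 69 at 1e-6 (embeddings plus layers 0-3), 64 at 1.75e-6 (layers 4-7), 64 at 3.5e-6 (layers 8-11), and 2 at 3.6e-6 (the pooler).

from collections import Counter
from transformers import AutoModel

model = AutoModel.from_pretrained("roberta-base")
optimizer = roberta_base_AdamW_grouped_LLRD(model)
# One parameter group per named tensor; bucket them by the learning rate they were assigned.
buckets = Counter(group["lr"] for group in optimizer.param_groups)
for lr_value, count in sorted(buckets.items()):
    print(f"lr={lr_value:.2e}: {count} tensors")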
# Distinguish the layer groups by parameter index instead of by name.
def create_optimizer(model):
    named_parameters = list(model.named_parameters())
    # Indices 0-196 cover the embeddings and the 12 encoder layers of roberta-base; the two slices below
    # assume a custom attention head and a regressor appended on top (indices 197-198, presumably the pooler, are skipped).
    roberta_parameters = named_parameters[:197]
    attention_parameters = named_parameters[199:203]
    regressor_parameters = named_parameters[203:]
    attention_group = [params for (name, params) in attention_parameters]
    regressor_group = [params for (name, params) in regressor_parameters]
    parameters = []
    # Groups without an explicit "lr" fall back to the optimizer's default learning rate.
    parameters.append({"params": attention_group})
    parameters.append({"params": regressor_group})
    for layer_num, (name, params) in enumerate(roberta_parameters):
        weight_decay = 0.0 if "bias" in name else 0.01
        lr = 2e-5
        # layer.4, layer.5, layer.6, layer.7
        if layer_num >= 69:
            lr = 5e-5
        # layer.8, layer.9, layer.10, layer.11
        if layer_num >= 133:
            lr = 1e-4
        parameters.append({"params": params, "weight_decay": weight_decay, "lr": lr})
    return AdamW(parameters)
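The hard-coded indices come straight from roberta-base's parameter layout: 5 embedding tensors followed by 16 tensors per encoder layer, so layer.4 starts at index 5 + 4*16 = 69 and layer.8 at index 5 + 8*16 = 133. The attention_parameters and regressor_parameters slices assume a custom head appended after the pooler; the boundaries themselves can be checked on the bare encoder with a short sketch:

from transformers import AutoModel

names = [name for name, _ in AutoModel.from_pretrained("roberta-base").named_parameters()]
print(len(names))      # 199 tensors for the bare encoder: 5 embeddings + 12*16 layer tensors + 2 pooler tensors
print(names[69])       # first tensor of encoder.layer.4 -> lr switches to 5e-5 here
print(names[133])      # first tensor of encoder.layer.8 -> lr switches to 1e-4 here
print(names[197:199])  # pooler.dense.weight / pooler.dense.bias, skipped by the slices above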
# How to use AdamW
"""
AdamW parameters (as in torch.optim.AdamW):
1. params: an iterable of parameters to optimize, or dicts defining parameter groups
2. lr: learning rate. Default: 1e-3
3. weight_decay: weight decay coefficient. Default: 1e-2
4. maximize: maximize the objective with respect to the params instead of minimizing it. Default: False
To freeze part of the model's parameters during training and update only the rest:
1. Set requires_grad to False for the layers whose parameters should not be updated
"""
for name, param in model.named_parameters():
    if "fc1" in name:
        param.requires_grad = False
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # all parameters are passed in
"""
2. 在定义优化器时只传入要更新的参数(最优做法,占用的内存更小,效率更高)
"""
optimizer = torch.optim.SGD(model.fc2.parameters(), lr=1e-2) # 只传入fc2的参数
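The two approaches also combine naturally: flag the frozen layers with requires_grad = False, then hand the optimizer only the parameters that still require gradients. A minimal sketch (fc1/fc2 stand in for whatever sub-modules the model actually has):

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],  # frozen parameters are filtered out
    lr=1e-2,
)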
"""
评论已关闭