Implementing BERT from Scratch with PyTorch

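Summary: Finally, we are ready to train the model. In short, open the main.py script, check the training parameters, and run it. I trained the model on an NVIDIA GeForce 1050 Ti GPU. If CUDA is available, the model trains on the GPU by default. EPOCHS = 4, the embedding size is 64, the hidden attention context size is 36, the batch size is 12, the number of attention heads is 4, and the number of encoders is 1. The learning rate is 7e-5. We use TensorBoard to track the training process. After launching the training script you should see it prepare the IMDB dataset, and then training begins.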

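This post is an attempt to write a complete tutorial on building the BERT architecture with PyTorch. The full code for this tutorial is available at pytorch_bert.

BERT stands for Bidirectional Encoder Representations from Transformers. The original paper, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, explains essentially everything you need to know about BERT.

Honestly, there are many better articles on the internet that explain what BERT is, for example BERT Explained: State of the art language model for NLP. After reading that article you may still have questions about the attention mechanism; the article Illustrated: Self-Attention explains attention.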

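In this section I only recap the ideas behind BERT and focus more on the practical implementation. BERT solves two tasks simultaneously: next sentence prediction (NSP) and masked language modeling (MLM).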

NSP is a binary classification task. Given two sentences as input, the model should predict whether the second sentence is the true continuation of the first.

Sentence	NSP class
I have a cat named Tom. Tom likes to play with birds sitting on the window	is next
I have a cat named Tom. We walk together every day	is not next

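import random
import typing
from collections import Counter

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import vocab
from tqdm import tqdm


class IMDBBertDataset(Dataset):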
    # Define Special tokens as attributes of class
    CLS = '[CLS]'
    PAD = '[PAD]'
    SEP = '[SEP]'
    MASK = '[MASK]'
    UNK = '[UNK]'

    MASK_PERCENTAGE = 0.15  # Fraction of words in a sentence to mask

    MASKED_INDICES_COLUMN = 'masked_indices'
    TARGET_COLUMN = 'indices'
    NSP_TARGET_COLUMN = 'is_next'
    TOKEN_MASK_COLUMN = 'token_mask'

    OPTIMAL_LENGTH_PERCENTILE = 70

    def __init__(self, path, ds_from=None, ds_to=None, should_include_text=False):
        self.ds: pd.Series = pd.read_csv(path)['review']

        if ds_from is not None or ds_to is not None:
            self.ds = self.ds[ds_from:ds_to]

        self.tokenizer = get_tokenizer('basic_english')
        self.counter = Counter()
        self.vocab = None

        self.optimal_sentence_length = None
        self.should_include_text = should_include_text

        if should_include_text:
            self.columns = ['masked_sentence', self.MASKED_INDICES_COLUMN, 'sentence', self.TARGET_COLUMN,
                            self.TOKEN_MASK_COLUMN,
                            self.NSP_TARGET_COLUMN]
        else:
            self.columns = [self.MASKED_INDICES_COLUMN, self.TARGET_COLUMN, self.TOKEN_MASK_COLUMN,
                            self.NSP_TARGET_COLUMN]
        self.df = self.prepare_dataset()

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        ...
    
    def prepare_dataset(self) -> pd.DataFrame:
        ...

...
if should_include_text:
    self.columns = ['masked_sentence', self.MASKED_INDICES_COLUMN, 'sentence', self.TARGET_COLUMN,
                    self.TOKEN_MASK_COLUMN,
                    self.NSP_TARGET_COLUMN]
else:
    self.columns = [self.MASKED_INDICES_COLUMN, self.TARGET_COLUMN, self.TOKEN_MASK_COLUMN,
                    self.NSP_TARGET_COLUMN]
...

sentences = []  
nsp = []  
sentence_lens = []

# Split dataset on sentences
for review in self.ds:
    review_sentences = review.split('. ')
    sentences += review_sentences
    self._update_length(review_sentences, sentence_lens)
self.optimal_sentence_length = self._find_optimal_sentence_length(sentence_lens)

['One of the other reviewers has mentioned that after watching just 1 Oz '
 "episode you'll be hooked",
 'They are right, as this is exactly what happened with me.<br /><br />The '
 'first thing that struck me about Oz was its brutality and unflinching scenes '
 'of violence, which set in right from the word GO']

def _find_optimal_sentence_length(self, lengths: typing.List[int]):  
    arr = np.array(lengths)  
    return int(np.percentile(arr, self.OPTIMAL_LENGTH_PERCENTILE))
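To see what the percentile cut-off does, here is a minimal sketch with made-up sentence lengths (the numbers are illustrative, not taken from the IMDB data). With the 70th percentile, roughly 70% of sentences fit without truncation, and the rest get cut to that length.

```python
import numpy as np

# Hypothetical token counts for ten sentences (not from the real dataset).
sentence_lens = [3, 5, 8, 10, 12, 15, 18, 22, 30, 60]

OPTIMAL_LENGTH_PERCENTILE = 70  # same value as in IMDBBertDataset

# np.percentile interpolates linearly between neighbouring values;
# int() then truncates the result to a whole number of tokens.
optimal = int(np.percentile(np.array(sentence_lens), OPTIMAL_LENGTH_PERCENTILE))

# Count how many sentences fit without truncation.
fits = sum(1 for n in sentence_lens if n <= optimal)
print(optimal, fits)
```

The single very long sentence (60 tokens) barely moves the cut-off, which is the reason for using a percentile instead of the maximum length.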

print("Create vocabulary")  
for sentence in tqdm(sentences):  
    s = self.tokenizer(sentence)  
    self.counter.update(s)  
  
self._fill_vocab()

"My cat is Tom" -> ['my', 'cat', 'is', 'tom']

Counter({'the': 6929,
         ',': 5753,
         'and': 3409,
         'a': 3385,
         'of': 3073,
         'to': 2774,
         "'": 2692,
         '.': 2184,
         'is': 2123,
         ...

def _fill_vocab(self):  
    # specials= argument is only in 0.12.0 version  
    # specials=[self.CLS, self.PAD, self.MASK, self.SEP, self.UNK]
    self.vocab = vocab(self.counter, min_freq=2)  

    # 0.11.0 uses this approach to insert specials  
    self.vocab.insert_token(self.CLS, 0)  
    self.vocab.insert_token(self.PAD, 1)  
    self.vocab.insert_token(self.MASK, 2)  
    self.vocab.insert_token(self.SEP, 3)  
    self.vocab.insert_token(self.UNK, 4)  
    self.vocab.set_default_index(4)

self.vocab.lookup_indices(["[CLS]", "this", "works", "[MASK]", "well"])

[0, 29, 1555, 2, 152]
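The idea behind the vocabulary can be sketched without torchtext. The helper names below (`build_vocab`, `lookup_indices`) are hypothetical stand-ins, not the torchtext API; the sketch assumes the same special-token order as the `insert_token` calls above and the same `min_freq=2` filter.

```python
from collections import Counter

SPECIALS = ['[CLS]', '[PAD]', '[MASK]', '[SEP]', '[UNK]']

def build_vocab(counter, min_freq=2):
    # Special tokens take indices 0..4; frequent words follow in counter order.
    itos = list(SPECIALS)
    itos += [tok for tok, freq in counter.items()
             if freq >= min_freq and tok not in SPECIALS]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def lookup_indices(stoi, tokens):
    # Words that are missing (or too rare) fall back to [UNK],
    # mirroring set_default_index(4) above.
    unk = stoi['[UNK]']
    return [stoi.get(tok, unk) for tok in tokens]

counter = Counter("this works well and this works fast".split())
itos, stoi = build_vocab(counter)
indices = lookup_indices(stoi, ['[CLS]', 'this', 'works', '[MASK]', 'well'])
print(indices)
```

Here "well" appears only once, so it is dropped by the frequency filter and maps to the `[UNK]` index.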

print("Preprocessing dataset")  
for review in tqdm(self.ds):  
    review_sentences = review.split('. ')  
    if len(review_sentences) > 1:  
        for i in range(len(review_sentences) - 1):  
            # True NSP item  
            first, second = self.tokenizer(review_sentences[i]), self.tokenizer(review_sentences[i + 1])  
            nsp.append(self._create_item(first, second, 1))  
  
            # False NSP item  
            first, second = self._select_false_nsp_sentences(sentences)  
            first, second = self.tokenizer(first), self.tokenizer(second)  
            nsp.append(self._create_item(first, second, 0))  
df = pd.DataFrame(nsp, columns=self.columns)
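The loop above relies on `_select_false_nsp_sentences`, which is not shown in this post. The following is a sketch of how positive and negative NSP pairs could be built; `select_false_pair` and `build_nsp_pairs` are hypothetical names, and the rejection rule (never pick the sentence that directly follows the first one) is an assumption about what the helper does.

```python
import random

def select_false_pair(sentences, rng):
    # Pick two random sentences, rejecting the case where the second
    # one is the direct successor of the first in the corpus.
    first_idx = rng.randint(0, len(sentences) - 1)
    second_idx = rng.randint(0, len(sentences) - 1)
    while second_idx == first_idx + 1:
        second_idx = rng.randint(0, len(sentences) - 1)
    return sentences[first_idx], sentences[second_idx]

def build_nsp_pairs(review_sentences, all_sentences, rng):
    pairs = []
    for i in range(len(review_sentences) - 1):
        # Positive item: two consecutive sentences from the same review.
        pairs.append((review_sentences[i], review_sentences[i + 1], 1))
        # Negative item: a random pair drawn from the whole corpus.
        first, second = select_false_pair(all_sentences, rng)
        pairs.append((first, second, 0))
    return pairs

review = ['I have a cat named Tom',
          'Tom likes to play with birds',
          'We walk together every day']
corpus = review + ['The movie was too long', 'I liked the soundtrack']
pairs = build_nsp_pairs(review, corpus, random.Random(0))
```

Every positive item is matched by one negative item, so the NSP classes stay balanced.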

def _create_item(self, first: typing.List[str], second: typing.List[str], target: int = 1):  
    # Create masked sentence item  
    updated_first, first_mask = self._preprocess_sentence(first.copy())  
    updated_second, second_mask = self._preprocess_sentence(second.copy())
    nsp_sentence = updated_first + [self.SEP] + updated_second  
    nsp_indices = self.vocab.lookup_indices(nsp_sentence)  
    inverse_token_mask = first_mask + [True] + second_mask

def _mask_sentence(self, sentence: typing.List[str]):  
    len_s = len(sentence)  
    inverse_token_mask = [True for _ in range(max(len_s, self.optimal_sentence_length))]  
  
    mask_amount = round(len_s * self.MASK_PERCENTAGE)  
    for _ in range(mask_amount):  
        i = random.randint(0, len_s - 1)  
  
        if random.random() < 0.8:  
            sentence[i] = self.MASK  
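        else:
            # Replace with a random non-special token; indices below 5 are the special tokens
            j = random.randint(5, len(self.vocab) - 1)
            sentence[i] = self.vocab.lookup_token(j)
        inverse_token_mask[i] = False
    return sentence, inverse_token_mask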
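A standalone sketch of the masking step may help. It is simplified: `mask_tokens` is a hypothetical function, the replacement words come from a small made-up pool instead of the vocabulary, and the mask has the same length as the sentence instead of being extended to the optimal length.

```python
import random

MASK = '[MASK]'
MASK_PERCENTAGE = 0.15

def mask_tokens(tokens, replacement_pool, rng):
    tokens = list(tokens)  # work on a copy
    # True means "leave this position alone"; False marks a masked slot.
    inverse_token_mask = [True] * len(tokens)
    mask_amount = round(len(tokens) * MASK_PERCENTAGE)
    for _ in range(mask_amount):
        i = rng.randint(0, len(tokens) - 1)
        if rng.random() < 0.8:
            tokens[i] = MASK                          # 80%: replace with [MASK]
        else:
            tokens[i] = rng.choice(replacement_pool)  # 20%: replace with a random word
        inverse_token_mask[i] = False
    return tokens, inverse_token_mask

rng = random.Random(0)
original = ("one of the other reviewers has mentioned that "
            "after watching just one episode").split()
masked, inverse_mask = mask_tokens(original, ['cat', 'bird', 'window'], rng)
print(masked)
print(inverse_mask)
```

Because positions are drawn with replacement, the same index can be picked twice, so the number of masked slots can be smaller than `mask_amount`.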

# Create sentence item without masking random words  
first, _ = self._preprocess_sentence(first.copy(), should_mask=False)  
second, _ = self._preprocess_sentence(second.copy(), should_mask=False)  
original_nsp_sentence = first + [self.SEP] + second  
original_nsp_indices = self.vocab.lookup_indices(original_nsp_sentence)

def _pad_sentence(self, sentence: typing.List[str], inverse_token_mask: typing.List[bool] = None):  
    len_s = len(sentence)  
  
    if len_s >= self.optimal_sentence_length:  
        s = sentence[:self.optimal_sentence_length]  
    else:  
        s = sentence + [self.PAD] * (self.optimal_sentence_length - len_s)  
  
    # inverse token mask should be padded as well  
    if inverse_token_mask:  
        len_m = len(inverse_token_mask)  
        if len_m >= self.optimal_sentence_length:  
            inverse_token_mask = inverse_token_mask[:self.optimal_sentence_length]  
        else:  
            inverse_token_mask = inverse_token_mask + [True] * (self.optimal_sentence_length - len_m)  
    return s, inverse_token_mask
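The same padding logic can be tried on its own. `pad_or_truncate` below is a hypothetical, compact rewrite of `_pad_sentence` for illustration; it takes the target length as an argument instead of reading it from the dataset object.

```python
PAD = '[PAD]'

def pad_or_truncate(tokens, mask, length):
    # Cut long inputs, then fill short ones up to the target length.
    tokens = tokens[:length] + [PAD] * max(0, length - len(tokens))
    # Padded positions are never prediction targets, so they get True.
    mask = mask[:length] + [True] * max(0, length - len(mask))
    return tokens, mask

short_tokens, short_mask = pad_or_truncate(
    ['[CLS]', 'my', '[MASK]', 'is', 'tom'],
    [True, True, False, True, True],
    7,
)
long_tokens, long_mask = pad_or_truncate(
    ['[CLS]', 'a', 'b', 'c', 'd'], [True] * 5, 3
)
print(short_tokens, short_mask)
print(long_tokens, long_mask)
```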

...
nsp_sentence = updated_first + [self.SEP] + updated_second  
nsp_indices = self.vocab.lookup_indices(nsp_sentence)
...

                                       masked_sentence  ... is_next
0     [[CLS], [MASK], of, the, other, reviewers, has...  ...       1
1     [[CLS], once, fifteen, arrived, in, the, ameri...  ...       0
2     [[CLS], they, [MASK], [MASK], ,, as, this, is,...  ...       1
3     [[CLS], just, a, [MASK], of, [MASK], young, ma...  ...       0
4     [[CLS], trust, me, [MASK], this, is, [MASK], a...  ...       1
                                                    ...  ...     ...
8873  [[CLS], freshness, crystal, is, here, to, sell...  ...       0
8874  [[CLS], pixar, have, proved, that, they, ', re...  ...       1
8875  [[CLS], [MASK], abandons, her, slapstick, [MAS...  ...       0
8876  [[CLS], they, raise, the, bar, [MASK], ,, and,...  ...       1
8877  [[CLS], he, is, an, amazing, [MASK], artist, ,...  ...       0
[8878 rows x 6 columns]

masked_sentence    [[CLS], one, of, the, other, [MASK], has, ment...
masked_indices     [0, 5, 6, 7, 8, 2, 10, 11, 4825, 13, 2, 15, 16...
sentence           [[CLS], one, of, the, other, reviewers, has, m...
indices            [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
token_mask         [True, True, True, True, False, True, True, Fa...
is_next                                                            1
Name: 0, dtype: object

item = self.df.iloc[idx]

inp = torch.Tensor(item[self.MASKED_INDICES_COLUMN]).long()
token_mask = torch.Tensor(item[self.TOKEN_MASK_COLUMN]).bool()

attention_mask = (inp == self.vocab[self.PAD]).unsqueeze(0)

if item[self.NSP_TARGET_COLUMN] == 0:  
    t = [1, 0]  
else:  
    t = [0, 1]  
  
nsp_target = torch.Tensor(t)

[1, 0] is NOT next
[0, 1] is next

mask_target = torch.Tensor(item[self.TARGET_COLUMN]).long()  
mask_target = mask_target.masked_fill_(token_mask, 0)
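A small worked example of the MLM target, with made-up token ids: every position that was not masked is overwritten with 0, so only the masked slot keeps its original id. Since the loss further down uses `ignore_index=0`, those zeroed positions do not contribute to the MLM loss.

```python
import torch

# Hypothetical original token ids for one short sentence.
indices = torch.tensor([0, 5, 6, 7, 8, 9, 1, 1])
# False marks the single position that was masked in the input.
token_mask = torch.tensor([True, True, True, True, True, False, True, True])

# clone() keeps the original ids intact; masked_fill_ works in place.
mask_target = indices.clone().masked_fill_(token_mask, 0)
print(mask_target)
```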

Input tokens:   [0, 6, 24, 565, 67, 0, 443, 123, 5, 6, 5, 12, 1, 1, 1]
Input Segments: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Input tokens:   [0, 6, 24, 565, 67, 0, 443, 123, 5, 6, 5, 12, 1, 1, 1]
Input position: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

class JointEmbedding(nn.Module):

    def __init__(self, vocab_size, size):
        super(JointEmbedding, self).__init__()

        self.size = size

        self.token_emb = nn.Embedding(vocab_size, size)
        self.segment_emb = nn.Embedding(vocab_size, size)

        self.norm = nn.LayerNorm(size)

    def forward(self, input_tensor):
        sentence_size = input_tensor.size(-1)
        pos_tensor = self.attention_position(self.size, input_tensor)

        segment_tensor = torch.zeros_like(input_tensor).to(device)
        segment_tensor[:, sentence_size // 2 + 1:] = 1

        output = self.token_emb(input_tensor) + self.segment_emb(segment_tensor) + pos_tensor
        return self.norm(output)

    def attention_position(self, dim, input_tensor):
        batch_size = input_tensor.size(0)
        sentence_size = input_tensor.size(-1)

        pos = torch.arange(sentence_size, dtype=torch.long).to(device)
        d = torch.arange(dim, dtype=torch.long).to(device)
        d = (2 * d / dim)

        pos = pos.unsqueeze(1)
        pos = pos / (1e4 ** d)

        pos[:, ::2] = torch.sin(pos[:, ::2])
        pos[:, 1::2] = torch.cos(pos[:, 1::2])

        return pos.expand(batch_size, *pos.size())

    def numeric_position(self, dim, input_tensor):
        pos_tensor = torch.arange(dim, dtype=torch.long).to(device)
        return pos_tensor.expand_as(input_tensor)

self.token_emb = nn.Embedding(vocab_size, size)
self.segment_emb = nn.Embedding(vocab_size, size)

pos_tensor = self.attention_position(self.size, input_tensor)

segment_tensor = torch.zeros_like(input_tensor).to(device)
segment_tensor[:, sentence_size // 2 + 1:] = 1

output = self.token_emb(input_tensor) + self.segment_emb(segment_tensor) + pos_tensor
return self.norm(output)
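The two non-learned inputs of the embedding can be inspected in isolation. The sketch below mirrors the segment split and the `attention_position` computation shown above; `sinusoidal_positions` is a hypothetical standalone name. Note that the segment split is purely positional, so it matches the data only under the assumption that the second sentence starts just after the middle of the sequence.

```python
import torch

# Segment ids: everything after the middle position belongs to sentence B.
input_tensor = torch.zeros(2, 8, dtype=torch.long)  # batch of 2, 8 tokens each
sentence_size = input_tensor.size(-1)
segment_tensor = torch.zeros_like(input_tensor)
segment_tensor[:, sentence_size // 2 + 1:] = 1

def sinusoidal_positions(sentence_size, dim):
    pos = torch.arange(sentence_size, dtype=torch.float).unsqueeze(1)  # (sentence_size, 1)
    d = 2 * torch.arange(dim, dtype=torch.float) / dim                 # (dim,)
    angles = pos / (1e4 ** d)                                          # (sentence_size, dim)
    angles[:, ::2] = torch.sin(angles[:, ::2])     # even columns: sine
    angles[:, 1::2] = torch.cos(angles[:, 1::2])   # odd columns: cosine
    return angles

pe = sinusoidal_positions(sentence_size, 4)
print(segment_tensor[0])
print(pe.shape)
```

At position 0 every angle is 0, so the even columns are sin(0) = 0 and the odd columns are cos(0) = 1, which is a quick sanity check for the encoding.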

class AttentionHead(nn.Module):  
  
    def __init__(self, dim_inp, dim_out):  
        super(AttentionHead, self).__init__()  
  
        self.dim_inp = dim_inp  
  
        self.q = nn.Linear(dim_inp, dim_out)  
        self.k = nn.Linear(dim_inp, dim_out)  
        self.v = nn.Linear(dim_inp, dim_out)  
  
    def forward(self, input_tensor: torch.Tensor, attention_mask: torch.Tensor = None):  
        query, key, value = self.q(input_tensor), self.k(input_tensor), self.v(input_tensor)  
  
        scale = query.size(1) ** 0.5  
        scores = torch.bmm(query, key.transpose(1, 2)) / scale  
  
        scores = scores.masked_fill_(attention_mask, -1e9)  
        attn = f.softmax(scores, dim=-1)  
        context = torch.bmm(attn, value)  
  
        return context

# input tensor is the output of JointEmbedding module
# attention mask is the vector that masks [PAD] tokens
def forward(self, input_tensor: size (2 x 5 x 4), attention_mask: size (2 x 1 x 5)):

query, key, value = size (2 x 5 x 6), size (2 x 5 x 6), size (2 x 5 x 6)

scale = query.size(1) ** 0.5  
scores = torch.bmm(query, key.transpose(1, 2)) / scale = size (2 x 5 x 5)

scores = scores.masked_fill_(attention_mask, -1e9) = size (2 x 5 x 5)

attn = f.softmax(scores, dim=-1) = size (2 x 5 x 5)
context = torch.bmm(attn, value) = size (2 x 5 x 6)
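The shape walk-through above can be checked with a runnable sketch. One deliberate difference, stated as an assumption: the code above divides by `query.size(1) ** 0.5`, which is the square root of the sequence length, while the Transformer paper scales by the square root of the key dimension. The sketch uses the key dimension; with these sizes either choice gives the same shapes.

```python
import torch
import torch.nn.functional as f

torch.manual_seed(0)
batch, seq_len, dim_inp, dim_out = 2, 5, 4, 6

x = torch.randn(batch, seq_len, dim_inp)             # output of the embedding
q, k, v = (torch.nn.Linear(dim_inp, dim_out) for _ in range(3))
query, key, value = q(x), k(x), v(x)                 # each (2, 5, 6)

# True marks [PAD] positions; here the last token of every sequence is padding.
attention_mask = torch.zeros(batch, 1, seq_len, dtype=torch.bool)
attention_mask[:, :, -1] = True

scale = key.size(-1) ** 0.5                          # sqrt(d_k), an assumption (see above)
scores = torch.bmm(query, key.transpose(1, 2)) / scale   # (2, 5, 5)
scores = scores.masked_fill(attention_mask, -1e9)    # mask broadcasts over the query axis
attn = f.softmax(scores, dim=-1)                     # (2, 5, 5), rows sum to 1
context = torch.bmm(attn, value)                     # (2, 5, 6)
print(context.shape)
```

After the softmax, the masked column carries effectively zero weight, so padding tokens do not influence the context vectors.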

class MultiHeadAttention(nn.Module):  
  
    def __init__(self, num_heads, dim_inp, dim_out):  
        super(MultiHeadAttention, self).__init__()  
  
        self.heads = nn.ModuleList([  
            AttentionHead(dim_inp, dim_out) for _ in range(num_heads)  
        ])  
        self.linear = nn.Linear(dim_out * num_heads, dim_inp)  
        self.norm = nn.LayerNorm(dim_inp)  
  
    def forward(self, input_tensor: torch.Tensor, attention_mask: torch.Tensor):  
        s = [head(input_tensor, attention_mask) for head in self.heads]  
        scores = torch.cat(s, dim=-1)  
        scores = self.linear(scores)  
        return self.norm(scores)

self.linear = nn.Linear(dim_out * num_heads, dim_inp) = nn.Linear(6 * 3, 4)

def forward(self, input_tensor: size (2 x 5 x 4), attention_mask: size (2 x 1 x 5)):

s = [head(input_tensor, attention_mask) for head in self.heads]
s = [
    tensor(2 x 5 x 6),
    tensor(2 x 5 x 6),
    tensor(2 x 5 x 6),
]

scores = torch.cat(s, dim=-1) = tensor(2 x 5 x 18)

scores = self.linear(scores) = tensor(2 x 5 x 4)
return self.norm(scores)
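The concatenate-then-project step is easy to verify with random tensors standing in for the head outputs. The sizes follow the walk-through (3 heads, each producing 2 x 5 x 6); the tensors themselves are random placeholders, not real attention outputs.

```python
import torch

batch, seq_len, dim_inp, dim_out, num_heads = 2, 5, 4, 6, 3

# Stand-ins for the per-head context tensors.
head_outputs = [torch.randn(batch, seq_len, dim_out) for _ in range(num_heads)]

concatenated = torch.cat(head_outputs, dim=-1)            # (2, 5, 18)
linear = torch.nn.Linear(dim_out * num_heads, dim_inp)    # 18 -> 4
projected = linear(concatenated)                          # (2, 5, 4)
normalized = torch.nn.LayerNorm(dim_inp)(projected)       # shape unchanged
print(concatenated.shape, normalized.shape)
```

Projecting back to `dim_inp` is what lets the output of one block feed the next layer with the same input size.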

class Encoder(nn.Module):  
  
    def __init__(self, dim_inp, dim_out, attention_heads=4, dropout=0.1):  
        super(Encoder, self).__init__()  
  
        self.attention = MultiHeadAttention(attention_heads, dim_inp, dim_out) 
        self.feed_forward = nn.Sequential(  
            nn.Linear(dim_inp, dim_out),  
            nn.Dropout(dropout),  
            nn.GELU(),  
            nn.Linear(dim_out, dim_inp),  
            nn.Dropout(dropout)  
        )
        self.norm = nn.LayerNorm(dim_inp)  
  
    def forward(self, input_tensor: torch.Tensor, attention_mask: torch.Tensor):  
        context = self.attention(input_tensor, attention_mask)  
        res = self.feed_forward(context)  
        return self.norm(res)

self.feed_forward = nn.Sequential(  
    nn.Linear(dim_inp, dim_out),  
    nn.Dropout(dropout),  
    nn.GELU(),
    nn.Linear(dim_out, dim_inp),
    nn.Dropout(dropout)  
)

def forward(self, input_tensor: torch.Tensor, attention_mask: torch.Tensor):  
    context = self.attention(input_tensor, attention_mask)  
    res = self.feed_forward(context)  
    return self.norm(res)

class BERT(nn.Module):  
  
    def __init__(self, vocab_size, dim_inp, dim_out, attention_heads=4):  
        super(BERT, self).__init__()  
  
        self.embedding = JointEmbedding(vocab_size, dim_inp)  
        self.encoder = Encoder(dim_inp, dim_out, attention_heads)  
  
        self.token_prediction_layer = nn.Linear(dim_inp, vocab_size)  
        self.softmax = nn.LogSoftmax(dim=-1)  
        self.classification_layer = nn.Linear(dim_inp, 2)  
  
    def forward(self, input_tensor: torch.Tensor, attention_mask: torch.Tensor):  
        embedded = self.embedding(input_tensor)  
        encoded = self.encoder(embedded, attention_mask)  
  
        token_predictions = self.token_prediction_layer(encoded)  
  
        first_word = encoded[:, 0, :]  
        return self.softmax(token_predictions), self.classification_layer(first_word)

self.token_prediction_layer = nn.Linear(dim_inp, vocab_size)
self.softmax = nn.LogSoftmax(dim=-1)

self.classification_layer = nn.Linear(dim_inp,  2)

argmax(NSP output) = [1, 0] is NOT next sentence
argmax(NSP output) = [0, 1] is next sentence

embedded = self.embedding(input_tensor)  
encoded = self.encoder(embedded, attention_mask)

token_predictions = self.token_prediction_layer(encoded)  
 
first_word = encoded[:, 0, :]
return self.softmax(token_predictions), self.classification_layer(first_word)

tensorboard --logdir data/logs

class BertTrainer:  
  
    def __init__(self,  
                 model: BERT,  
                 dataset: IMDBBertDataset,  
                 log_dir: Path,  
                 checkpoint_dir: Path = None,  
                 print_progress_every: int = 10,  
                 print_accuracy_every: int = 50,  
                 batch_size: int = 24,  
                 learning_rate: float = 0.005,  
                 epochs: int = 5,  
                 ):  
        self.model = model  
        self.dataset = dataset  
  
        self.batch_size = batch_size  
        self.epochs = epochs  
        self.current_epoch = 0  
  
        self.loader = DataLoader(self.dataset, batch_size=self.batch_size, shuffle=True)  
  
        self.writer = SummaryWriter(str(log_dir))  
        self.checkpoint_dir = checkpoint_dir  
  
        self.criterion = nn.BCEWithLogitsLoss().to(device)  
        self.ml_criterion = nn.NLLLoss(ignore_index=0).to(device)  
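        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=0.015)

        # Reporting intervals used by train()
        self._print_every = print_progress_every
        self._accuracy_every = print_accuracy_every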

self.criterion = nn.BCEWithLogitsLoss().to(device)  
self.ml_criterion = nn.NLLLoss(ignore_index=0).to(device)
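To make the two criteria concrete, here is a minimal sketch on toy tensors (the sizes and values are made up). It shows the one-hot float target that `BCEWithLogitsLoss` expects, and the `(batch, vocab, seq)` layout that `NLLLoss` expects, which is why the trainer transposes the token output.

```python
import math
import torch
from torch import nn

# NSP: two logits per example, one-hot float targets.
criterion = nn.BCEWithLogitsLoss()
nsp_logits = torch.zeros(4, 2)  # a model with no preference yet
nsp_target = torch.tensor([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
nsp_loss = criterion(nsp_logits, nsp_target)

# MLM: log-probabilities over the vocabulary, class index 0 is ignored.
ml_criterion = nn.NLLLoss(ignore_index=0)
vocab_size = 10
log_probs = torch.log_softmax(torch.zeros(1, 3, vocab_size), dim=-1)  # (batch, seq, vocab)
token_target = torch.tensor([[0, 7, 0]])  # only the middle position is a real target
mlm_loss = ml_criterion(log_probs.transpose(1, 2), token_target)      # input is (batch, vocab, seq)

print(round(nsp_loss.item(), 4), round(mlm_loss.item(), 4))
```

With all-zero logits the NSP loss equals ln 2 and the MLM loss equals ln(vocab_size), averaged over the single non-ignored position.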

def train(self, epoch: int):  
    print(f"Begin epoch {epoch}")  
  
    prev = time.time()  
    average_nsp_loss = 0  
    average_mlm_loss = 0  
    for i, value in enumerate(self.loader):  
        index = i + 1  
        inp, mask, inverse_token_mask, token_target, nsp_target = value  
        self.optimizer.zero_grad()  
  
        token, nsp = self.model(inp, mask)  
  
        tm = inverse_token_mask.unsqueeze(-1).expand_as(token)  
        token = token.masked_fill(tm, 0)  
  
        loss_token = self.ml_criterion(token.transpose(1, 2), token_target)
        loss_nsp = self.criterion(nsp, nsp_target)  
  
        loss = loss_token + loss_nsp  
        average_nsp_loss += loss_nsp  
        average_mlm_loss += loss_token  
  
        loss.backward()  
        self.optimizer.step()  
  
        if index % self._print_every == 0:  
            elapsed = time.gmtime(time.time() - prev)  
            s = self.training_summary(elapsed, index, average_nsp_loss, average_mlm_loss)  
  
            if index % self._accuracy_every == 0:  
                s += self.accuracy_summary(index, token, nsp, token_target, nsp_target)  
  
            print(s)  
  
            average_nsp_loss = 0
            average_mlm_loss = 0
    return loss

inp, mask, inverse_token_mask, token_target, nsp_target = value  
self.optimizer.zero_grad()

token, nsp = self.model(inp, mask)

tm = inverse_token_mask.unsqueeze(-1).expand_as(token)  
token = token.masked_fill(tm, 0)

loss_token = self.ml_criterion(token.transpose(1, 2), token_target)
loss_nsp = self.criterion(nsp, nsp_target)  
  
loss = loss_token + loss_nsp  
average_nsp_loss += loss_nsp  
average_mlm_loss += loss_token

loss.backward()  
self.optimizer.step()

if index % self._accuracy_every == 0:  
    s += self.accuracy_summary(index, token, nsp, token_target, nsp_target)

nsp_acc = nsp_accuracy(nsp, nsp_target)  
token_acc = token_accuracy(token, token_target, inverse_token_mask)

def nsp_accuracy(result: torch.Tensor, target: torch.Tensor):
    s = (result.argmax(1) == target.argmax(1)).sum()  
    return round(float(s / result.size(0)), 2)

def token_accuracy(result: torch.Tensor, target: torch.Tensor, inverse_token_mask: torch.Tensor):
    r = result.argmax(-1).masked_select(~inverse_token_mask)  
    t = target.masked_select(~inverse_token_mask)  
    s = (r == t).sum()  
    return round(float(s / (result.size(0) * result.size(1))), 2)
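A quick check of both metrics on toy tensors. `nsp_accuracy` is copied from above. `masked_token_accuracy` is a hypothetical variant, offered only as an alternative: it divides by the number of masked positions, whereas `token_accuracy` above divides by all positions, which keeps its value small when only about 15% of tokens are masked.

```python
import torch

def nsp_accuracy(result, target):
    s = (result.argmax(1) == target.argmax(1)).sum()
    return round(float(s / result.size(0)), 2)

def masked_token_accuracy(result, target, inverse_token_mask):
    # Hypothetical variant: normalise by the number of masked positions.
    selected = ~inverse_token_mask
    r = result.argmax(-1).masked_select(selected)
    t = target.masked_select(selected)
    return round(float((r == t).sum() / max(int(selected.sum()), 1)), 2)

nsp_logits = torch.tensor([[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.4, 0.1]])
nsp_target = torch.tensor([[1., 0.], [0., 1.], [0., 1.], [0., 1.]])
nsp_acc = nsp_accuracy(nsp_logits, nsp_target)

# One sentence, four positions, vocabulary of three tokens.
token_scores = torch.tensor([[[0.1, 0.9, 0.0],
                              [0.8, 0.1, 0.1],
                              [0.2, 0.2, 0.6],
                              [0.3, 0.4, 0.3]]])
token_target = torch.tensor([[1, 0, 1, 0]])
inverse_token_mask = torch.tensor([[False, True, False, True]])
token_acc = masked_token_accuracy(token_scores, token_target, inverse_token_mask)
print(nsp_acc, token_acc)
```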

EMB_SIZE = 64  
HIDDEN_SIZE = 36  
EPOCHS = 4  
BATCH_SIZE = 12  
NUM_HEADS = 4

Prepare dataset
Create vocabulary
100%|██████████| 491161/491161 [00:05<00:00, 93957.36it/s]
Preprocessing dataset
100%|██████████| 50000/50000 [00:35<00:00, 1407.99it/s]

Model Summary

===================================
Device: cuda
Training dataset len: 882322
Max / Optimal sentence len: 27
Vocab size: 71942
Batch size: 12
Batched dataset len: 73526
===================================

Begin epoch 0
00:00:02 | Epoch 1 | 20 / 73526 (0.03%) | NSP loss   0.72 | MLM loss  11.25
00:00:04 | Epoch 1 | 40 / 73526 (0.05%) | NSP loss   0.70 | MLM loss  11.22
00:00:06 | Epoch 1 | 60 / 73526 (0.08%) | NSP loss   0.70 | MLM loss  11.13
00:00:08 | Epoch 1 | 80 / 73526 (0.11%) | NSP loss   0.71 | MLM loss  11.13
00:00:11 | Epoch 1 | 100 / 73526 (0.14%) | NSP loss   0.69 | MLM loss  11.05
00:00:13 | Epoch 1 | 120 / 73526 (0.16%) | NSP loss   0.70 | MLM loss  10.98
00:00:15 | Epoch 1 | 140 / 73526 (0.19%) | NSP loss   0.69 | MLM loss  10.95
00:00:18 | Epoch 1 | 160 / 73526 (0.22%) | NSP loss   0.70 | MLM loss  10.90
00:00:20 | Epoch 1 | 180 / 73526 (0.24%) | NSP loss   0.71 | MLM loss  10.89
00:00:22 | Epoch 1 | 200 / 73526 (0.27%) | NSP loss   0.72 | MLM loss  10.83 | NSP accuracy 0.25 | Token accuracy 0.01

02:20:49 | Epoch 1 | 73440 / 73526 (99.88%) | NSP loss   0.69 | MLM loss   4.49
02:20:52 | Epoch 1 | 73460 / 73526 (99.91%) | NSP loss   0.69 | MLM loss   4.37
02:20:54 | Epoch 1 | 73480 / 73526 (99.94%) | NSP loss   0.69 | MLM loss   4.24
02:20:56 | Epoch 1 | 73500 / 73526 (99.96%) | NSP loss   0.69 | MLM loss   4.38
02:20:59 | Epoch 1 | 73520 / 73526 (99.99%) | NSP loss   0.70 | MLM loss   4.37

01:03:01 | Epoch 1 | 32880 / 73526 (44.72%) | NSP loss   0.69 | MLM loss   4.78

02:20:59 | Epoch 1 | 73520 / 73526 (99.99%) | NSP loss   0.70 | MLM loss   4.37
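Some of the numbers in the log can be sanity-checked with simple arithmetic. The sketch below assumes a model that predicts uniformly over the vocabulary and guesses NSP at chance; it uses only the vocabulary size, dataset length, and batch size printed in the model summary.

```python
import math

vocab_size = 71942
dataset_len = 882322
batch_size = 12

# Cross-entropy of a uniform guess over the whole vocabulary.
uniform_mlm_loss = math.log(vocab_size)
# Binary cross-entropy of a 50/50 guess.
chance_nsp_loss = math.log(2)
# Number of full batches per epoch.
full_batches = dataset_len // batch_size

print(round(uniform_mlm_loss, 2), round(chance_nsp_loss, 2), full_batches)
```

The first logged MLM loss (11.25) is close to the uniform baseline, and the drop to about 4.4 by the end of the epoch shows the token head is learning. The NSP loss stays near ln 2 throughout the logged epoch, which suggests the NSP head has not yet moved away from chance in this run.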


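Introduction

Next Sentence Prediction (NSP)

Masked Language Model (MLM)

Building BERT

Preparing the dataset

Splitting the dataset into sentences and filling the vocabulary

Creating the training dataset

Step 1. Masking the sentences

Step 2. Preprocessing: [CLS] and [PAD]

Step 3. Converting the words in a sentence to integer tokens

NSP target

MLM target

Building the PyTorch model

Joint embedding (JointEmbedding)

Attention head

Multi-head attention

Encoder

BERT

Training the model

Training results and summary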
