RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use
- Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt
- Publication Date: 2022-11
- Publication Venue: Pre-print
- Abstract: Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base-model for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequent tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.
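The tokenizer update described in the abstract can be illustrated with a toy sketch. This is not the authors' procedure: the function name `find_new_frequent_tokens`, the `min_count` threshold, the whitespace splitting, and the sample sentences are all assumptions made for illustration, and a real subword tokenizer would operate on subword units rather than whole words.

```python
from collections import Counter


def find_new_frequent_tokens(corpus_lines, old_vocab, min_count=2):
    """Return words that are frequent in the new corpus but missing from the old vocabulary.

    Hypothetical helper: splits on whitespace and lowercases, which is a
    simplification of what a subword tokenizer would do.
    """
    counts = Counter(
        word
        for line in corpus_lines
        for word in line.lower().split()
    )
    candidates = (
        word
        for word, count in counts.items()
        if count >= min_count and word not in old_vocab
    )
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda word: (-counts[word], word))


# Toy data, invented for illustration only.
old_vocab = {"de", "het", "een", "is", "virus", "nieuw"}
new_corpus = [
    "het coronavirus is een virus",
    "de lockdown is nieuw",
    "coronavirus lockdown coronavirus",
]

new_tokens = find_new_frequent_tokens(new_corpus, old_vocab)
print(new_tokens)
```

In this sketch, the returned words would be the candidates to append to the existing vocabulary before further pre-training, so that tokens already known to the old model keep their meaning.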
Citation
APA
Delobelle, P., Winters, T., & Berendt, B. (2022). RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use. arXiv Preprint arXiv:2211.08192.
Harvard
Delobelle, P., Winters, T. and Berendt, B. (2022) “RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use,” arXiv preprint arXiv:2211.08192 [Preprint].
Vancouver
1. Delobelle P, Winters T, Berendt B. RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use. arXiv preprint arXiv:2211.08192. 2022.
Related project
RobBERT
State-of-the-art Dutch language model