RobBERT is an artificial intelligence model that has examined hundreds of millions of Dutch sentences, enabling it to excel at a wide variety of Dutch language tasks. You can show RobBERT a new dataset containing Dutch sentences and have it perform a particular task on them. For example, given a dataset of positive and negative book reviews, it can predict with 94% accuracy whether an unseen book review is positive or negative. Given sentences where the Dutch word "die" or "dat" is missing, it predicts the correct word in more than 98% of cases.
RobBERT is, however, not limited to these tasks: it can be used for many different kinds of language-based problems, such as comparing sentences, labeling words in sentences, and classifying texts. It is currently the state-of-the-art model for most Dutch language tasks, and it has been successfully used by many researchers and practitioners to achieve state-of-the-art performance on a wide range of Dutch natural language processing tasks, including:
- Emotion detection
- Sentiment analysis (book reviews, news articles*)
- Coreference resolution
- Named entity recognition (CoNLL, job titles*, SoNaR)
- Part-of-speech tagging (Small UD Lassy, CGN)
- Zero-shot word prediction
- Humor detection
- Cyberbullying detection
- Correcting dt-spelling mistakes*
It has also achieved outstanding, near-state-of-the-art results on several other tasks.
* Note that several of these evaluations used RobBERT-v1; the improved second version, RobBERT-v2, outperforms this first model on everything we tested.
(Also note that this list is not exhaustive. If you have used RobBERT for your application, we would love to hear about it! Send us an email and we'll add your application to the list!)
More in-depth information about RobBERT can be found in our blog post, our paper and on our GitHub repository.
Technical explanation
RobBERT is the state-of-the-art Dutch BERT language model. More specifically, it is the RoBERTa-based Dutch language model that achieves state-of-the-art results on many different Dutch language tasks. This transformer model was trained on the Dutch portion of the OSCAR dataset using Facebook AI's RoBERTa framework, which is an improved version of Google's BERT model.
How to use
RobBERT uses the RoBERTa architecture and pre-training regime, but with a Dutch tokenizer and Dutch training data. RoBERTa is the robustly optimized English BERT model, making it even more powerful than the original BERT model. Since it shares this architecture, RobBERT can easily be fine-tuned and used for inference with code written for RoBERTa models, as well as most code written for BERT models, e.g., as provided by the HuggingFace Transformers library.
Using this model in your Python code is ridiculously easy! Just add the 🤗 Transformers library as a dependency and load the model using the following lines of code:
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")
```
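As a quick check that everything loads correctly, you can continue from the snippet above with a minimal sketch like the one below; the example sentence is our own. Keep in mind that the sequence classification head is freshly initialized at this point, so its outputs only become meaningful after fine-tuning on a labelled dataset.

```python
import torch

# Tokenize an example Dutch sentence and run a forward pass.
inputs = tokenizer("Dit boek was echt fantastisch!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The classification head is newly initialized, so these logits are only
# meaningful after fine-tuning on a labelled dataset.
print(logits)
```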
By default, RobBERT comes with the masked language modeling head used during pre-training, which can be used as a zero-shot way to fill masks in sentences. You can try this out for free on RobBERT's hosted inference API on Hugging Face.
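Beyond the hosted demo, you can run the same zero-shot mask filling locally with the fill-mask pipeline. This is a minimal sketch: the example sentence is our own, and the exact predictions and scores may differ.

```python
from transformers import pipeline

# Zero-shot mask filling with RobBERT's pretrained masked language model head.
# RoBERTa-style models use "<mask>" as the mask token.
unmasker = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

for prediction in unmasker("Er staat een boom in <mask> tuin."):
    print(prediction["token_str"], round(prediction["score"], 3))
```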
You can also create a new prediction head for your own task by using any of HuggingFace's BERT and RoBERTa runners, their fine-tuning notebooks, or any RoBERTa-specific classes, simply by changing the model name to pdelobelle/robbert-v2-dutch-base.
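For instance, fine-tuning RobBERT for sentiment classification could look roughly like the sketch below. The tiny inline dataset, label mapping, and hyperparameters are placeholders for illustration only, not the settings used in the paper.

```python
import torch
from transformers import (RobertaForSequenceClassification, RobertaTokenizer,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained(
    "pdelobelle/robbert-v2-dutch-base", num_labels=2
)

# Placeholder data: replace with your own labelled Dutch texts.
texts = ["Wat een prachtig boek!", "Dit was echt een teleurstelling."]
labels = [1, 0]  # 1 = positive, 0 = negative


class ReviewDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels in the format the Trainer expects."""

    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="robbert-v2-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=8,
    ),
    train_dataset=ReviewDataset(texts, labels),
)
trainer.train()
```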
Paper & blog post
For more information, you can also read our EMNLP Findings paper. Pieter Delobelle has also thoroughly described the internals of the RobBERT model in an excellent blog post on his website.