Spanish Pre-Trained BERT Model and Evaluation Data

Jan 1, 2020 · José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez
Abstract
The Spanish language is one of the top five most widely spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model we obtain better results than other BERT-based models pre-trained on multilingual corpora on most of the tasks, and we even achieve a new state of the art on some of them. We have publicly released our model, the pre-training data, and the compilation of Spanish benchmarks.
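As a concrete illustration of the fine-tuning setup described above, below is a minimal sketch using the Hugging Face transformers library to adapt a Spanish BERT checkpoint to a sentence-classification task. The checkpoint identifier, the toy data, and the hyperparameters are assumptions for illustration; they are not the authors' exact configuration.

# Minimal fine-tuning sketch (not the authors' exact setup).
# The checkpoint name below is an assumed identifier for the released
# Spanish BERT model; replace it with the checkpoint and task you use.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy labeled sentences standing in for a real Spanish benchmark task.
texts = ["La película fue excelente.", "El servicio fue muy malo."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few illustrative gradient steps
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()

In practice one would iterate over a full training set with a DataLoader and evaluate on the benchmark's held-out split, but the loop above captures the basic fine-tuning recipe the abstract refers to.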
Type: Publication
Practical ML for Developing Countries Workshop at ICLR 2020, Addis Ababa, Ethiopia
Authors
José Cañete
Expert Machine Learning Engineer | MSc. in Computer Science. My research interests include Artificial Intelligence and how to deploy and optimize such systems for production environments.