Spanish Pre-Trained BERT Model and Evaluation Data

Jan 1, 2020

José Cañete

Gabriel Chaperon

Rodrigo Fuentes

Jou-Hui Ho

Hojin Kang

Jorge Pérez

Abstract

Spanish is one of the five most spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper, we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we compile several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results than other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state of the art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

Publication

Practical ML for Developing Countries Workshop at ICLR 2020, Addis Ababa, Ethiopia

Last updated on Jan 1, 2020

Authors

José Cañete

Expert Machine Learning Engineer | MSc. in Computer Science

My research interests include Artificial Intelligence and how to handle and optimize these systems for production environments.
