Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs

Published in COLM, 2024

MBBQ (Multilingual Bias Benchmark for Question-answering) is a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, and that there are significant cross-lingual differences in bias behaviour for all except the most accurate models.

Neplenbroek, V., Bisazza, A. and Fernández, R., 2024. MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs. In the First Conference on Language Modeling (COLM), 2024. https://openreview.net/pdf?id=X9yV4lFHt4

Pre-prints


LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Published on arXiv, 2024

We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation with human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A. and Martins, A.F., 2024. LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. arXiv preprint arXiv:2406.18403. https://arxiv.org/pdf/2406.18403

Journal Articles