Maltese CrowS-Pairs dataset


posted on 2024-06-19, 07:41 authored by Claudia Borg, Marthese Borg

Warning: This dataset contains explicit statements of offensive stereotypes which may be upsetting.

The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus of contrasting sentence pairs: one sentence that presents a stereotype concerning a disadvantaged group, and another, minimally changed sentence concerning a matching advantaged group.

In total, we produced 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models.
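As an illustration of how contrasting pairs like these are typically used, the sketch below computes the share of pairs in which a scoring function rates the stereotyped sentence higher than its minimally changed counterpart. It is a minimal sketch and not the authors' evaluation code: the `SentencePair` type, the `stereotype_preference_rate` function and the toy scorer are hypothetical names introduced here, and in practice the scorer would be some sentence-level likelihood estimate taken from a masked language model. Under this setup, a rate close to 50% would suggest no systematic preference for either sentence.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SentencePair:
    """Hypothetical container for one contrasting pair (not the dataset's schema)."""
    stereotyped: str  # sentence expressing the stereotype
    contrast: str     # minimally changed sentence about the matching group


def stereotype_preference_rate(
    pairs: Iterable[SentencePair],
    score: Callable[[str], float],
) -> float:
    """Percentage of pairs where `score` is strictly higher for the stereotyped sentence."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no sentence pairs given")
    preferred = sum(1 for p in pairs if score(p.stereotyped) > score(p.contrast))
    return 100.0 * preferred / len(pairs)


if __name__ == "__main__":
    # Placeholder strings and a toy scorer (shorter string = higher score);
    # a real scorer would come from a language model.
    toy_pairs = [SentencePair("aa", "aaa"), SentencePair("bbbb", "bb")]
    print(stereotype_preference_rate(toy_pairs, lambda s: -len(s)))  # prints 50.0
```

Passing the scorer as a plain function keeps the aggregation independent of any particular model or library, so the same summary can be applied to monolingual and multilingual models alike.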

This file contains the sentence pairs localised to the Maltese context in the Maltese language.

Other languages are available here: https://gitlab.inria.fr/corpus4ethics/multilingualcrowspairs

The paper describing this work is available here: https://www.um.edu.mt/library/oar/handle/123456789/121722 and in the ACL Anthology: https://aclanthology.org/2024.lrec-main.1545/

If you use this dataset, please cite the following paper:

Karen Fort, Laura Alonso Alemany, Luciana Benotti, Julien Bezançon, Claudia Borg, Marthese Borg, Yongjian Chen, Fanny Ducel, Yoann Dupont, Guido Ivetta, Zhijian Li, Margot Mieskes, Marco Naguib, Yuyan Qian, Matteo Radaelli, Wolfgang S. Schmeisser-Nieto, Emma Raimundo Schulz, Thiziri Saci, Sarah Saidi, et al. 2024. Your Stereotypical Mileage May Vary: Practical Challenges of Evaluating Biases in Multiple Languages and Cultural Contexts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17764–17769, Torino, Italia. ELRA and ICCL.