Comparing and Mitigating Bias and Toxicity Across Languages
Date:
Large language models (LLMs) are used by vast numbers of speakers around the world and show remarkable performance in many non-English languages. However, they often receive safety fine-tuning only in English, if at all, and their performance is known to be inconsistent across languages. There is therefore a need to investigate the extent to which LLMs exhibit harmful biases and toxic behaviors across languages, and how such behaviors can best be reduced. In this talk I will discuss my work showing that the stereotypical bias exhibited by LLMs differs significantly depending on the language in which they are prompted. Furthermore, we show that mitigation of these stereotypical biases and toxic behaviors performed in English transfers to other languages, though often at the cost of degraded language generation ability in those non-English languages.