Looking beyond Wales
Welcome back to the long-awaited second part of our blog series, which sets out to underline the importance of respecting culture in the context of Artificial Intelligence, or AI. If you missed the first blog, go here to read it before reading this one – we hope this blog will then make a lot more sense!
Bias was the main theme of the last blog. As mentioned, bias is problematic. It can lead to cultural flattening [1]. This may sound abstract. But it could mean that future generations will come across their own language online in a form that is somehow different: a form that has been Anglicized or Americanized. This is the danger when language and culture are subtly reshaped by systems trained outside of their heartlands.
Obviously, considering the number of languages in the world (7,159 according to Ethnologue [2]!), the Welsh language is not alone in feeling AI’s impact on its culture. So, over the course of the next few blogs we are going to look beyond Wales to see what we can learn from other cultures’ experience with AI, before discussing international efforts to ensure, or at least move towards, culturally suitable AI.
International Research Themes
A number of themes arise when searching the international literature. As you would perhaps expect by now, bias is a clear theme that comes up time and time again. It is discussed in the context of many cultures. Some are smaller cultures such as that of the Māori people [3], others are larger cultures such as that of the Korean people [4]. Bias is also discussed in the context of communities that are not based on language or geography. For example, De Meulder, when discussing what AI could offer deaf people, states:
“While AI tools promise innovation, they also perpetuate biases, reinforce technoableism and deepen inequalities through systemic and design flaws.” [5]
Bias can appear in many, sometimes surprising, forms. For example, American gun culture can appear within a text about Wales (see the previous blog in this series), and Western drinking culture can seep into AI output targeting Arab or Muslim contexts [6]!
Conversely, it is possible to use AI to tackle bias and prejudice. Consider Wâsikan Kisewâtisiwin, for example, a tool created to help non-Indigenous people avoid bias and prejudice when writing about Indigenous people [7].
Of course, in order to detect whether bias is present in LLM output, it is necessary to test that output. A clear thread that runs through many of the papers discussed here is the idea that you can’t really fix cultural bias in LLMs unless you first learn how to see it properly, that is, how to evaluate a model’s cultural common sense [8] – and for many authors, this starts with building better benchmarks.
According to Mitchell et al. [9], most current evaluations are still too English-centric. This means that social stereotypes specific to other cultures are missed, and so the bias goes undetected.
Kim et al. [4] seem to agree, and state that the answer is not to use simple benchmarks or to translate English benchmarks into a smaller language. This makes sense – after all, how do we reflect the richness of culture in simple benchmarks or (mostly mechanical) translations of major English language benchmarks?
Both Mitchell et al. [9] and Kim et al. [4] set out to create better, more suitable benchmarks, based on specific languages and cultural contexts. By doing so they show how bias can emerge in places that standard tests never show.
The same motivation drives Naous et al. [6], who present culture-specific evaluations for Arab and Islamic contexts. Their paper shows that even models that perform well on standard benchmarks can produce responses that feel culturally inappropriate or insensitive when examined through a more localized lens.
It is important to note that language and culture are not synonymous. So, including minority, indigenous or non-Western texts in AI training materials is not necessarily the same as including texts relevant to the culture in question!
In short, the international literature tells us two things: bias is widespread, and we are still learning how to measure it properly.
But if bias is difficult to detect and evaluate, the next question is whether the problem is getting worse over time. Some researchers suggest that it is — dramatically in fact. How so? Tune in next time to read about the Doom Spiral!! Bye for now!
Bibliography
[1] Yu, H., S. Jeong, S. Pawar, J. Shin, J. Jin, J. Myung, A. Oh and I. Augenstein (2026). Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models.
[2] Ethnologue. (2026). “How Many Languages Are There In The World?” Retrieved 03/01/26, from https://www.ethnologue.com/insights/how-many-languages/.
[3] Duncan, S., G. Leoni, L. Steven, K. Mahelona and P.-L. Jones (2024). Fit for our purpose, not yours: Benchmark for a low-resource, Indigenous language. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[4] Kim, E., J. Suk, P. Oh, H. Yoo, J. Thorne and A. Oh (2024). CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean, Torino, Italia, ELRA and ICCL.
[5] De Meulder, M. (2026). “Deaf in AI: AI language technologies and the erosion of linguistic rights.” Language and Law / Linguagem e Direito 12(1).
[6] Naous, T., M. J. Ryan, A. Ritter and W. Xu (2024). Having Beer after Prayer? Measuring Cultural Bias in Large Language Models.
[7] Wâsikan Kisewâtisiwin. (2026). “AI With Heart Indigenous powered AI.” Retrieved 23/01/26, from https://www.wasikankisewatisiwin.ca/.
[8] Myung, J., N. Lee, Y. Zhou, J. Jin, R. A. Putri, D. Antypas, H. Borkakoty, E. Kim, C. Perez-Almendros, A. A. Ayele, V. Gutiérrez-Basulto, Y. Ibáñez-García, H. Lee, S. H. Muhammad, K. Park, A. S. Rzayev, N. White, S. M. Yimam, M. T. Pilehvar, N. Ousidhoum, J. Camacho-Collados and A. Oh (2025). BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages.
[9] Mitchell, M., G. Attanasio, I. Baldini, M. Clinciu, J. Clive, P. Delobelle, M. Dey, S. Hamilton, T. Dill, J. Doughman, R. Dutt, A. Ghosh, J. Z. Forde, C. Holtermann, L.-A. Kaffee, T. Laud, A. Lauscher, R. L. Lopez-Davila, M. Masoud, N. Nangia, A. Ovalle, G. Pistilli, D. Radev, B. Savoldi, V. Raheja, J. Qin, E. Ploeger, A. Subramonian, K. Dhole, K. Sun, A. Djanibekov, J. Mansurov, K. Yin, E. V. Cueva, S. Mukherjee, J. Huang, X. Shen, J. Gala, H. Al-Ali, T. Djanibekov, N. Mukhituly, S. Nie, S. Sharma, K. Stanczak, E. Szczechla, T. Timponi Torrent, D. Tunuguntla, M. Viridiano, O. Van Der Wal, A. Yakefu, A. Névéol, M. Zhang, S. Zink and Z. Talat (2025). SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models, Albuquerque, New Mexico, Association for Computational Linguistics.
