Why Multilingual Transformers Fail for Khasi: A Linguistic Analysis of Low-Resource Austroasiatic AI Gaps
- DOI
- 10.2991/978-2-38476-533-1_6
- Keywords
- Computational Linguistics; Linguistic Typology; Low-Resource NLP; Model Failure Analysis; Khasi Language; Digital Divide; Social Management
- Abstract
Multilingual language models such as mBERT and XLM-R have reshaped natural language processing by extending coverage to around one hundred languages. Yet for typologically distinct, low-resource languages like Khasi, these models often fail to capture even basic morphosyntactic structure. In this paper, we present a systematic analysis of transformer failures on Khasi, an Austroasiatic language spoken in Northeast India. Through diagnostic tasks including tokenization fragmentation, masked language modeling, and few-shot part-of-speech tagging, we show that multilingual models collapse to trivial baselines and yield unreliable predictions. We argue that this failure is rooted in tokenization bias and typological divergence, which in turn make few-shot evaluations misleading. Beyond technical diagnosis, our findings underscore the risks of deploying multilingual AI for digital governance, education, and social management in underrepresented regions. This case study of Khasi highlights the urgent need for inclusive AI approaches to prevent digital exclusion and ensure equitable access to smart technologies.
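The tokenization-fragmentation diagnostic named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the paper's measurement procedure, so this is an assumption: it uses "fertility" (average subword pieces per word), a common fragmentation measure, with a greedy longest-match segmenter in the style of WordPiece. The function names `wordpiece_split` and `fertility` and the vocabulary `TOY_VOCAB` are hypothetical and chosen only for illustration; a real analysis would load the actual mBERT or XLM-R vocabulary.

```python
from typing import Iterable, List, Set


def wordpiece_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first segmentation (WordPiece-style).

    Non-initial pieces carry a "##" prefix. If any position cannot be
    matched, the whole word maps to a single unknown token.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def fertility(words: Iterable[str], vocab: Set[str]) -> float:
    """Average number of subword pieces per word (higher = more fragmented)."""
    words = list(words)
    if not words:
        raise ValueError("fertility is undefined for an empty word list")
    return sum(len(wordpiece_split(w, vocab)) for w in words) / len(words)


# Hypothetical toy vocabulary (NOT the real mBERT vocabulary): it holds whole
# English words but only short fragments that happen to cover a Khasi word.
TOY_VOCAB = {"the", "model", "##s", "kh", "##u", "##b", "##lei"}

english_pieces = wordpiece_split("models", TOY_VOCAB)
khasi_pieces = wordpiece_split("khublei", TOY_VOCAB)
english_fertility = fertility(["the", "model"], TOY_VOCAB)
khasi_fertility = fertility(["khublei"], TOY_VOCAB)
```

In this toy setting the English words stay whole while the Khasi word is split into four short fragments. That is the kind of gap a fragmentation diagnostic is meant to expose, although the numbers here come from the invented vocabulary and say nothing about the paper's results.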
- Copyright
- © 2025 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Badal Nyalang
PY  - 2025
DA  - 2025/12/31
TI  - Why Multilingual Transformers Fail for Khasi: A Linguistic Analysis of Low-Resource Austroasiatic AI Gaps
BT  - Proceedings of the International Conference on Smart Systems and Social Management (ICSSSM-2 2025)
PB  - Atlantis Press
SP  - 62
EP  - 68
SN  - 2352-5398
UR  - https://doi.org/10.2991/978-2-38476-533-1_6
DO  - 10.2991/978-2-38476-533-1_6
ID  - Nyalang2025
ER  -