Proceedings of the International Conference on Smart Systems and Social Management (ICSSSM-2 2025)

Why Multilingual Transformers Fail for Khasi: A Linguistic Analysis of Low-Resource Austroasiatic AI Gaps

Authors
Badal Nyalang¹,*
¹MWire Labs, Shillong, Meghalaya, India
*Corresponding author. Email: nyalang@mwirelabs.com
Available Online 31 December 2025.
DOI
10.2991/978-2-38476-533-1_6
Keywords
Computational Linguistics; Linguistic Typology; Low-Resource NLP; Model Failure Analysis; Khasi Language; Digital Divide; Social Management
Abstract

Multilingual language models such as mBERT and XLM-R have revolutionized natural language processing by extending coverage to roughly one hundred languages. Yet for typologically distinct, low-resource languages like Khasi, these models often fail to capture even basic morphosyntactic structure. In this paper, we present a systematic analysis of transformer failures on Khasi, an Austroasiatic language spoken in Northeast India. Using three diagnostic tasks (tokenization fragmentation, masked language modeling, and few-shot part-of-speech tagging), we show that multilingual models collapse to trivial baselines and yield unreliable predictions. We argue that this failure is rooted in tokenization bias and typological divergence, and that it makes few-shot evaluations misleading. Beyond technical diagnosis, our findings underscore the risks of deploying multilingual AI for digital governance, education, and social management in underrepresented regions. This case study of Khasi highlights the urgent need for inclusive AI approaches to prevent digital exclusion and ensure equitable access to smart technologies.
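
As a rough illustration of what a tokenization fragmentation diagnostic can measure, the sketch below computes "fertility", the average number of subword pieces per whitespace-separated word. It is not the procedure used in the paper, which this page does not describe. The greedy longest-match segmenter only mimics WordPiece-style tokenization (the scheme mBERT uses), and `TOY_VOCAB`, `wordpiece_tokenize`, `fertility`, and the sample strings are hypothetical names and data introduced for this example.

```python
from typing import List, Set

# Hypothetical toy vocabulary: three whole words are covered, while the
# fourth word can only be assembled from short fragments.
TOY_VOCAB: Set[str] = {"the", "model", "works", "kh", "##u", "##b", "##le", "##i"}


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first segmentation in the style of WordPiece."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def fertility(sentence: str, vocab: Set[str]) -> float:
    """Average number of subword pieces per whitespace-separated word."""
    words = sentence.split()
    if not words:
        return 0.0
    total = sum(len(wordpiece_tokenize(w, vocab)) for w in words)
    return total / len(words)


if __name__ == "__main__":
    print(wordpiece_tokenize("khublei", TOY_VOCAB))
    print(fertility("the model works", TOY_VOCAB))
    print(fertility("the khublei", TOY_VOCAB))
```

A fertility near 1 means most words survive as single tokens. Values well above 1 indicate that words are being split into many short pieces, which is the kind of fragmentation the abstract associates with tokenization bias. With a real model, the toy segmenter would be replaced by that model's own tokenizer.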

Copyright
© 2025 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the International Conference on Smart Systems and Social Management (ICSSSM-2 2025)
Series
Advances in Social Science, Education and Humanities Research
Publication Date
31 December 2025
ISBN
978-2-38476-533-1
ISSN
2352-5398

Cite this article

TY  - CONF
AU  - Badal Nyalang
PY  - 2025
DA  - 2025/12/31
TI  - Why Multilingual Transformers Fail for Khasi: A Linguistic Analysis of Low-Resource Austroasiatic AI Gaps
BT  - Proceedings of the International Conference on Smart Systems and Social Management (ICSSSM-2 2025)
PB  - Atlantis Press
SP  - 62
EP  - 68
SN  - 2352-5398
UR  - https://doi.org/10.2991/978-2-38476-533-1_6
DO  - 10.2991/978-2-38476-533-1_6
ID  - Nyalang2025
ER  -