Khmer News Classification in Low-Resource Settings: A Comparative Analysis of Embedding Methods

Keywords: text classification, low-resource languages, Khmer language, news classification, contextual embeddings

Abstract

Text classification in low-resource languages like Khmer remains challenging due to linguistic complexity, limited annotated data, and noise from real-world applications. This study addresses these challenges by systematically comparing text embedding techniques for Khmer news classification. We evaluate traditional methods (TF-IDF with SVM) against state-of-the-art multilingual transformers (XLM-RoBERTa, LaBSE) using a self-collected dataset of 7,344 Khmer news articles across six categories—political, economic, entertainment, sport, technology, and life. The dataset intentionally retains noise (e.g., mixed-language text, unstructured formatting) to reflect practical scenarios. To address Khmer's lack of word boundaries, we employ word segmentation via khmer-nltk for traditional models, while transformer models leverage their inherent subword tokenization. Experiments reveal that transformer-based embeddings achieve superior performance, with XLM-RoBERTa and LaBSE attaining F1 scores of 94.2% and 94.3%, respectively, outperforming TF-IDF (93.3%). However, the "life" category proves challenging across all models (F1: 85.5–88.1%), likely due to semantic overlap with other categories. Our findings underscore the effectiveness of transformer architectures in capturing contextual nuances for low-resource languages, even with noisy data. This work offers insights for NLP researchers and practitioners, emphasizing the need for domain-specific adaptations and expanded datasets to improve performance in underrepresented languages.
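For readers who want to reproduce the comparison, the sketch below (Python) outlines the two routes described above. It is a minimal illustration under assumptions that go beyond the abstract: load_khmer_news() is a hypothetical loader, and the 80/20 split, weighted-F1 averaging, and frozen LaBSE embeddings with a linear-SVM head are illustrative simplifications rather than the authors' exact configuration (the paper's transformer models may be fine-tuned end to end). Only the khmer-nltk word_tokenize call and the "sentence-transformers/LaBSE" checkpoint identifier come from public releases of those libraries.

    # Minimal sketch of the two embedding routes compared in this paper.
    # Assumptions (not from the paper): load_khmer_news() is a hypothetical
    # loader; the split, the SVM head over frozen LaBSE vectors, and the
    # weighted-F1 averaging are illustrative choices, not the exact setup.
    from khmernltk import word_tokenize  # khmer-nltk word segmenter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sentence_transformers import SentenceTransformer

    texts, labels = load_khmer_news()  # hypothetical: 7,344 articles, 6 labels
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    # Route 1: Khmer is written without spaces between words, so segment
    # first with khmer-nltk, then apply TF-IDF features and a linear SVM.
    def segment(text):
        return " ".join(word_tokenize(text, return_tokens=True))

    baseline = make_pipeline(TfidfVectorizer(preprocessor=segment), LinearSVC())
    baseline.fit(X_tr, y_tr)
    print("TF-IDF + SVM F1:",
          f1_score(y_te, baseline.predict(X_te), average="weighted"))

    # Route 2: a multilingual transformer's subword tokenizer handles
    # unsegmented Khmer directly; here frozen LaBSE sentence embeddings
    # feed the same linear classifier head.
    encoder = SentenceTransformer("sentence-transformers/LaBSE")
    clf = LinearSVC().fit(encoder.encode(X_tr), y_tr)
    print("LaBSE F1:",
          f1_score(y_te, clf.predict(encoder.encode(X_te)), average="weighted"))

The asymmetry is the design point worth noting: the TF-IDF route degrades without explicit word segmentation because Khmer text carries no word boundaries, whereas the subword tokenizers of XLM-RoBERTa and LaBSE consume raw Khmer directly.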

Author Biographies

Natt Korat, Cambodia Academy of Digital Technology

Korat Natt received his B.A. in Psychology from the Royal University of Phnom Penh in 2022 and his B.S. in Computer Science from AGA Institute in 2023. He is currently pursuing his M.S. in Computer Science and serves as a lecturer and research assistant at the Cambodia Academy of Digital Technology (CADT). His research interests include computer vision, machine learning, natural language processing, and large language models.
Email: natt.korat@cadt.edu.kh
Email: natt.korat@cadt.edu.kh

Sopagna Heang, Cambodia Academy of Digital Technology

Heang Sopagna received his Engineer’s degree in Computer Science from the Institute of Technology of Cambodia (ITC) in 2023. He is currently pursuing his M.S. in Computer Science and serves as a lecturer and research assistant at the Cambodia Academy of Digital Technology (CADT). His research interests include machine learning, natural language processing, and large language models.
Email: sopagna.heang@cadt.edu.kh

Vathna Lay, Cambodia Academy of Digital Technology

Lay Vathna received his Engineer’s degree in Computer Science from the Institute of Technology of Cambodia (ITC) and a Postgraduate Diploma in Mobile Computing from the Centre for Development of Advanced Computing (CDAC), India. He also holds a Master’s degree in ICT. He currently serves as a senior lecturer and researcher at the Cambodia Academy of Digital Technology (CADT), where he has taught since 2016. His research interests include machine learning, natural language processing, large language models, smart cities, and cybersecurity.
Email: vathna.lay@cadt.edu.kh

References

Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, and L. He, “A survey on text classification: From traditional to deep learning,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, 2022.

Z. Wan, “Text classification: A perspective of deep learning methods,” 2023.

K. Taha, P. D. Yoo, C. Yeun, D. Homouz, and A. Taha, “A comprehensive survey of text classification techniques and their research applications: Observational and experimental insights,” Computer Science Review, vol. 54, p. 100664, 2024.

M. Das, S. K., and P. J. A. Alphonse, “A comparative study on TF-IDF feature weighting method and its analysis using unstructured dataset,” 2023.

M. Hearst, S. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.

N. S. Wardana, F. P. Aditiawan, and A. P. Sari, “Logistic regression classification with TF-IDF and FastText for sentiment analysis of LinkedIn reviews,” VISA: Journal of Vision and Ideas, vol. 4, p. 1359, Aug. 2024.

A. Magueresse, V. Carles, and E. Heetderks, “Low-resource languages: A review of past work and future challenges,” 2020.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2019.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” in OpenAI Blog, 2018.

C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune BERT for text classification?,” arXiv preprint arXiv:1905.05583, 2020.

R. Buoy, N. Taing, and S. Chenda, “Khmer text classification using word embedding and neural network,” arXiv preprint, 2021.

L. Rokach and O. Maimon, “Decision trees,” The Data Mining and Knowledge Discovery Handbook, vol. 6, pp. 165–192, 2005.

F. Colas and P. Brazdil, “Comparison of svm and some older classification algorithms in text classification tasks,” in Artificial Intelligence in Theory and Practice, (Boston, MA), pp. 169–178, Springer US, 2006.

Shaina, K. Kaan, N. Noman, and M. Olufemi, “A comparison among machine learning and deep learning approaches for text classification,” PsyArXiv, 2023.

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

R. M. Schmidt, “Recurrent neural networks (RNNs): A gentle introduction and overview,” arXiv preprint arXiv:1912.05911, 2019.

F. Wei, R. Keeling, N. Huber-Fliflet, J. Zhang, A. Dabrowski, J. Yang, Q. Mao, and H. Qin, “Empirical study of LLM fine-tuning for text classification in legal document review,” in 2023 IEEE International Conference on Big Data (BigData), pp. 2786–2792, 2023.

Z. Wang, T. Wang, D. Mekala, and J. Shang, “A benchmark on extremely weakly supervised text classification: Reconcile seed matching and prompting approaches,” arXiv preprint arXiv:2305.12749, 2023.

M. Reusens, A. Stevens, J. Tonglet, J. D. Smedt, W. Verbeke, S. vanden Broucke, and B. Baesens, “Evaluating text classification: A benchmark study,” Expert Systems with Applications, vol. 254, p. 124302, 2024.

K. Soky, S. Li, C. Chu, and T. Kawahara, “Finetuning pretrained model with embedding of domain and language information for ASR of very low-resource settings,” International Journal of Asian Language Processing, vol. 33, no. 04, p. 2350024, 2023.

V. Chea, Y. K. Thu, C. Ding, M. Utiyama, A. M. Finch, and E. Sumita, “Khmer word segmentation using conditional random fields,” in Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (PACLIC 29), 2015.

R. Phann, C. Soomlek, and P. Seresangtakul, “Multi-class text classification on Khmer news articles using deep learning models with optimal hyperparameters,” ICIC International, vol. 18, no. 6, pp. 241–550, 2024.

R. Phann, C. Soomlek, and P. Seresangtakul, “Multi-class text classification on Khmer news using ensemble method in machine learning algorithms,” Acta Informatica Pragensia, vol. 2023, no. 2, pp. 243–259, 2023.

J. Martineau and T. Finin, “Delta TFIDF: An improved feature space for sentiment analysis,” in Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM 2009), pp. 258–261, AAAI Press, May 2009.

A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2020.

F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, “Language-agnostic BERT sentence embedding,” arXiv preprint arXiv:2007.01852, 2022.

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.

G. Lample and A. Conneau, “Cross-lingual language model pretraining,” arXiv preprint arXiv:1901.07291, 2019.

M. Guo, Q. Shen, Y. Yang, H. Ge, D. Cer, G. Hernandez Abrego, K. Stevens, N. Constant, Y.-H. Sung, B. Strope, and R. Kurzweil, “Effective parallel corpus mining using bilingual sentence embeddings,” in Proceedings of the Third Conference on Machine Translation: Research Papers, (Brussels, Belgium), pp. 165–176, Association for Computational Linguistics, October 2018.

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2019.

Published
2025-05-26