Automatic Language Identification for Indonesian-Malaysian Language Using Machine Learning

Abdiansah Abdiansah(1*), Muhammad Qurhanul Rizqie(2)

(1) Universitas Sriwijaya
(2) Universitas Sriwijaya
(*) Corresponding Author
DOI: https://doi.org/10.23917/khif.v9i2.21669

Abstract

Language identification (LID) aims to determine which language a given text or speech sample is in. The task tends to be easier for languages with clearly different characteristics (e.g., Indonesian and English) than for languages with similar characteristics (e.g., Indonesian and Malaysian). Similar languages introduce ambiguity that can bias a machine learning model. This research used a Support Vector Machine (SVM) to identify whether a text is Indonesian or Malaysian. The training and testing data were taken from the Leipzig Corpora Collection and a Twitter dataset. Features were represented with TF-IDF, and Multinomial Naive Bayes served as the baseline. We used two training and testing schemes: a split (20:80) and 10-fold cross-validation. The experimental results show that the baseline and the SVM differ little in accuracy: both reach around 90% or above. These results indicate that Indonesian and Malaysian can be identified with relatively high accuracy even with simple techniques.
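The baseline side of the pipeline described above (TF-IDF features, a Multinomial Naive Bayes classifier, and k-fold cross-validation) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the eight sentences are invented toy examples rather than Leipzig or Twitter data, the function names (`fit_idf`, `tfidf`, `train_mnb`, `cross_validate`, and so on) are hypothetical, whitespace tokenization and a smoothed IDF are assumed, and the SVM itself is not implemented here. The abstract does not say which toolkit was used.

```python
import math
from collections import Counter, defaultdict

# Invented toy sentences (NOT the paper's data): four Indonesian ("id"),
# then their four Malaysian ("ms") counterparts in the same order.
DOCS = [
    "saya bisa pergi ke kantor naik mobil",
    "dia tidak bisa datang karena sakit",
    "uang itu ada di kantor polisi",
    "mobil polisi itu rusak karena hujan",
    "saya boleh pergi ke pejabat naik kereta",
    "dia tidak boleh datang kerana sakit",
    "wang itu ada di pejabat polis",
    "kereta polis itu rosak kerana hujan",
]
LABELS = ["id"] * 4 + ["ms"] * 4


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf(doc, idf):
    """Map a document to {token: tf * idf}; tokens unseen in training are dropped."""
    tf = Counter(tokenize(doc))
    return {tok: count * idf[tok] for tok, count in tf.items() if tok in idf}


def train_mnb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    class_counts = Counter(labels)
    weight = defaultdict(lambda: defaultdict(float))
    for vec, y in zip(vectors, labels):
        for tok, w in vec.items():
            weight[y][tok] += w
    model = {}
    for y, n_docs in class_counts.items():
        total = sum(weight[y].values()) + alpha * len(vocab)
        log_like = {tok: math.log((weight[y][tok] + alpha) / total) for tok in vocab}
        model[y] = (math.log(n_docs / len(labels)), log_like)
    return model


def predict(model, vec):
    """Return the label with the highest log prior plus weighted log likelihood."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(w * log_like[tok] for tok, w in vec.items())
    return max(sorted(model), key=score)


def train(docs, labels):
    """Fit the IDF table and the classifier on the same training documents."""
    idf = fit_idf(docs)
    model = train_mnb([tfidf(d, idf) for d in docs], labels, list(idf))
    return idf, model


def cross_validate(docs, labels, k):
    """Mean accuracy over k folds; fold i holds out items i, i + k, i + 2k, ..."""
    accuracies = []
    for fold in range(k):
        held_out = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        idf, model = train(train_docs, train_labels)  # fit on training folds only
        hits = sum(predict(model, tfidf(docs[i], idf)) == labels[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k
```

Calling `cross_validate(DOCS, LABELS, k=4)` holds out one sentence pair per fold; the abstract's setting corresponds to `k=10` on a much larger corpus. The toy sentences are separable by construction, so any score obtained on them says nothing about the accuracies reported in the paper. To compare against an SVM as the abstract does, one would typically swap the classifier for a library implementation (for example scikit-learn's `LinearSVC` on `TfidfVectorizer` output) while keeping the same folds.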

Keywords

Language Identification; Indonesian; Malaysian; Support Vector Machine
