Performance of Methods in Identifying Similar Languages Based on String to Word Vector

Herry Sujaini



Indonesia has a large number of local languages that share cognate words, and some of these languages closely resemble one another. Automatic identification within a family of similar languages is difficult, so it is necessary to determine which language identification method performs best at the task. This study identifies Indonesian local languages using the String to Word Vector approach. A string vector is an ordered collection of words. In a string vector, a word is an element or value, whereas in a numeric vector each word becomes an attribute or feature. Among the Naïve Bayes, SMO, J48, and ZeroR classifiers, SMO is the most accurate, with an accuracy of 95.7% for 10-fold cross-validation and 94.4% for a 60%:40% split. The best tokenizer for this classification is the Character N-Gram Tokenizer. All classifiers except ZeroR show higher accuracy with the Character N-Gram Tokenizer than with the Word Tokenizer. The best features for this system are character TriGrams and FourGrams. The TriGram is preferred because it requires less training data. The highest accuracy in the combination experiment is 0.965, obtained with IDF = FALSE and WC = TRUE, regardless of the TF setting.
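To make the character n-gram representation concrete, the following is a minimal Python sketch of how text can be turned into numeric vectors over character trigrams. It is not the study's implementation: the function names (`char_ngrams`, `vectorize`), the `output_counts` flag, and the two sample strings are assumptions introduced only for illustration. The `output_counts` flag loosely plays the role of the WC setting, on the assumption that WC means raw counts are output instead of binary presence values.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=3):
    """Return all character n-grams of length n_min..n_max found in text."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def vectorize(docs, n_min=3, n_max=3, output_counts=True):
    """Map each document to a numeric vector over a shared n-gram vocabulary.

    output_counts=True stores how often each n-gram occurs;
    output_counts=False stores only 1 (present) or 0 (absent).
    """
    tokenized = [char_ngrams(d.lower(), n_min, n_max) for d in docs]
    # The vocabulary (one attribute per n-gram) is built from all documents.
    vocab = sorted({g for toks in tokenized for g in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        if output_counts:
            vectors.append([counts[g] for g in vocab])
        else:
            vectors.append([1 if counts[g] else 0 for g in vocab])
    return vocab, vectors


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the study's data.
    vocab, vectors = vectorize(["makan", "makanan"])
    print(vocab)
    print(vectors)
```

Each n-gram in the vocabulary becomes one attribute of the numeric vector, which is the representation a classifier such as SMO or Naïve Bayes would then be trained on.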


identification of languages; local languages; string to word vector


