Comparative Analysis of Deep Learning Models for Part of Speech Tagging in the Malay Language

Bakare Mustaphaa Adebayo, Kalaiarasi Sonai Muthu Anbananthen, Saravanan Muthaiyah, Saravanan Nathan Lurudusamy

Abstract


Although Malay is widely spoken, it remains an under-resourced language in Natural Language Processing (NLP), particularly for Part-of-Speech (POS) tagging, where the scarcity of annotated corpora is the primary obstacle. This study aims to improve the effectiveness and reliability of POS tagging models for under-resourced languages, focusing on Malay. Existing taggers, which rely on Conditional Random Fields and Hidden Markov Models, exhibit limitations that underscore the need for more robust approaches. The study conducts a comparative analysis of deep learning models with different encoders for POS tagging of Malay sentences. The experiments show that a Bidirectional Long Short-Term Memory (Bi-LSTM) model built on pre-trained Bidirectional Encoder Representations from Transformers (BERT) embeddings achieves the highest accuracy, precision, recall, and F1 scores in predicting tags. With an accuracy of 98.82%, the BERT + Bi-LSTM model outperforms the other models on all evaluated metrics. The combined model also handles both known and unknown words effectively, yielding highly accurate POS tags for Malay sentences.

 

DOI: 10.28991/HIJ-2024-05-02-04


Keywords


Part of Speech Tagging; Deep Learning; Malay Text; Malay POS Tagger.


Copyright (c) 2024 Bakare Mustaphaa Adebayo, Kalaiarasi Sonai Muthu Anbananthen, Saravanan Muthaiyah, Saravanan Nathan Lurudusamy