Contextual Semantic Embeddings Based on Transformer Models for Arabic Biomedical Questions Classification

Ismail Ait Talghalit, Hamza Alami, Said Ouatik El Alaoui

Abstract


Arabic biomedical question classification (ABQC) is a challenging task for several reasons, including the specialized jargon expressed in the Arabic language, the complex semantics of Arabic vocabulary, and the lack of specific datasets and corpora. Moreover, only a few ABQC studies take the context of words into account when representing questions. In this work, we propose a classification model designed for Arabic biomedical questions. We build vector representations that capture the contextual and semantic information of Arabic biomedical text, which presents numerous challenges, such as the derivational morphology of the Arabic language, the specialized terminology of the biomedical domain, and the lack of capitalization in Arabic script. Our representation adapts the extensive knowledge encoded in BERT (Bidirectional Encoder Representations from Transformers) and other transformer models to address these challenges. Several experiments were conducted on a dedicated Arabic biomedical dataset, MAQA, with well-known transformer models, including BERT, AraBERT, BioBERT, RoBERTa, and DistilBERT, fine-tuned for the classification task. The results show that our method achieves remarkable performance, with an accuracy of 93.31% and an F1-score of 93.35%.
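As a concrete illustration of the pipeline described above, the sketch below fine-tunes a pre-trained transformer checkpoint for Arabic question classification with the Hugging Face Transformers library. This is a minimal sketch, not the authors' released code: the AraBERT checkpoint name, the three-way label set, and the hyperparameters are illustrative assumptions, and the loop trains on a single toy batch.

# Minimal fine-tuning sketch (illustrative assumptions, not the authors' code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabertv02"    # assumed AraBERT checkpoint
LABELS = ["diagnosis", "treatment", "prevention"]  # hypothetical question types

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

# Toy training pair: an Arabic medical question and its class index.
questions = ["ما هي أعراض مرض السكري؟"]  # "What are the symptoms of diabetes?"
labels = torch.tensor([0])

# Tokenize into contextual subword pieces; the [CLS] representation feeds
# the classification head added on top of the pre-trained encoder.
batch = tokenizer(questions, truncation=True, padding=True, max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps on the toy batch
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: predict the class of a new question.
model.eval()
with torch.no_grad():
    enc = tokenizer("ما هو علاج ارتفاع ضغط الدم؟", return_tensors="pt")
    pred = model(**enc).logits.argmax(dim=-1)
print(LABELS[int(pred)])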

 

DOI: 10.28991/HIJ-2024-05-04-011

Full Text: PDF


Keywords


Arabic Question Classification; Biomedical Domain; Natural Language Processing; Transformers; BERT; Fine-Tuning; Question Answering Systems; Sentence Embedding.

References


Sarrouti, M., & El Alaoui, S. O. (2017). A machine learning-based method for question type classification in biomedical question answering. Methods of Information in Medicine, 56(3), 209–216. doi:10.3414/ME16-01-0116.

Xu, S., Cheng, G., & Kong, F. (2016). Research on question classification for automatic question answering. 2016 International Conference on Asian Language Processing (IALP), 218–221. doi:10.1109/IALP.2016.7875972.

Babu, A., & Boddu, S. B. (2024). BERT-Based Medical Chatbot: Enhancing Healthcare Communication through Natural Language Understanding. Exploratory Research in Clinical and Social Pharmacy, 13, 100419. doi:10.1016/j.rcsop.2024.100419.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), 1, 4171–4186.

Tama, B. A., & Lim, S. (2020). A comparative performance evaluation of classification algorithms for clinical decision support systems. Mathematics, 8(10), 1–24. doi:10.3390/math8101814.

Hassan, E., Abd El-Hafeez, T., & Shams, M. Y. (2024). Optimizing classification of diseases through language model analysis of symptoms. Scientific Reports, 14(1). doi:10.1038/s41598-024-51615-5.

Momtazi, S. (2018). Unsupervised Latent Dirichlet Allocation for supervised question classification. Information Processing and Management, 54(3), 380–393. doi:10.1016/j.ipm.2018.01.001.

Hamza, A., En-Nahnahi, N., Zidani, K. A., & El Alaoui Ouatik, S. (2021). An Arabic question classification method based on new taxonomy and continuous distributed representation of words. Journal of King Saud University - Computer and Information Sciences, 33(2), 218–224. doi:10.1016/j.jksuci.2019.01.001.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, 1–12.

Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), 379–389.

Aggarwal, C. C., & Zhai, C. X. (2012). A survey of text clustering algorithms. Mining Text Data, Springer, Boston, MA, 77–128. doi:10.1007/978-1-4614-3223-4_4.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. ArXiv Preprint, ArXiv:1310.4546. doi:10.48550/arXiv.1310.4546.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1532–1543. doi:10.3115/v1/d14-1162.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146. doi:10.1162/tacl_a_00051.

Zhang, Y., Chen, Q., Yang, Z., Lin, H., & Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6(1), 52. doi:10.1038/s41597-019-0055-0.

Lahbari, I., & El Alaoui, S. O. (2024). Exploring Sentence Embedding Representation for Arabic Question Answering. International Journal of Computing and Digital Systems, 15(1), 1229–1241. doi:10.12785/ijcds/150187.

Antoun, W., Baly, F., & Hajj, H. (2020). AraBERT: Transformer-based model for Arabic language understanding. arXiv preprint, arXiv:2003.00104. doi:10.48550/arXiv.2003.00104.

Abdelhay, M., & Mohammed, A. (2022). MAQA: Medical Arabic Q&A Dataset. Harvard Dataverse, Cambridge, United States.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint, arXiv:1907.11692. doi:10.48550/arXiv.1907.11692.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. doi:10.1093/bioinformatics/btz682.

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint, arXiv:1910.01108. doi:10.48550/arXiv.1910.01108.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5999–6009.

Mutabazi, E., Ni, J., Tang, G., & Cao, W. (2023). An Improved Model for Medical Forum Question Classification Based on CNN and BiLSTM. Applied Sciences (Switzerland), 13(15), 8623. doi:10.3390/app13158623.

Vihikan, W. O., & Trisna, I. N. P. Indonesian health question multi-class classification based on deep learning. Journal of Information Systems and Informatics, 6(3), 1931–1944.

Mansour, M., Tohamy, M., Ezzat, Z., & Torki, M. (2020). Arabic Dialect Identification Using BERT Fine-Tuning. Proceedings of the Fifth Arabic Natural Language Processing Workshop, 308–312.

Boudjellal, N., Zhang, H., Khan, A., Ahmad, A., Naseem, R., Shang, J., & Dai, L. (2021). ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition. Complexity, 2021, 1–6. doi:10.1155/2021/6633213.

Zafar, A., Sahoo, S. K., Varshney, D., Das, A., & Ekbal, A. (2024). KIMedQA: towards building knowledge-enhanced medical QA models. Journal of Intelligent Information Systems, 62(3), 833–858. doi:10.1007/s10844-024-00844-1.

Hammoud, J., Vatian, A., Dobrenko, N., Vedernikov, N., Shalyto, A., & Gusarova, N. (2021). New Arabic Medical Dataset for Diseases Classification. Lecture Notes in Computer Science, 13113, 196–203. doi:10.1007/978-3-030-91608-4_20.

Al-Smadi, B. S. (2024). DeBERTa-BiLSTM: A multi-label classification model of Arabic medical questions using pre-trained models and deep learning. Computers in Biology and Medicine, 170, 107921. doi:10.1016/j.compbiomed.2024.107921.

Yu, H., Liu, C., Zhang, L., Wu, C., Liang, G., Escorcia-Gutierrez, J., & Ghoneim, O. A. (2023). An intent classification method for questions in “Treatise on Febrile diseases” based on TinyBERT-CNN fusion model. Computers in Biology and Medicine, 162, 107075. doi:10.1016/j.compbiomed.2023.107075.

Kofi Akpatsa, S., Lei, H., Li, X., Kofi Setornyo Obeng, V.-H., Mensah Martey, E., Clement Addo, P., & Dodzi Fiawoo, D. (2022). Online News Sentiment Classification Using DistilBERT. Journal of Quantum Computing, 4(1), 1–11. doi:10.32604/jqc.2022.026658.

Aftan, S., & Shah, H. (2023). Using the AraBERT Model for Customer Satisfaction Classification of Telecom Sectors in Saudi Arabia. Brain Sciences, 13(1), 147. doi:10.3390/brainsci13010147.

El-Alami, F.-Z., Ouatik El Alaoui, S., & En Nahnahi, N. (2022). Contextual semantic embeddings based on fine-tuned AraBERT model for Arabic text multi-class categorization. Journal of King Saud University - Computer and Information Sciences, 34(10), 8422–8428. doi:10.1016/j.jksuci.2021.02.005.

Houssein, E. H., Mohamed, R. E., Hu, G., & Ali, A. A. (2024). Adapting transformer-based language models for heart disease detection and risk factors extraction. Journal of Big Data, 11(1). doi:10.1186/s40537-024-00903-y.

Abdelali, A., Darwish, K., Durrani, N., & Mubarak, H. (2016). Farasa: A fast and furious segmenter for Arabic. 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Demonstrations Session, 11–16. doi:10.18653/v1/n16-3003.

Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to Fine-Tune BERT for Text Classification? Lecture Notes in Computer Science, 11856, 194–206. doi:10.1007/978-3-030-32381-3_16.

Abdelhay, M., Mohammed, A., & Hefny, H. A. (2023). Deep learning for Arabic healthcare: MedicalBot. Social Network Analysis and Mining, 13(1), 71. doi:10.1007/s13278-023-01077-w.




Copyright (c) 2024 Ismail Ait Talghalit