Evaluating the Performance of Topic Modeling Techniques for Bibliometric Analysis Research: An LDA-based Approach

Lan Thi Nguyen, Wirapong Chansanam, Nalatpa Hunsapun, Vispat Chaichuay, Suparp Kanyacome, Akkharawoot Takhom, Yuttana Jaroenruen, Chunqiu Li


Digital technologies now underpin a vast amount of bibliometric analysis research. Although these technologies have made scientific investigation more accessible and efficient, scholars face the daunting task of sifting through an overwhelming number of documents. This study aims to identify the primary topics, categories, and latent topics of bibliometric analysis research from a global perspective. It applies topic modeling to the abstracts of 16,039 eligible papers published between 1977 and 2023 and indexed in the Scopus database. Using Latent Dirichlet Allocation (LDA), the study identifies four distinct research topics and traces how they have evolved over time. Research focus has shifted from individual concepts and words to relationships between nodes and to conceptual, intellectual, and social structures. The findings have significant implications for bibliometric analysis-related research, providing insight into trends and patterns in bibliometric analysis content within large digital article archives. LDA proved to be an efficient tool for analyzing these trends and patterns quickly. The study's novel approach considers factors for word embedding usage and the optimal number of topics. It aims at a full understanding of the LDA results, combining statistical analysis, domain knowledge, and temporal exploration to better understand how data structures work.
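As a rough illustration of the kind of model the abstract refers to, the sketch below fits LDA to a handful of made-up abstract-like sentences using a collapsed Gibbs sampler written in plain Python. The abstract does not say which implementation, inference method, or hyperparameters the study used, so everything here is an assumption for illustration: the function name fit_lda, the toy_abstracts corpus, the values of alpha and beta, and the choice of two topics (the study itself reports four topics on 16,039 abstracts).

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Hypothetical helper: collapsed Gibbs sampling for LDA on tokenized docs.

    Returns (doc_topic, topic_word, vocab), where the first two are
    smoothed probability tables estimated from the final sample.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: document-topic, topic-word, and tokens per topic.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * V for _ in range(num_topics)]
    n_k = [0] * num_topics

    # Give every token a random initial topic.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # p(z = t | all other assignments), up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[t][v] + beta) / (n_k[t] + V * beta) for v in range(V)]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


# Made-up stand-ins for abstracts; not data from the study.
toy_abstracts = [
    "citation network analysis of journals and authors",
    "co citation network mapping of authors",
    "keyword frequency analysis of abstracts",
    "word frequency and keyword trends in abstracts",
]
docs = [a.split() for a in toy_abstracts]
doc_topic, topic_word, vocab = fit_lda(docs, num_topics=2)

# Show the highest-probability words per topic.
for t, row in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: -row[v])[:4]
    print(t, [vocab[v] for v in top])
```

On a real corpus one would normally rely on an established library and compare several topic counts (for example with a coherence measure) rather than fix the number in advance; the sampler above only shows the mechanics on a scale small enough to read.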


DOI: 10.28991/HIJ-2024-05-02-07



Keywords: Bibliometric; LDA; Topic Modeling; Topic Trends; Performance Evaluation.




Copyright (c) 2024 Lan Thi Nguyen, Wirapong Chansanam, Nalatpa Hunsapun, Vispat Chaichuay, Suparp Kanyacome, Akkharawoot Takhom, Yuttana Jaroenruen