Smart Data Placement Strategy in Heterogeneous Hadoop

Nour-Eddine Bakni, Ismail Assayad

Abstract


Big Data platforms are increasingly essential given the volume of data generated every moment by millions of people around the world. The Hadoop framework stores and processes these large amounts of data in parallel on a cluster of machines. The default data placement strategy of the Hadoop Distributed File System (HDFS) was designed for a homogeneous cluster in which all machines are considered identical, and it distributes data to nodes based only on their available disk space. Applying this strategy in a heterogeneous environment, where nodes differ in computing or disk storage capacity, may degrade performance. In this paper, we propose a smart data placement strategy (SDPS) for heterogeneous Hadoop clusters that aims to place frequently accessed data on high-performance nodes. SDPS accounts for cluster heterogeneity by first dividing nodes into groups according to their performance levels using a clustering algorithm and then allocating data blocks to appropriate nodes based on block hotness. SDPS also sets the replication factor of data blocks dynamically to reduce wasted storage space while maintaining data availability. Experimental results show that SDPS is more efficient than the default HDFS data placement policy in a heterogeneous environment, and that it improves MapReduce data processing, data locality, and storage efficiency.
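The two-step idea described in the abstract (group nodes by performance, then place blocks by hotness with a hotness-dependent replication factor) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the clustering algorithm, so a plain one-dimensional k-means is assumed here, and the performance score, the hotness thresholds, the replica counts, and all function names are hypothetical. The sketch also ignores disk capacity, rack awareness, and load balancing.

```python
from dataclasses import dataclass

K = 3  # assumed number of performance groups: 0 = slow, 1 = medium, 2 = fast


@dataclass
class Node:
    name: str
    score: float  # hypothetical performance score (higher is faster)


def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means (k >= 2); returns one label per value, 0 = lowest centroid."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly over the value range (deterministic start).
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return labels


def hotness_tier(access_count, warm=10, hot=100):
    """Map an access count to 0 (cold), 1 (warm) or 2 (hot); thresholds are made up."""
    return (access_count >= warm) + (access_count >= hot)


def place_blocks(nodes, block_access_counts):
    """Return {block_id: (replication_factor, [node names])}."""
    groups = kmeans_1d([n.score for n in nodes], K)
    by_group = {g: [n.name for n, lab in zip(nodes, groups) if lab == g] for g in range(K)}
    placement = {}
    for block_id, count in block_access_counts.items():
        tier = hotness_tier(count)
        replicas = 3 if tier == 2 else 2  # assumed rule: only hot blocks get a third replica
        # Prefer the group matching the block's tier, then the nearest other groups.
        order = sorted(range(K), key=lambda g: abs(g - tier))
        candidates = [name for g in order for name in by_group[g]]
        placement[block_id] = (replicas, candidates[:replicas])
    return placement


nodes = [Node("n1", 1.0), Node("n2", 1.2), Node("n3", 5.0),
         Node("n4", 5.5), Node("n5", 9.0), Node("n6", 9.5)]
print(place_blocks(nodes, {"b_hot": 500, "b_warm": 20, "b_cold": 1}))
```

With these sample inputs, the hot block is assigned three replicas starting on the fastest group, while the warm and cold blocks receive two replicas each on the medium and slow groups.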

 

DOI: 10.28991/HIJ-2025-06-01-03

Full Text: PDF


Keywords


Big Data; Data Placement; Hadoop; HDFS; Heterogeneous Cluster.




Copyright (c) 2025 Nour-Eddine BAKNI, Ismail ASSAYAD