Heterogeneous Digital Music Generation Techniques Incorporating Fine-Grained Controls
To address insufficient note-level attribute modulation and the difficulty of fusing cross-genre musical elements, this study proposes a hierarchical conditional embedding mechanism together with a symbolic-feature conditional diffusion method. Through dynamic gated fusion of note-structure features and symbol-guided adaptive acoustic modulation, the model jointly optimizes millisecond-precision generation of melody and rhythm and the efficiency of coordinated control in high-fidelity audio synthesis, enabling fine-grained controllable generation of cross-cultural heterogeneous music. Experimental results show that the model achieved 96.2% note localization accuracy in cross-cultural scenarios, 12.8% higher than the baseline. The minimum beat-synchronization deviation was 1.7 ms, 52.9% lower than the best comparison model. The average polyphony duration was 70.6%, an improvement of 9.8%. Differential scale fusion reached a 12.5-tone level, going beyond the limit of twelve-tone equal temperament. Peak memory occupation was 198.3 MB, and the energy consumption for a single song was as low as 0.142 kWh, 29.4% lower than the traditional solution. Professional composition evaluation showed that the cultural coordination of the heterogeneous style-fusion fragments reached 92.1%. Real-time generation latency stabilized at 2.8 ms, and generation quality improved by 38.7% over the industry standard. These results demonstrate the model's comprehensive advantages in cross-dimensional control and artistic expression. The model can be integrated into digital audio workstations (DAWs) as a plug-in or a cloud API, giving creators real-time interactive generation and style-transfer capabilities, with intuitive control over both macro-level structure and micro-level acoustic detail via natural-language commands or symbolic input. This significantly lowers the barrier to high-quality, AI-assisted composition and promotes the wider exploration and application of cross-cultural music fusion.
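To make the conditioning scheme described above more concrete, the sketch below shows one way a dynamic gate could blend note-structure embeddings with symbolic (genre or scale) embeddings before they condition a diffusion denoising step. This is a minimal illustration assuming PyTorch; all module and parameter names (GatedConditionFusion, ConditionalDenoiser, d_model, n_feats) are hypothetical and do not come from the paper's released code.

```python
# Minimal sketch (assumption, not the authors' implementation): gated fusion of
# note-structure and symbolic conditioning features for a conditional denoiser.
import torch
import torch.nn as nn


class GatedConditionFusion(nn.Module):
    """Dynamic gate that blends note-structure features with symbolic
    (genre/scale) features before they condition the diffusion model."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, note_emb: torch.Tensor, sym_emb: torch.Tensor) -> torch.Tensor:
        # note_emb, sym_emb: (batch, seq, d_model)
        g = self.gate(torch.cat([note_emb, sym_emb], dim=-1))  # per-position gate
        return g * note_emb + (1.0 - g) * sym_emb              # convex blend


class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts noise on an audio-feature sequence given the
    fused condition and a diffusion timestep embedding."""

    def __init__(self, d_model: int = 256, n_feats: int = 80):
        super().__init__()
        self.time_emb = nn.Embedding(1000, d_model)   # diffusion steps 0..999
        self.in_proj = nn.Linear(n_feats, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.out_proj = nn.Linear(d_model, n_feats)

    def forward(self, noisy_x, cond, t):
        h = self.in_proj(noisy_x) + cond + self.time_emb(t).unsqueeze(1)
        h, _ = self.backbone(h)
        return self.out_proj(h)                       # predicted noise

if __name__ == "__main__":
    batch, seq, d_model, n_feats = 2, 64, 256, 80
    note_emb = torch.randn(batch, seq, d_model)   # stand-in for a note-structure encoder output
    sym_emb = torch.randn(batch, seq, d_model)    # stand-in for a symbolic/genre encoder output

    fusion = GatedConditionFusion(d_model)
    denoiser = ConditionalDenoiser(d_model, n_feats)

    cond = fusion(note_emb, sym_emb)
    noisy = torch.randn(batch, seq, n_feats)
    t = torch.randint(0, 1000, (batch,))
    eps_hat = denoiser(noisy, cond, t)
    print(eps_hat.shape)  # torch.Size([2, 64, 80])
```

The gate is learned per position, so the conditioning signal can lean on note-level structure where rhythm precision matters and on symbolic genre features where cross-cultural scale information dominates; the actual fusion and denoiser architectures in the paper may differ.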





















