
《现代电影技术》 (Modern Film Technology) | Xiong Xiaoyu et al.: Research on Film Music Generation Based on a Multi-Granularity Attention Transformer


This article was published in Issue 9, 2024 of Modern Film Technology (《现代电影技术》).

Expert Commentary

In recent years, AI-generated content (AIGC) technology has advanced rapidly. Its mainstream model frameworks are built on deep neural networks and have evolved from the early GAN and VAE toward Transformer, Diffusion, and DiT (Diffusion Transformer) architectures. Among these, large language model (LLM) text generation has matured steadily, driving progress in image and sound generation and meeting growing demands for personalized creation through ever stronger controllability. Music is an indispensable expressive element of film, and as AIGC technology develops and finds application, AI music generation is gradually becoming a transformative force in film scoring. Two technical routes have emerged to date, symbolic generation and audio generation, but existing methods pay insufficient attention to control conditions such as musical genre, which to some degree limits improvements in the quality and diversity of generated music. The paper "Research on Film Music Generation Based on a Multi-Granularity Attention Transformer" generates symbolic music from scratch with encoded genre information as the conditional input. Drawing on the repetitive, periodic structure of music, it adopts a Transformer architecture with a multi-granularity attention mechanism to capture musical structure and context, and introduces a genre classification discriminator whose output genre probabilities serve as a recognition signal, providing style control for music generation. Compared with similar methods, this approach achieves considerable improvement in genre control and in the quality and structure of the generated music, though its practical applicability still leaves room for improvement and calls for further research.

——Wang Cui

Professor-level Senior Engineer

Deputy Director of the High and New Technology Research Division, China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)

About the Authors

Xiong Xiaoyu

Master's student (class of 2021) at the Shanghai Film Academy, Shanghai University. Research interests: deep learning, film music generation.

Xie Zhifeng

Associate professor and doctoral supervisor at the Shanghai Film Academy, Shanghai University, and the Shanghai Film Special Effects Engineering Technology Center. Research interests: advanced film technology, artificial intelligence.

Huang Dengyun

Master's student (class of 2023) at the Shanghai Film Academy, Shanghai University. Research interests: deep learning, film music generation.

Zhu Yonghua

Associate professor and master's supervisor at the Shanghai Film Academy, Shanghai University. Research interests: artificial intelligence, computer applications.

Abstract

Automatic film music generation is a current research hotspot in artificial intelligence. Many deep learning music generation algorithms can produce pleasant film scores, but they tend to ignore stylistic controls such as genre during generation. To address this, this paper proposes a film music generation method based on a multi-granularity attention Transformer that can generate music from scratch according to a target genre. Building on a multi-granularity attention Transformer that models musical structure, the method introduces an adversarial learning mechanism: a genre auxiliary classifier discriminator trained with both a genre classification loss and a generative adversarial loss strengthens the model's control over genre information. Objective and subjective experiments on a purpose-built symbolic music dataset annotated with genre information show that the proposed method outperforms previous methods in both the quality of the generated music and genre control, facilitating the automatic generation of film scores for a target genre.

Keywords

music generation; genre control; generative adversarial network; Transformer; film music

1 Introduction




2 Related Work

2.1 Deep Learning-Based Symbolic Music Generation



2.2 Controllable Music Generation



3 Proposed Method

3.1 Overall Network Architecture
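
As a hedged illustration of the setup the abstract describes (encoded genre information fed as the conditional input, with symbolic music generated from scratch), the sketch below prepends a learned genre embedding to the token stream of a causal Transformer language model. All names, sizes, and the vanilla PyTorch encoder standing in for the paper's backbone are assumptions, not the published architecture.

import torch
import torch.nn as nn

class GenreConditionedGenerator(nn.Module):
    # Hypothetical sketch: a causal Transformer language model over music
    # event tokens, conditioned on a target genre via a prepended embedding.
    def __init__(self, vocab_size=512, n_genres=8, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.genre_emb = nn.Embedding(n_genres, d_model)   # encoded genre condition
        self.pos_emb = nn.Embedding(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, genre_id):
        b, t = tokens.shape
        pos = self.pos_emb(torch.arange(t, device=tokens.device))
        x = torch.cat([self.genre_emb(genre_id).unsqueeze(1),   # (B, 1, D) condition slot
                       self.token_emb(tokens) + pos], dim=1)    # (B, 1+T, D)
        causal = torch.triu(torch.full((t + 1, t + 1), float("-inf"),
                                       device=tokens.device), diagonal=1)
        h = self.backbone(x, mask=causal)
        # Output at position i (genre slot included) predicts music token i.
        return self.head(h[:, :-1])

gen = GenreConditionedGenerator()
tokens = torch.randint(0, 512, (2, 64))              # toy event sequences
logits = gen(tokens, torch.tensor([3, 5]))           # two target genres
loss = nn.functional.cross_entropy(logits.reshape(-1, 512), tokens.reshape(-1))

Prepending the condition lets every later position attend to the genre slot, which is the simplest way to expose a global control signal to an autoregressive decoder.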




3.2 Data Representation
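
This excerpt does not restate the paper's token vocabulary; purely as an assumption, models in this family often adopt a REMI-like [8] event representation, flattening each note into Bar/Position/Pitch/Duration events, roughly as follows (all names hypothetical):

from dataclasses import dataclass

@dataclass
class Note:
    bar: int        # bar index within the piece
    position: int   # onset slot inside the bar (e.g. 16ths: 0..15)
    pitch: int      # MIDI pitch number
    duration: int   # length in slots

def to_events(notes):
    """Flatten notes into a REMI-like event sequence (hypothetical scheme)."""
    events, current_bar = [], -1
    for n in sorted(notes, key=lambda n: (n.bar, n.position)):
        if n.bar != current_bar:     # emit a Bar marker at each bar boundary
            events.append("Bar")
            current_bar = n.bar
        events += [f"Position_{n.position}", f"Pitch_{n.pitch}", f"Duration_{n.duration}"]
    return events

print(to_events([Note(0, 0, 60, 4), Note(0, 8, 64, 4), Note(1, 0, 67, 8)]))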


3.3 Multi-Granularity Attention Transformer
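
Per the abstract, musical structure is modeled by a multi-granularity attention Transformer, and the reference list points to the fine- and coarse-grained attention of Museformer [28]. The sketch below shows only a bar-level fine-attention pattern; the choice of "structure-related" offsets (1, 2, 4, and 8 bars back, matching common musical repetition periods) is an assumption, not the paper's exact selection.

import torch

def fine_bar_mask(n_bars, related_offsets=(1, 2, 4, 8)):
    """mask[i, j] == True  ->  bar i may attend to bar j at full token detail."""
    mask = torch.zeros(n_bars, n_bars, dtype=torch.bool)
    for i in range(n_bars):
        mask[i, i] = True                 # always see the current bar
        for d in related_offsets:         # typical repetition periods
            if i - d >= 0:
                mask[i, i - d] = True
    return mask

# Bars outside this selected set would be visible only through per-bar summary
# tokens (the coarse granularity), which keeps attention affordable on long
# pieces while still exposing the repetitive structure of music.
print(fine_bar_mask(6).int())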





3.4 Genre Auxiliary Classifier Discriminator
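
The abstract specifies a genre auxiliary classifier discriminator trained with both a genre classification loss and a generative adversarial loss, in the spirit of ACGAN [29]: one head scores real versus generated sequences, the other outputs genre probabilities. A minimal sketch under assumed sizes, with a GRU standing in for whatever sequence encoder the paper actually uses:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GenreAuxDiscriminator(nn.Module):
    def __init__(self, vocab_size=512, n_genres=8, d_model=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in encoder
        self.adv_head = nn.Linear(d_model, 1)         # real/fake logit
        self.cls_head = nn.Linear(d_model, n_genres)  # genre logits

    def forward(self, tokens):
        _, h = self.encoder(self.emb(tokens))
        h = h[-1]                                     # final hidden state, (B, D)
        return self.adv_head(h).squeeze(-1), self.cls_head(h)

def discriminator_loss(d, real, fake, real_genre, fake_genre):
    adv_r, cls_r = d(real)
    adv_f, cls_f = d(fake)
    # Generative adversarial term: score real sequences high, generated ones low.
    adv = F.binary_cross_entropy_with_logits(adv_r, torch.ones_like(adv_r)) + \
          F.binary_cross_entropy_with_logits(adv_f, torch.zeros_like(adv_f))
    # Genre classification term on both real and generated sequences.
    cls = F.cross_entropy(cls_r, real_genre) + F.cross_entropy(cls_f, fake_genre)
    return adv + cls

Symmetrically, the generator would be trained to fool the adversarial head while making the classification head recover its target genre, which is what couples genre control to generation quality.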




4 Experimental Results and Analysis

4.1 Dataset


4.2 Experimental Setup


4.3 Objective Evaluation







4.4 Subjective Evaluation


5 Conclusion



References


[1] Chen J S, Abudukelimu H, Liang Y Z, et al. A survey of deep learning applications in symbolic music generation[J]. Computer Engineering and Applications, 2023, 59(09): 27-45.

[2] Roberts A, Engel J, Raffel C, et al. A hierarchical latent vector model for learning long-term structure in music[C]//International Conference on Machine Learning. PMLR, 2018: 4364-4373.

[3] Brunner G, Konrad A, Wang Y, et al. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer[C]//Proceedings of the 19th International Society for Music Information Retrieval Conference, 2018: 343-351.

[4] Mogren O. C-RNN-GAN: Continuous recurrent neural networks with adversarial training[C]//Conference and Workshop on Neural Information Processing Systems, 2016: 1-6.

[5] Yang L C, Chou S Y, Yang Y H. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation[C]//Proceedings of the International Society for Music Information Retrieval Conference, 2017: 324-331.

[6] Dong H W, Hsiao W Y, Yang L C, et al. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 34-41.

[7] Huang C Z A, Vaswani A, Uszkoreit J, et al. Music Transformer: Generating music with long-term structure[C]//International Conference on Learning Representations, 2019: 123-131.

[8] Huang Y S, Yang Y H. Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions[C]//Proceedings of the 28th ACM International Conference on Multimedia, 2020: 1180-1188.

[9] Dai Z, Yang Z, Yang Y, et al. Transformer-XL: Attentive language models beyond a fixed-length context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2978-2988.

[10] Hsiao W Y, Liu J Y, Yeh Y C, et al. Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 178-186.

[11] Katharopoulos A, Vyas A, Pappas N, et al. Transformers are RNNs: Fast autoregressive transformers with linear attention[C]//International Conference on Machine Learning. PMLR, 2020: 5156-5165.

[12] Zhang N. Learning adversarial transformer for symbolic music generation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 34(4): 1754-1763.

[13] Muhamed A, Li L, Shi X, et al. Symbolic music generation with Transformer-GANs[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 408-417.

[14] Wang L, Zhao Z, Liu H, et al. A review of intelligent music generation systems[J]. Neural Computing and Applications, 2024: 1-21.

[15] Mao H H, Shin T, Cottrell G. DeepJ: Style-specific music generation[C]//2018 IEEE 12th International Conference on Semantic Computing (ICSC). IEEE, 2018: 377-382.

[16] Johnson D D. Generating polyphonic music using tied parallel networks[C]//International Conference on Evolutionary and Biologically Inspired Music and Art. Cham: Springer International Publishing, 2017: 128-143.

[17] Wang Z, Wang D, Zhang Y, et al. Learning interpretable representation for controllable polyphonic music generation[C]//Proceedings of the 21st International Society for Music Information Retrieval Conference, 2020: 662-669.

[18] Choi K, Hawthorne C, Simon I, et al. Encoding musical style with transformer autoencoders[C]//International Conference on Machine Learning. PMLR, 2020: 1899-1908.

[19] Di S, Jiang Z, Liu S, et al. Video background music generation with controllable music transformer[C]//Proceedings of the 29th ACM International Conference on Multimedia, 2021: 2037-2045.

[20] Zhuo L, Wang Z, Wang B, et al. Video background music generation: Dataset, method and evaluation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 15637-15647.

[21] Hung H T, Ching J, Doh S, et al. EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation[C]//Proceedings of the 22nd International Society for Music Information Retrieval Conference, 2021: 318-325.

[22] Kang J, Poria S, Herremans D. Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model[J]. Expert Systems with Applications, 2024, 249: 123640.

[23] Ding Z, Liu X, Zhong G, et al. SteelyGAN: Semantic unsupervised symbolic music genre transfer[C]//Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Cham: Springer International Publishing, 2022: 305-317.

[24] Wu S L, Yang Y H. MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1953-1967.

[25] Huang H, Wang Y, Li L, et al. Music style transfer with diffusion model[C]//Proceedings of the International Computer Music Conference, 2023: 39-46.

[26] Wang W, Li X, Jin C, et al. CPS: Full-song and style-conditioned music generation with linear transformer[C]//2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE, 2022: 1-6.

[27] Sarmento P, Kumar A, Chen Y H, et al. GTR-CTRL: Instrument and genre conditioning for guitar-focused music generation with Transformers[C]//International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar). Cham: Springer Nature Switzerland, 2023: 260-275.

[28] Yu B, Lu P, Wang R, et al. Museformer: Transformer with fine- and coarse-grained attention for music generation[J]. Advances in Neural Information Processing Systems, 2022, 35: 1376-1388.

[29] Odena A, Olah C, Shlens J. Conditional image synthesis with auxiliary classifier GANs[C]//International Conference on Machine Learning. PMLR, 2017: 2642-2651.

[30] Zeng M, Tan X, Wang R, et al. MusicBERT: Symbolic music understanding with large-scale pre-training[C]//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021: 791-800.



Supervised by: China Film Administration

Sponsored by: Film Technology Quality Inspection Institute

International Standard Serial Number: ISSN 1673-3215

Domestic Unified Serial Number: CN 11-5336/TB

Submission system: ampt.crifst.ac.cn

Official website: www.crifst.ac.cn

Journal distribution: 010-63245081

