Structure-Enhanced Pop Music Generation via Harmony-Aware Learning
Xueyao Zhang, Jinchao Zhang, Yao Qiu, Li Wang, Jie Zhou

The Chinese University of Hong Kong, Shenzhen, China
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Proceedings of the 30th ACM International Conference on Multimedia (ACM MM 2022)

Abstract

Automatically composing pop music with a satisfactory structure is an attractive but challenging topic. Although humans perceive musical structure easily, it is difficult to describe clearly and define accurately, and how to model structure in pop music generation remains far from solved. In this paper, we propose to leverage harmony-aware learning for structure-enhanced pop music generation. On the one hand, one component of harmony, the chord, represents a harmonic set of multiple notes and is closely integrated with the spatial structure of music, the texture. On the other hand, the other component of harmony, the chord progression, usually accompanies the development of the music and promotes its temporal structure, the form. Moreover, as chords evolve into a chord progression, harmony naturally bridges texture and form, which contributes to the joint learning of the two structures. Furthermore, we propose the Harmony-Aware Hierarchical Music Transformer (HAT), which adaptively exploits the structure of the music and lets musical tokens interact hierarchically to enhance the structure across multi-level musical elements. Experimental results reveal that, compared to existing methods, HAT has a much better understanding of structure and also improves the quality of the generated music, especially in form and texture.

Compared Methods

  • Our proposed Harmony-Aware Hierarchical Music Transformer (HAT)
  • Music Transformer (Cheng-Zhi Anna Huang et al., ICLR'19)
  • CP-Transformer (Wen-Yi Hsiao et al., AAAI'21)

Dataset

We use POP909 (and its phrase-level annotation) as our experimental dataset.

Code

The source code is available here.

Top pieces in the subjective evaluation

Note: In the following MIDI players, the Melody track, the Accompaniment track, and the Bridge track are visualized in red, gray, and blue, respectively.

    HAT
    Top 1
    Average Score: 0.94 (Melody: 0.89, Groove: 0.92; Primary Melody: 1.00, Consonance: 0.83; Coherence: 1.00, Integrity: 1.00)
    Top 2
    Average Score: 0.92 (Melody: 0.83, Groove: 1.00; Primary Melody: 0.78, Consonance: 1.00; Coherence: 0.89, Integrity: 1.00)
    Top 3
    Average Score: 0.85 (Melody: 1.00, Groove: 0.93; Primary Melody: 0.93, Consonance: 0.80; Coherence: 0.86, Integrity: 0.56)
    Top 4
    Average Score: 0.77 (Melody: 0.63, Groove: 0.80; Primary Melody: 0.89, Consonance: 0.77; Coherence: 0.72, Integrity: 0.78)
    Top 5
    Average Score: 0.73 (Melody: 0.78, Groove: 0.50; Primary Melody: 0.78, Consonance: 0.78; Coherence: 0.67, Integrity: 0.89)
    Music Transformer
    Top 1
    Average Score: 0.74 (Melody: 0.67, Groove: 0.75; Primary Melody: 0.80, Consonance: 0.75; Coherence: 1.00, Integrity: 0.50)
    Top 2
    Average Score: 0.55 (Melody: 0.67, Groove: 0.50; Primary Melody: 0.80, Consonance: 0.25; Coherence: 0.60, Integrity: 0.50)
    Top 3
    Average Score: 0.40 (Melody: 0.33, Groove: 0.25; Primary Melody: 0.60, Consonance: 0.75; Coherence: 0.20, Integrity: 0.25)
    CP-Transformer
    Top 1
    Average Score: 0.70 (Melody: 0.76, Groove: 0.58; Primary Melody: 0.69, Consonance: 0.75; Coherence: 0.75, Integrity: 0.69)
    Top 2
    Average Score: 0.67 (Melody: 0.72, Groove: 0.72; Primary Melody: 0.61, Consonance: 0.44; Coherence: 0.78, Integrity: 0.72)
    Top 3
    Average Score: 0.56 (Melody: 0.39, Groove: 0.39; Primary Melody: 0.61, Consonance: 0.50; Coherence: 0.56, Integrity: 0.89)
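The average scores listed above appear to be the unweighted mean of the six criteria. The page does not state this explicitly, so the sketch below treats it as an assumption and checks it against the HAT "Top 1" entry. The helper name `average_score` is hypothetical and not part of the released code.

```python
from statistics import mean

# Sub-scores copied from the HAT "Top 1" entry above.
hat_top1 = {
    "Melody": 0.89,
    "Groove": 0.92,
    "Primary Melody": 1.00,
    "Consonance": 0.83,
    "Coherence": 1.00,
    "Integrity": 1.00,
}


def average_score(sub_scores):
    """Unweighted mean of the criteria (assumed; not stated on the page)."""
    return mean(sub_scores.values())


# Round to two decimals, matching the precision used in the listing.
print(f"{average_score(hat_top1):.2f}")
```

Because the listed sub-scores are themselves rounded, recomputing the mean from them can differ from a listed average by 0.01 for some entries.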
Case Study

We use the intro of the Chinese pop song Guang Yin De Gu Shi as a prompt for each model. The generated pieces are as follows.

    Real piece
    HAT
    Music Transformer
    CP-Transformer