Singing Voice Conversion Tutorial
Xueyao Zhang

The Chinese University of Hong Kong, Shenzhen

Contents
  • What is Singing Voice Conversion?
  • Intra-singer conversion
  • Inter-singer conversion
  • Cross-domain conversion
  • Application and User Scenarios
  • Imitation and Entertainment
  • Singing Voice Beautification
  • Education
  • Creative Art
  • Recommended Datasets
  • Recommended Papers
  • Paradigm of the conversion framework (Basics)
  • To model singer independent features
  • To model singer dependent features
  • To introduce singing voice domain knowledge
  • Baseline: WORLD-based SVC
  • What is WORLD?
  • Overview
  • Implementation Details
  • Demo
  • Citation
    What is Singing Voice Conversion?

    Singing voice conversion (SVC) converts a singing voice into a desired target voice. There are three common SVC tasks:

  • Intra-singer conversion: convert a singer's voice to a desired timbre (e.g., a more beautiful voice).
  • Inter-singer conversion: convert a singer's voice to that of another singer.
  • Cross-domain conversion: the source audio and target audio come from different domains, e.g., converting a speech voice to a singing voice.
    Here are some examples:

    Intra-singer conversion

    Source:

    Target:

    Jiawei Li. Use more chest resonance for increasing your singing’s power. Bilibili, 2022.

    Inter-singer conversion

    Source:

    Reference:

    Target (source singer's content + reference singer's timbre):

    Chao Wang, et al. Towards High-Fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding. InterSpeech 2022.

    Cross-domain conversion

    Source (singing voice):

    Reference (speech):

    Target (source singer's content + reference speaker's timbre):

    Heyang Xue, et al. Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher. InterSpeech 2022.

    Application and User Scenarios
    Imitation and Entertainment
    Impression Show to various singers:


    Ya Yue. Arrangements for 姐就是女王 as different singers' styles. Bilibili, 2022.

    Singing Voice Beautification
    Tone Tuning:


    Magic of Tuner. Bilibili, 2022.

    Education
    Vocal music teaching:


    Jiawei Li. Use more chest resonance for increasing your singing’s power. Bilibili, 2022.

    Creative Art
    Translate vocal music to instrumental music:


    Peter Bence. Beat it (piano cover). Bilibili, 2022.

    Recommended Datasets

  • NUS-48E (English): 2.8 hours, 12 singers, 20 unique songs. [1]
  • Opencpop (Mandarin): 5.25 hours, 1 singer, 100 unique songs. [2]
  • M4Singer (Mandarin): 30.59 hours, 19 singers, 419 unique songs. [3]
    [1] Zhiyan Duan, et al. The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech. APSIPA 2013.
    [2] Yu Wang, et al. Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis. InterSpeech 2022.
    [3] Lichao Zhang, et al. M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus. NeurIPS 2022.

    Recommended Papers
    Here I list some SVC papers that use non-parallel data. "Non-parallel" means that we do not have paired (source audio, target audio) training data. You can refer to this review to learn more about it.
    Paradigm of the conversion framework (Basics)

  • Xin Chen, et al. Singing Voice Conversion with Non-parallel Data. IEEE MIPR 2019.
  • Eliya Nachmani, et al. Unsupervised Singing Voice Conversion. InterSpeech 2019.
    To model singer independent features

  • Zhonghao Li, et al. PPG-Based Singing Voice Conversion with Adversarial Representation Learning. ICASSP 2021.
  • Jordi Bonada, et al. Semi-supervised Learning for Singing Synthesis Timbre. ICASSP 2021.
    To model singer dependent features

  • Xu Li, et al. A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion. InterSpeech 2022.
    To introduce singing voice domain knowledge

  • Chengqi Deng et al. PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network. ICASSP 2020.
  • Haohan Guo et al. Improving Adversarial Waveform Generation Based Singing Voice Conversion with Harmonic Signals. ICASSP 2022.
  • Tae-Woo Kim et al. Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis. InterSpeech 2022.
    Baseline: WORLD-based SVC

    ✍   Notes

    In this tutorial, the provided WORLD-based SVC baseline is a classic approach. To learn more about contemporary SVC (such as using a neural vocoder and a stronger conversion model), you can refer to my latest work.

    All the source code of this baseline can be seen here.
    What is WORLD?

    WORLD [1] is a classical vocoder in Digital Signal Processing (DSP). It assumes an audio signal is composed of three acoustic components: fundamental frequency (F0), spectral envelope (SP), and aperiodic parameter (AP). It can be regarded not only as an extractor, which extracts F0, SP, and AP efficiently, but also as a synthesizer, which synthesizes audio by combining the three.

    Compared to recent neural vocoders (such as WaveNet, WaveRNN, HiFi-GAN, DiffWave, etc.), WORLD is more like a white box: it is more controllable and manipulable, although its synthesis quality is worse. We can easily adjust the input parameters to change the synthesized audio. For example, given this male singing voice:

    We can keep the F0 unchanged, and convert it to a robot-like voice:

    Or, we can double the F0, divide the SP by 1.2, and convert it to a female-like voice:

    The official code of WORLD has been released here. You can play with it and manipulate voices as you like!

    [1] Masanori Morise, et al. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Trans. Inf. Syst. 2016.
    Overview

    Paradigm of the conversion framework

    The conversion framework in the figure above was proposed by [1]. It has two main modules: a Content Extractor and a Singer-specific Synthesizer. Given a source audio, the Content Extractor first extracts content features (i.e., singer independent features) from the audio. Then, the Singer-specific Synthesizer injects the singer dependent features during synthesis, so that the target audio captures the target singer's characteristics.

    The authors of [1] assume that among the three components of WORLD (i.e., F0, AP, and SP), only SP is singer dependent and should be modeled by the Singer-specific Synthesizer, while the other two can be considered pure content features. Based on that, we can utilize the following two stages to conduct any-to-one conversion:

  • Acoustics Mapping Training (Training Stage): train the mapping from textual content features (e.g., PPG) to the target singer's acoustic features (e.g., SP or MCEP). The mapping model can be a neural network such as a Bi-LSTM [1].
  • Inference and Conversion (Conversion Stage): given any source singer's audio, first extract its content features, including F0, AP, and the textual content features. Then, use the trained model to infer the converted acoustic features (SP or MCEP). Finally, given F0, AP, and the converted SP, utilize WORLD as the vocoder to synthesize the converted audio.
    [1] Xin Chen, et al. Singing Voice Conversion with Non-parallel Data. IEEE MIPR 2019.
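The conversion stage described above can be sketched as follows. Here `model` and `mcep_to_sp` are hypothetical stubs standing in for the trained mapping network (a Bi-LSTM in [1]) and the MCEP-to-SP inverse transform; only the structure of the pipeline, with F0 and AP passing through untouched while SP is re-predicted, comes from the text above.

```python
import numpy as np

def convert(f0, ap, content_feats, model, mcep_to_sp):
    """Conversion stage: F0/AP pass through unchanged; only SP is
    re-predicted for the target singer, then WORLD resynthesizes."""
    mcep = model(content_feats)   # (frames, 40): target singer's MCEP
    sp = mcep_to_sp(mcep)         # back to a spectral envelope
    # audio = pyworld.synthesize(f0, sp, ap, fs)  # final WORLD synthesis
    return f0, sp, ap

# Toy stand-ins so the sketch runs end to end (shapes are assumptions).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((1024, 40)) * 0.01   # stub for the trained mapping
W2 = rng.standard_normal((40, 513)) * 0.01    # stub MCEP-to-SP transform
model = lambda c: c @ W1
mcep_to_sp = lambda m: np.exp(m @ W2)         # SP must be positive

content = rng.standard_normal((200, 1024))    # 200 frames of content features
f0 = np.full(200, 220.0)
ap = np.zeros((200, 513))
f0_out, sp, ap_out = convert(f0, ap, content, model, mcep_to_sp)
```

Note that because F0 is copied from the source, this baseline preserves the source melody exactly; only the timbre (via SP) moves toward the target singer.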

    Implementation Details

    The experimental setting is many-to-one conversion. Specifically, we use Opencpop (a single-singer dataset) as the target singer and M4Singer (a 19-singer dataset) as the source singers.

    We adopt a Python wrapper of WORLD to extract F0, AP, and SP, and to synthesize audio. We use diffsptk to transform between SP and MCEP. We utilize the output of Whisper's last encoder layer (1024-dimensional) as the content features. During the training stage, we use a 6-layer Transformer to learn the mapping from the Whisper features to 40-dimensional MCEP features.
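A minimal sketch of such an acoustic-mapping model in PyTorch is shown below. Only "6 encoder layers, 1024-dimensional Whisper input, 40-dimensional MCEP output" comes from the text above; the hidden size, head count, and other hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ContentToMCEP(nn.Module):
    """Maps frame-level content features to the target singer's MCEP."""
    def __init__(self, d_content=1024, d_model=256, n_layers=6, d_mcep=40):
        super().__init__()
        self.proj_in = nn.Linear(d_content, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, d_mcep)

    def forward(self, x):  # x: (batch, frames, 1024) Whisper features
        return self.proj_out(self.encoder(self.proj_in(x)))

model = ContentToMCEP()
whisper_feats = torch.randn(2, 100, 1024)  # 2 utterances, 100 frames each
mcep = model(whisper_feats)                # (2, 100, 40)
```

Training would then minimize a frame-wise regression loss (e.g., L1 or L2) between the predicted MCEP and the target singer's ground-truth MCEP extracted by diffsptk.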

    Demo
    Target Singer Samples (Opencpop):
    Conversion Samples (Convert different singers of M4Singer to Opencpop):
    For each source singer (Soprano-2, Soprano-3, Alto-4, Alto-5, Tenor-6, Tenor-7, Bass-2, Bass-3), the table pairs the source ground truth with the WORLD-converted result.
    Citation
    Cited as

    Xueyao Zhang. Singing Voice Conversion Tutorial. RMSnow's Blog. Jan 2023. https://www.zhangxueyao.com/data/SVC/tutorial.html.

    Or

    @article{zhang2023svc,
      title = "Singing Voice Conversion Tutorial",
      author = "Xueyao Zhang",
      journal = "RMSnow's Blog",
      year = "2023",
      month = "Jan",
      url = "https://www.zhangxueyao.com/data/SVC/tutorial.html"
    }