Overview of Singing Voice Synthesis and Model Structures
An overview of Prompt-Singer, a controllable singing voice synthesis (SVS) system driven by natural language prompts. The slides cover the model structure, how a text encoder extracts semantic representations from prompts, how training prompts are built, basic digital audio concepts, and the audio codec used to represent singing as discrete tokens.
Presentation Transcript
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt. Presented by Fischer Yeh, R12942065
Problem Definition & Model Structure. Singing-Voice-Synthesis (SVS) with conditions. Diagram labels: INPUT, OUTPUT. INPUT first!
Natural Language Prompt (https://prompt-singer.github.io/). Specific objectives. How to get labels and prompts for training?
Template Ex: Generate a song by a [gender] singer
Training Prompt. Takes place dynamically during training. Extra prompt: "Could you synthesize a song that's as powerful as a thunderstorm?" (large volume)
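A minimal sketch of how a template prompt could be filled in on the fly during training. Only the first template comes from the slide; the second template, the `volume` label, and the function name `build_prompt` are hypothetical and used for illustration only.

```python
import random

# Only the first template appears on the slide; the second is hypothetical.
TEMPLATES = [
    "Generate a song by a [gender] singer",
    "Please sing this song with a [gender] voice at [volume] volume",
]

def build_prompt(labels, rng):
    """Pick a template at random and fill each [slot] from the labels dict."""
    template = rng.choice(TEMPLATES)
    for key, value in labels.items():
        template = template.replace(f"[{key}]", value)
    return template

# A seeded generator keeps the example reproducible.
rng = random.Random(0)
prompt = build_prompt({"gender": "female", "volume": "large"}, rng)
print(prompt)
```

Because the template is chosen at call time, the same labeled sample can be paired with a different prompt each time it is drawn.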
Model structure (diagram labels: INPUT, OUTPUT)
Prompt encoder: a text encoder extracts a semantic representation from the prompt (FLAN-T5, BERT, and CLAP). Results table headings: subjective relevance to the prompt, labels task (accuracy), pitch, quality.
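What is digital? Each sample is stored with 16 bits or 32 bits. With 16 bits there are 2^16 = 65536 possible values for every sample, and at a 16 kHz sampling rate a new sample arrives every 1/16000 sec. Source: https://qph.cf2.quoracdn.net/main-qimg-09d3aff2a52e7f46b2b186f7f481ae1a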
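The sampling and quantizing step can be sketched as follows. This is an illustration under stated assumptions: the 440 Hz sine input and the helper name `quantize` are hypothetical, while the 16 kHz rate and 16-bit depth are the figures from the slide.

```python
import math

SAMPLE_RATE = 16_000   # samples per second, as on the slide
BIT_DEPTH = 16         # bits per sample, as on the slide

def quantize(x, bit_depth=BIT_DEPTH):
    """Map a float in [-1.0, 1.0] to a signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1   # 32767 for 16 bits
    x = max(-1.0, min(1.0, x))             # clip out-of-range input
    return round(x * max_level)

# 10 ms (160 samples) of a hypothetical 440 Hz sine tone.
samples = [
    quantize(math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(160)
]

levels = 2 ** BIT_DEPTH   # number of distinct values one sample can take
print(levels, samples[:5])
```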
Audio codec. If sampling and quantizing once is not sufficient, do it again! 1024 possibilities in 1/100 sec. Diagram: a grid of tokens with codebooks 1 to C on one axis and frames 1, 2, 3 to T on the other.
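A rough worked comparison of the two representations, using the numbers from the slides (16-bit samples at 16 kHz versus 1024-entry codebooks at 100 frames per second). The number of codebooks is an assumption here: C = 3 is borrowed from the example on the multi-scale transformer slide.

```python
import math

# Raw waveform: 16 bits for every sample, 16000 samples per second.
pcm_bits_per_second = 16_000 * 16

# Codec tokens: one index per codebook per frame.
frame_rate = 100          # frames per second (1/100 sec per frame)
codebook_size = 1024      # possibilities per token
num_codebooks = 3         # assumption: C = 3, as in the slide example

bits_per_token = int(math.log2(codebook_size))
codec_bits_per_second = frame_rate * num_codebooks * bits_per_token

compression_ratio = pcm_bits_per_second / codec_bits_per_second
print(pcm_bits_per_second, codec_bits_per_second, round(compression_ratio, 1))
```

Under these assumptions the token stream is far shorter than the raw sample stream, which is what makes it practical to model with a transformer.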
Audio codec: SoundStream (RVQ type). It produces discrete compressed representations of audio, and these representations can be used to reconstruct waveforms with the decoder. Diagram: codebooks 1 to C by frames 1, 2, 3 to T; each token a takes a value in 0 ~ K_a - 1, where K_a is the codebook size.
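Residual vector quantization (RVQ) can be sketched in a few lines. This is a toy illustration, not SoundStream itself: the codebook values are made up, and the learned encoder and decoder networks that surround the quantizer in a real codec are omitted. Each stage quantizes what the previous stage left over, so one frame yields C indices.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec with C codebooks; each stage codes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup (hypothetical values): C = 2 codebooks, 3 entries each, dimension 2.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],     # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],   # fine stage
]
frame = [1.2, 1.0]
codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, recon)
```

The first index gives a coarse approximation of the frame and the second refines it, which mirrors "do it again" on the previous slide.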
Decoupled Pitch Representation. Decompose F0 into two components: average F0 and rescaled F0 sequence. Figure: the same song sung at different pitches by Male1, Male2, Female1, and Female2, with F0 contours F1, F2, F3, F4. This enables prompt-conditioned voice range manipulation while keeping melodic accuracy.
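One way to picture the decomposition is sketched below. The slide only states that F0 is split into an average and a rescaled sequence; the multiplicative rescaling, the 200 Hz target mean, the convention that 0 marks an unvoiced frame, and the function name `decouple_f0` are all assumptions made for illustration.

```python
def decouple_f0(f0, target_mean=200.0):
    """Split an F0 contour (Hz, 0 = unvoiced) into (average, rescaled contour).

    Assumption: the contour is rescaled multiplicatively so that its voiced
    frames average to target_mean, which removes the singer's voice range.
    """
    voiced = [f for f in f0 if f > 0]
    if not voiced:
        return 0.0, list(f0)
    average = sum(voiced) / len(voiced)
    scale = target_mean / average
    rescaled = [f * scale if f > 0 else 0.0 for f in f0]
    return average, rescaled

# The same hypothetical melody sung one octave apart.
male = [110.0, 0.0, 123.75, 137.5]
female = [220.0, 0.0, 247.5, 275.0]

male_avg, male_rescaled = decouple_f0(male)
female_avg, female_rescaled = decouple_f0(female)
print(male_avg, female_avg)
print(male_rescaled)
print(female_rescaled)
```

In this sketch the two averages differ while the rescaled contours coincide, so the average can carry the voice range requested by the prompt and the rescaled sequence can carry the melody.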
Lyrics Representation. Lyrics duration (labeled). Phonemize & forced-alignment: w o ai n ii. Phoneme duration expand: w w, o o o o o o o o o, ai ai ai, n, ii ii ii. Note: after expansion, each phoneme represents one frame.
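The expansion step can be sketched as follows. The durations are read off the slide's illustration and may not be exact, and the function name `expand_phonemes` is hypothetical.

```python
def expand_phonemes(phonemes, durations):
    """Repeat each phoneme by its duration in frames (frame-level sequence)."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames

phonemes = ["w", "o", "ai", "n", "ii"]
durations = [2, 9, 3, 1, 3]   # frames per phoneme, as read from the slide

frame_phonemes = expand_phonemes(phonemes, durations)
print(len(frame_phonemes), frame_phonemes)
```

The result has one phoneme per frame, so it lines up with the other frame-level inputs.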
Multi-Scale Transformer Architecture. Decoder-only transformers (labeled 100M and 320M in the figure), generating autoregressively. For non-acoustic modalities, each item is repeated C times to fit this modeling mechanism (EX: C = 3; the figure shows Frame 1, Frame 2, Frame 3). Trained with 6 NVIDIA V100 GPUs for about 4-5 days.
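The repetition of non-acoustic items can be sketched as below, assuming C = 3 as in the slide's example. The function name `repeat_items` and the sample tokens are hypothetical; the point is only that every frame ends up occupying C positions, the same as a frame of codec tokens.

```python
def repeat_items(items, c):
    """Repeat every item c times so each frame spans c positions."""
    return [item for item in items for _ in range(c)]

C = 3   # number of codebooks, as in the slide's example

# Frame-level phonemes for three frames (hypothetical values).
phoneme_frames = ["w", "w", "o"]
phoneme_positions = repeat_items(phoneme_frames, C)

# Acoustic tokens already have C entries per frame, one per codebook.
acoustic_frames = [[12, 7, 3], [45, 9, 1], [8, 8, 2]]
acoustic_positions = [token for frame in acoustic_frames for token in frame]

print(phoneme_positions)
print(acoustic_positions)
```

After repetition, both sequences have the same number of positions per frame, so they can share one frame-by-frame layout.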
Alleviating Data Scarcity. Results table headings: subjective relevance to the prompt, labels task (accuracy), pitch, quality.