Comprehensive Overview of Singing Voice Synthesis and Model Structures

Explore controllable singing voice synthesis with natural language prompts. Understand the model structure for singing-voice synthesis (SVS) and how text encoders extract semantic representations. Learn about training prompts, digital audio concepts, and audio codecs, and how a prompt can request, for example, a song as powerful as a thunderstorm. Dive into the details of the model structure and the possibilities of digital audio processing.



Presentation Transcript


  1. Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt. Presented by Fischer Yeh, R12942065.

  2. Problem Definition & Model Structure: Singing-Voice-Synthesis (SVS) with conditions (diagram: INPUT → OUTPUT; the input conditions come first).

  3. Natural Language Prompt (demo: https://prompt-singer.github.io/). Specific objective: how to get labels and prompts for training?

  4. Template. Example: "Generate a song by a [gender] singer."

  5. Training Prompt: prompts are constructed dynamically during training. Extra prompt example: "Could you synthesize a song that's as powerful as a thunderstorm?" (large volume)
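
To make the dynamic prompt construction concrete, here is a minimal sketch of sampling a template at each training step and filling in attribute labels. The template strings, the `build_training_prompt` helper, and the attribute names (`gender`, `volume`) are illustrative assumptions, not the paper's actual template bank.

```python
import random

# Hypothetical prompt templates; the paper uses a richer template bank
# that also includes free-form paraphrases of attribute values.
TEMPLATES = [
    "Generate a song by a {gender} singer.",
    "Please sing in a {volume} voice.",
    "Could you synthesize a song that's as powerful as a thunderstorm?",  # paraphrase for large volume
]

def build_training_prompt(labels: dict) -> str:
    """Sample a template at random each training step and fill in attribute labels."""
    template = random.choice(TEMPLATES)
    return template.format(**labels)

print(build_training_prompt({"gender": "female", "volume": "loud"}))
```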

  6. Model structure (diagram: INPUT → OUTPUT).

  7. Prompt encoder: a text encoder extracts a semantic representation of the prompt (FLAN-T5, BERT, and CLAP are compared). The encoders are evaluated on subjective relevance to the prompt, label/task accuracy, and pitch quality.
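
A minimal sketch of extracting a semantic prompt representation with a frozen FLAN-T5 encoder via Hugging Face Transformers (one of the encoders listed on the slide). The model checkpoint, the mean-pooling step, and the variable names are assumptions for illustration; the actual system may condition on the full token-level sequence instead.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load a frozen FLAN-T5 text encoder (checkpoint chosen here for illustration).
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-large").eval()

prompt = "Generate a song by a female singer in a loud voice."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # shape: (1, seq_len, d_model)

# Mean-pool over tokens to get a single semantic vector for conditioning
# (one common pooling choice; not necessarily the paper's).
prompt_embedding = hidden.mean(dim=1)              # shape: (1, d_model)
```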

  8. Model structure (diagram: INPUT → OUTPUT).

  9. What is digital audio? With 16-bit (or 32-bit) samples at a 16 kHz sampling rate, each sample takes one of 2^16 = 65,536 values every 1/16000 of a second. Source: https://qph.cf2.quoracdn.net/main-qimg-09d3aff2a52e7f46b2b186f7f481ae1a
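
The numbers on this slide can be reproduced with a short sampling-and-quantization sketch. This is a generic 16-bit PCM illustration, not code from the paper; the 440 Hz sine and the variable names are arbitrary choices.

```python
import numpy as np

sr = 16_000          # samples per second (16 kHz)
bits = 16            # bits per sample
levels = 2 ** bits   # 65,536 possible values per sample

# One second of a 440 Hz sine wave, sampled at 16 kHz.
t = np.arange(sr) / sr
analog = np.sin(2 * np.pi * 440 * t)                       # continuous values in [-1, 1]

# Quantize each sample to one of 65,536 integer levels (16-bit PCM).
quantized = np.round(analog * (levels // 2 - 1)).astype(np.int16)

print(levels, "possible values every 1/16000 of a second")
```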

  10. Audio codec: if sampling and quantizing once is not sufficient, do it again! Each frame (1/100 sec) is mapped to one of 1024 codebook entries (diagram: codebooks 1..C over frames 1..T).

  11. Audio codec: SoundStream (an RVQ-type codec) produces discrete compressed representations of audio, and these representations can be used to reconstruct waveforms with the decoder (diagram: one code a per codebook 1..C and frame 1..T, with a = 0 .. K_a - 1, where K_a is the codebook size).
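
A toy sketch of residual vector quantization, the mechanism behind SoundStream-style codecs: each of the C codebooks quantizes the residual left over by the previous one, giving C indices per frame. The random codebooks, the dimensions, and the `rvq_encode` helper are illustrative assumptions; a real codec learns its codebooks end to end together with the encoder and decoder.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one frame with residual VQ: each codebook quantizes the
    residual left by the previous stage, yielding one index per codebook."""
    residual = frame.copy()
    indices = []
    for codebook in codebooks:                       # codebook shape: (K, D)
        dists = np.linalg.norm(codebook - residual, axis=1)
        k = int(np.argmin(dists))                    # nearest codeword
        indices.append(k)
        residual = residual - codebook[k]            # pass the residual on
    return indices

# Toy example: C = 3 codebooks of size K = 1024 for 128-dim frames.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(3)]
frame = rng.normal(size=128)
print(rvq_encode(frame, codebooks))                  # e.g. [512, 87, 903]
```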

  12. Model structure (diagram: INPUT → OUTPUT).

  13. Decoupled Pitch Representation: decompose F0 into two components, an average F0 and a rescaled F0 sequence, so the same song can be sung at a different pitch (figure: voice ranges F1–F4 for Male1, Male2, Female1, Female2). This enables prompt-conditioned voice-range manipulation while keeping melodic accuracy.
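
A minimal sketch of the decomposition described above: split an F0 contour into an average F0 (the voice range) and a rescaled F0 sequence (the melody shape). The `decouple_f0` helper, the unvoiced-frame handling, and the example values are assumptions for illustration.

```python
import numpy as np

def decouple_f0(f0):
    """Split an F0 contour into average F0 and a rescaled F0 sequence.
    Unvoiced frames (f0 == 0) are excluded from the average."""
    voiced = f0[f0 > 0]
    avg_f0 = float(voiced.mean()) if voiced.size else 0.0
    rescaled = np.where(f0 > 0, f0 / avg_f0, 0.0) if avg_f0 > 0 else np.zeros_like(f0)
    return avg_f0, rescaled

f0 = np.array([0.0, 220.0, 246.9, 261.6, 0.0, 293.7])   # Hz, with unvoiced frames as 0
avg, shape = decouple_f0(f0)

# To move the same melody into another voice range, pick a new average F0
# (e.g. implied by the prompted gender) and multiply: new_f0 = new_avg * shape.
```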

  14. Model structure (diagram: INPUT → OUTPUT).

  15. Lyrics Representation: labeled lyrics and durations are phonemized and force-aligned (e.g. phonemes "w o ai n ii"), then each phoneme is expanded by its duration (e.g. "w w o o o o o o o o o ai ai ai n ii ii ii"). Note: each expanded phoneme represents one frame.
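
A small sketch of the duration-based expansion step: repeat each phoneme for its number of frames so that the expanded sequence lines up one-to-one with the acoustic frames. The `expand_phonemes` helper and the example durations are illustrative; real durations come from forced alignment of the labeled lyrics.

```python
def expand_phonemes(phonemes, durations):
    """Repeat each phoneme for its duration (in frames), so the expanded
    sequence has exactly one symbol per acoustic frame."""
    expanded = []
    for ph, dur in zip(phonemes, durations):
        expanded.extend([ph] * dur)
    return expanded

# Example durations (in frames) as obtained from forced alignment.
print(expand_phonemes(["w", "o", "ai", "n", "ii"], [2, 9, 3, 1, 3]))
# ['w', 'w', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'ai', 'ai', 'ai', 'n', 'ii', 'ii', 'ii']
```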

  16. Model structure (diagram: INPUT → OUTPUT).

  17. Multi-Scale Transformer Architecture: decoder-only transformers (100M / 320M parameters) generate the sequence autoregressively. For non-acoustic modalities, each item is repeated C times to fit this modeling mechanism (example: C = 3; diagram: Frame 1, Frame 2, Frame 3). Trained on 6 NVIDIA V100 GPUs for about 4-5 days.
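
A tiny sketch of the repetition trick mentioned above: non-acoustic items (phonemes, pitch, prompt tokens) have no codebook dimension, so each one is repeated C times to line up with the C acoustic codes the codec produces per frame. The `repeat_for_codebooks` helper and the token names are hypothetical.

```python
def repeat_for_codebooks(tokens, C=3):
    """Repeat each non-acoustic token C times so its length matches the
    C codebook entries per acoustic frame."""
    return [tok for tok in tokens for _ in range(C)]

print(repeat_for_codebooks(["ph_w", "ph_o", "f0_220"], C=3))
# ['ph_w', 'ph_w', 'ph_w', 'ph_o', 'ph_o', 'ph_o', 'f0_220', 'f0_220', 'f0_220']
```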

  18. Alleviating Data Scarcity: evaluated on subjective relevance to the prompt, label (task) accuracy, and pitch quality.

  19. Inference latency
