background
logo
ArxivPaperAI

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

Author:
Bohan Wang, Damien Ronssin, Milos Cernak
Keyword:
Electrical Engineering and Systems Science, Audio and Speech Processing, Audio and Speech Processing (eess.AS), Machine Learning (cs.LG), Sound (cs.SD)
journal:
--
date:
2023-05-31 16:00:00
Abstract
This paper presents ALO-VC, a non-parallel low-latency one-shot phonetic posteriorgrams (PPGs) based voice conversion method. ALO-VC enables any-to-any voice conversion using only one utterance from the target speaker, with only 47.5 ms future look-ahead. The proposed hybrid signal processing and machine learning pipeline combines a pre-trained speaker encoder, a pitch predictor to predict the converted speech's prosody, and positional encoding to convey the phoneme's location information. We introduce two system versions: ALO-VC-R, which uses a pre-trained d-vector speaker encoder, and ALO-VC-E, which improves performance using the ECAPA-TDNN speaker encoder. The experimental results demonstrate both ALO-VC-R and ALO-VC-E can achieve comparable performance to non-causal baseline systems on the VCTK dataset and two out-of-domain datasets. Furthermore, both proposed systems can be deployed on a single CPU core with 55 ms latency and 0.78 real-time factor. Our demo is available online.
PDF: ALO-VC: Any-to-any Low-latency One-shot Voice Conversion.pdf
Empowered by ChatGPT