Jagabandhu Mishra, Amartya Chowdhury, S. R. Mahadeva Prasanna
Keyword:
Electrical Engineering and Systems Science, Audio and Speech Processing, Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Sound (cs.SD)
journal:
--
date:
2023-06-21 16:00:00
Abstract
Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedding vectors. Hence this work initially explores the available speaker diarization frameworks that capture speaker information implicitly to perform LD tasks. The performance of the LD system on synthetic code-switch data using the end-to-end x-vector approach is 6.78% and 7.06%, and for practical data is 22.50% and 60.38%, in terms of diarization error rate and Jaccard error rate (JER), respectively. The performance degradation is due to the data imbalance and resolved to some extent by using pre-trained wave2vec embeddings that provide a relative improvement of 30.74% in terms of JER.