AudioLog: LLMs-Powered Long Audio Logging with Acoustic Scenes and Events Joint Estimation

Author:

Jisheng Bai, Han Yin, Mou Wang, Dongyuan Shi, Woon-Seng Gan, Jianfeng Chen

Keyword:

Electrical Engineering and Systems Science, Audio and Speech Processing (eess.AS)

journal:

date:

2023-11-21

Abstract

Previous studies in automated audio captioning have struggled to capture the complete temporal details of acoustic scenes and events in long audio sequences. This paper presents AudioLog, a large language model (LLM)-powered audio logging system that uses multi-task learning of acoustic tasks. Specifically, we propose a joint training network for acoustic scene classification and sound event detection, obtained by fine-tuning a large audio model based on the pre-trained hierarchical token-semantic audio Transformer. We then use LLMs to generate audio logs that summarize textual descriptions of the acoustic environment. Experiments show that the proposed system achieves strong performance in acoustic scene classification and sound event detection, surpassing existing methods. Further analyses demonstrate AudioLog's ability to summarize long audio sequences effectively.
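The second stage described above, turning scene and event estimates into text that an LLM can summarize, can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneSegment` and `SoundEvent` types, the `describe` and `build_prompt` helpers, the timestamp format, and the prompt wording are all hypothetical, and the actual LLM call is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneSegment:
    """Hypothetical output of acoustic scene classification (times in seconds)."""
    start: float
    end: float
    label: str


@dataclass
class SoundEvent:
    """Hypothetical output of sound event detection (times in seconds)."""
    onset: float
    offset: float
    label: str


def fmt(t: float) -> str:
    """Format a time in seconds as MM:SS."""
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes:02d}:{seconds:02d}"


def describe(scenes: List[SceneSegment], events: List[SoundEvent]) -> List[str]:
    """List each scene, followed by every event that overlaps it in time."""
    lines = []
    for sc in scenes:
        lines.append(f"[{fmt(sc.start)}-{fmt(sc.end)}] scene: {sc.label}")
        for ev in events:
            # An event overlaps a scene if it starts before the scene ends
            # and ends after the scene starts.
            if ev.onset < sc.end and ev.offset > sc.start:
                lines.append(f"  [{fmt(ev.onset)}-{fmt(ev.offset)}] event: {ev.label}")
    return lines


def build_prompt(scenes: List[SceneSegment], events: List[SoundEvent]) -> str:
    """Assemble an assumed prompt asking an LLM for a concise audio log."""
    header = ("Summarize the following acoustic scenes and sound events "
              "as a concise audio log:")
    return "\n".join([header, *describe(scenes, events)])


if __name__ == "__main__":
    scenes = [SceneSegment(0, 60, "street"), SceneSegment(60, 120, "park")]
    events = [SoundEvent(5, 12, "car horn"), SoundEvent(58, 63, "dog bark")]
    print(build_prompt(scenes, events))
```

The resulting string would be passed to whichever LLM is available. An event that spans a scene boundary is listed under both scenes, which is one possible design choice among several.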

PDF: AudioLog: LLMs-Powered Long Audio Logging with Acoustic Scenes and Events Joint Estimation.pdf