Analysis of transformer-based architectures for simultaneous speech transcription and background audio description in mixed sound scenarios

Date
2025-03-26
Authors
Silva, João Vitor Roriz da
Publisher
Universidade Federal do Espírito Santo
Abstract
This work investigates how two specialized neural networks, a speech transcription model (Whisper) and a general audio captioning model (Prompteus), can be jointly leveraged to process mixed audio inputs containing both speech and non-speech events. We construct the Clotho Voice dataset by merging speech recordings from the Common Voice 5.1 corpus with general sounds from the Clotho 2.1 dataset. Through a series of controlled experiments, we examine how each model's performance degrades when presented with overlapping speech and background sounds. Results show that Whisper excels at transcription when speech dominates the input signal, yet its accuracy diminishes in the presence of substantial non-speech noise. Conversely, Prompteus performs well in purely background-oriented settings but loses descriptive capability as speech levels increase. We also show how preprocessing steps, such as normalization and resampling, affect borderline cases, revealing that subtle audio features are crucial for robust event detection in challenging acoustic environments. Our findings underscore the importance of tailored training and data augmentation strategies to mitigate performance loss in mixed audio scenarios. By integrating the complementary strengths of speech-focused and background-focused models, we offer a pathway toward more comprehensive audio understanding systems suitable for noisy, real-world applications, including industrial automation and assistive technologies. This research paves the way for developing hybrid frameworks that capture both spoken language and context-rich environmental cues in a single, unified approach.
Keywords
Automatic speech transcription, Automatic background captioning, Automatic audio description, Whisper