Analysis of transformer-based architectures for simultaneous speech transcription and background audio description in mixed sound scenarios

Silva, João Vitor Roriz da


Files

JoaoVitorRorizdaSilva-2025-dissertacao.pdf (3.08 MB)

Date

2025-03-26

Authors

Silva, João Vitor Roriz da

Publisher

Universidade Federal do Espírito Santo

Abstract

This work investigates how two specialized neural networks, a speech transcription model (Whisper) and a general audio captioning model (Prompteus), can be jointly leveraged to process mixed audio inputs containing both speech and non-speech events. We construct the Clotho Voice dataset by merging speech recordings from the Common Voice 5.1 corpus with general sounds from the Clotho 2.1 dataset. Through a series of controlled experiments, we examine how each model's performance degrades when presented with overlapping speech and background sounds. Results show that Whisper excels at transcription when speech dominates the input signal, yet its accuracy diminishes in the presence of substantial non-speech noise. Conversely, Prompteus performs well in purely background-oriented settings but loses descriptive capability as speech levels increase. We also highlight how preprocessing steps, such as normalization and resampling, impact borderline cases, revealing that subtle audio features are crucial for robust event detection in challenging acoustic environments. Our findings underscore the importance of tailored training and data augmentation strategies to mitigate performance loss in mixed audio scenarios. By integrating the complementary strengths of speech-focused and background-focused models, we offer a pathway toward more comprehensive audio understanding systems suitable for noisy, real-world applications, including industrial automation and assistive technologies. This research paves the way for developing hybrid frameworks that capture both spoken language and context-rich environmental cues in a single, unified approach.
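
The abstract does not state how speech and background recordings were combined, so the following is only a minimal sketch of one common way to build such mixtures: scaling the background to a target signal-to-noise ratio (SNR) before adding it to the speech. The function names (`fit_length`, `mix_at_snr`, `peak_normalize`) and the SNR-based approach are assumptions made for illustration, not the procedure used in the dissertation. Both signals are assumed to be mono arrays already resampled to the same rate.

```python
import numpy as np


def fit_length(signal, length):
    """Tile or trim a 1-D signal so it has exactly `length` samples."""
    repeats = int(np.ceil(length / len(signal)))
    return np.tile(signal, repeats)[:length]


def mix_at_snr(speech, background, snr_db):
    """Add background to speech at the requested speech-to-background SNR (dB).

    Returns the mixture and the gain that was applied to the background.
    Hypothetical helper: the source does not describe its mixing procedure.
    """
    speech = np.asarray(speech, dtype=np.float64)
    background = fit_length(np.asarray(background, dtype=np.float64), len(speech))

    speech_power = np.mean(speech ** 2)
    background_power = np.mean(background ** 2)
    if speech_power == 0 or background_power == 0:
        raise ValueError("both signals must be non-silent")

    # Solve 10*log10(speech_power / (gain**2 * background_power)) == snr_db for gain.
    gain = np.sqrt(speech_power / (background_power * 10 ** (snr_db / 10)))
    return speech + gain * background, gain


def peak_normalize(signal, peak=0.99):
    """Rescale so the largest absolute sample equals `peak` (avoids clipping)."""
    max_abs = np.max(np.abs(signal))
    return signal if max_abs == 0 else signal * (peak / max_abs)
```

Sweeping `snr_db` from strongly positive (speech dominates) to strongly negative (background dominates) would yield the kind of graded mixtures over which a transcription model and a captioning model could each be evaluated.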

Keywords

Automatic speech transcription, Automatic background captioning, Automatic audio description, Whisper

URI

http://repositorio.ufes.br/handle/10/19765

Collections

Master's in Informatics
