2024

Multi-View Spectrogram Transformer for Respiratory Sound Classification

Wentao He, Yuchen Yan, Jianfeng Ren, Ruibin Bai, and Xudong Jiang

Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image while overlooking its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into different-sized patches, representing the multi-view acoustic elements of a respiratory sound. The patches and positional embeddings are fed into transformer encoders to extract the attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weigh the multi-view features to highlight the best one in a specific scenario. Experimental results on the ICBHI dataset demonstrate that the MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds. The code is available at: https://github.com/wentaoheunnc/MVST.
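
The multi-view idea in the abstract (split the same mel-spectrogram into patches of different shapes, encode each view, then let a gate weigh the views) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available at the GitHub link above. Everything here is an assumption for illustration: the patch shapes in the hypothetical `VIEWS` list, the `toy_encoder` that stands in for the transformer encoder (no positional embeddings or self-attention), and the softmax-normalized linear gate in `gated_fusion`, which is one common form of gated fusion and may differ from the paper's.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # spectrogram indexed as spec[frequency][time]


def split_patches(spec: Matrix, patch_f: int, patch_t: int) -> List[List[float]]:
    """Split a (frequency x time) spectrogram into non-overlapping
    patch_f x patch_t patches, each flattened row-major. Bins that do not
    fill a whole patch are dropped (an assumption; padding also works)."""
    n_f, n_t = len(spec), len(spec[0])
    patches = []
    for f0 in range(0, n_f - patch_f + 1, patch_f):
        for t0 in range(0, n_t - patch_t + 1, patch_t):
            patches.append([spec[f][t]
                            for f in range(f0, f0 + patch_f)
                            for t in range(t0, t0 + patch_t)])
    return patches


def toy_encoder(patches: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for a transformer encoder: summarizes one view
    as [mean, max, std] of its per-patch means, so every view yields a
    feature of the same size regardless of patch shape."""
    means = [sum(p) / len(p) for p in patches]
    mu = sum(means) / len(means)
    std = math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))
    return [mu, max(means), std]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features: List[List[float]], gate_w: List[float],
                 gate_b: float = 0.0) -> Tuple[List[float], List[float]]:
    """Score each view with a shared linear gate, normalize the scores with
    softmax, and return (weighted sum of view features, per-view weights)."""
    scores = [sum(w * h for w, h in zip(gate_w, feat)) + gate_b
              for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(a * feat[i] for a, feat in zip(weights, features))
             for i in range(dim)]
    return fused, weights


# Hypothetical views as (frequency bins, time frames) per patch:
# square patches, frequency-tall patches, and time-wide patches.
VIEWS = [(4, 4), (8, 2), (2, 8)]


def multi_view_features(spec: Matrix) -> Tuple[List[float], List[float]]:
    """Encode each view separately, then fuse the views with the gate."""
    feats = [toy_encoder(split_patches(spec, pf, pt)) for pf, pt in VIEWS]
    # Gate parameters would be learned in practice; fixed here for the demo.
    return gated_fusion(feats, gate_w=[1.0, 0.5, -0.5])


# Usage on a toy 8 x 8 "spectrogram".
spec = [[float(f + t) for t in range(8)] for f in range(8)]
fused, weights = multi_view_features(spec)
print("view weights:", [round(w, 3) for w in weights])
print("fused feature:", [round(x, 3) for x in fused])
```

Because the gate weights come from a softmax, they are non-negative and sum to 1, so the fused feature is a convex combination of the per-view features; a view whose gate score is much higher than the others dominates, which matches the abstract's description of highlighting the best view for a given input.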

Keywords: Spectrogram, Computer science, Transformer, Encoder, Speech recognition, Artificial intelligence, Pattern recognition (psychology), Engineering

Ruibin Bai, Director of Lab, Computer Science and Operations Research
