Multi-Modal Speech Enhancement with BiNet and Contrastive Learning

Author: Changtao Li
Email: lichangtao@mail.ioa.ac.cn

Abstract

This paper investigates the joint use of bone-conducted and air-conducted speech in a multimodal speech enhancement framework. Starting from the backbone architecture, we design a temporal two-tower network named BiNet that directly processes both bone-conducted and noisy air-conducted speech inputs. BiNet employs two independent encoders to map bone-conducted and noisy air-conducted speech into a shared embedding space; a decoder then reconstructs the target clean speech from the embedding features of both modalities. Skip connections are incorporated in BiNet to better capture long-term and short-term temporal correlations in speech. Because spectral components are important to speech perception, we adopt a multi-scale mel-spectrogram loss as the training objective, which encourages the network to generate more plausible spectral details of the target speech. This backbone design also allows us to apply regularization constraints based on contrastive learning: by controlling the similarity between the embedding features of bone-conducted and noisy air-conducted speech, we impose two regularization constraints on BiNet. When the embedding features of the two modalities are more similar, BiNet achieves better speech enhancement performance. Extensive experiments on a recorded dataset of bone-conducted and air-conducted speech validate our approach. Combined with the contrastive learning regularization constraints, the proposed model outperforms baseline models and several recent multimodal speech enhancement systems in terms of PESQ and STOI.
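
To make the idea of controlling embedding similarity concrete, the following is a minimal sketch under stated assumptions, not the regularizer used in the paper. It swaps in a plain cosine-similarity penalty between paired per-frame embeddings from the two encoders, which is simpler than a full contrastive objective with negative pairs. The names `similarity_regularizer`, `total_loss`, `pull_together`, and `weight`, as well as the default weight of 0.1, are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A small constant avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def similarity_regularizer(bc_frames, ac_frames, pull_together=True):
    """Mean per-frame penalty on paired embeddings (hypothetical helper).

    bc_frames / ac_frames: lists of embedding vectors, one per time frame,
    standing in for the outputs of the bone-conducted and air-conducted
    encoders. With pull_together=True the penalty shrinks as the two
    modalities become more similar; with False it shrinks as they diverge.
    """
    sims = [cosine_similarity(b, a) for b, a in zip(bc_frames, ac_frames)]
    mean_sim = sum(sims) / len(sims)
    return 1.0 - mean_sim if pull_together else 1.0 + mean_sim


def total_loss(reconstruction_loss, bc_frames, ac_frames,
               weight=0.1, pull_together=True):
    """Reconstruction loss plus the weighted similarity penalty.

    reconstruction_loss is assumed to be computed elsewhere, for example
    by a multi-scale mel-spectrogram loss; weight is an assumed value.
    """
    penalty = similarity_regularizer(bc_frames, ac_frames, pull_together)
    return reconstruction_loss + weight * penalty
```

In this sketch, the two settings of `pull_together` stand in for two opposite ways of constraining the embeddings, so that one could compare enhancement quality when the modalities are pushed toward or away from each other in the shared space.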

Samples

We encourage readers to listen to the following audio samples to judge the quality of the enhanced speech for themselves.

-10 dB	air-conducted speech	bone-conducted speech	noisy speech	UNet restoration	UNet denoising
	DCCRN	UNet early fusion	involution	BiNet w/o regularization	BiNet w/ regularization

-5 dB	air-conducted speech	bone-conducted speech	noisy speech	UNet restoration	UNet denoising
	DCCRN	UNet early fusion	involution	BiNet w/o regularization	BiNet w/ regularization

0 dB	air-conducted speech	bone-conducted speech	noisy speech	UNet restoration	UNet denoising
	DCCRN	UNet early fusion	involution	BiNet w/o regularization	BiNet w/ regularization

5 dB	air-conducted speech	bone-conducted speech	noisy speech	UNet restoration	UNet denoising
	DCCRN	UNet early fusion	involution	BiNet w/o regularization	BiNet w/ regularization