
who we are

Audio & Speech Research Group is a research group under the Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences. The Principal Investigator of the group is Prof. Feiran Yang (杨飞然). Our research interests include Adaptive Filtering, Blind Source Separation, Array Signal Processing, Voice Anti-Spoofing, Acoustic Echo Cancellation, and Speech Enhancement.

The members of the group include Feiran Yang (杨飞然), Yi Wan (万伊), Taihui Wang (王泰辉), Shengdong Liu (刘升东), Zhengqiang Luo (罗正强), Yang Liu (刘杨), Jinfu Wang (王劲夫), Changtao Li (李长涛), Jing Lei (雷菁), Kelan Kuang (匡柯澜), Qing Shi (石擎), Lan Tang (汤澜), Cong Zhang (张聪), Haobo Jia (贾浩博), Dianzhe Ding (丁殿哲).

Author: Kelan Kuang
Email: kuangkelan@mail.ioa.ac.cn

Subjective Evaluation Test of Audio Quality

Thank you for participating in this subjective audio test. The test consists of two parts: first, you will rate the same utterance produced by different systems (mean opinion score, MOS); then, you will be asked to choose the better-quality utterance from each contrasted pair (paired comparison test). Detailed instructions for each part are given at the beginning of that part.

MOS

Reference samples: 1 point, 2 points, 3-4 points, 5 points

Please rate the intelligibility of the following utterances on a scale of 1 to 5. A score of 5 means the speech is highly intelligible, subjectively fully acceptable, and not at all harsh. A score of 4 means the speech is less acceptable, but not unpleasant. A score of 3 means flaws such as harsh electronic artifacts are perceptible. A score of 2 means such harsh artifacts are very noticeable. A score of 1 means the speech is completely unacceptable and very unpleasant. We provide reference utterances rated 1 and 5; you only need to listen to them to get an intuitive sense of the quality range, and should use them as anchors when rating the samples below. Note that this test judges speech quality and intelligibility rather than the content of the speech; some utterances may have strong accents, so please ignore accent-related issues. Each Group below (e.g., Group1) contains output speech from four different speech enhancement systems (s1-s4). After rating, please record your scores in the corresponding Excel spreadsheet.
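For completeness, here is a minimal sketch of how the recorded scores could later be aggregated into per-system MOS values, assuming a hypothetical spreadsheet `mos_scores.xlsx` with one row per rating and columns `group`, `system`, and `score`; the actual spreadsheet handed to participants may be laid out differently.

```python
import pandas as pd

# Hypothetical layout: one row per rating with columns "group", "system", "score".
ratings = pd.read_excel("mos_scores.xlsx")

# MOS is simply the mean of the 1-5 ratings, here computed per system,
# with the standard deviation and rating count as rough spread indicators.
mos = ratings.groupby("system")["score"].agg(["mean", "std", "count"])
print(mos.rename(columns={"mean": "MOS"}))
```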

Extremely low SNR:

Group1 s1 s2 s3 s4
Group2 s1 s2 s3 s4
Group3 s1 s2 s3 s4
Group4 s1 s2 s3 s4
Group5 s1 s2 s3 s4
Group6 s1 s2 s3 s4
Group7 s1 s2 s3 s4

Author: Changtao Li
Email: lichangtao@mail.ioa.ac.cn

Abstract

This paper investigates the joint use of bone-conducted and air-conducted speech in a multimodal speech enhancement framework. Starting from the backbone network architecture, we design a temporal two-tower network named BiNet, capable of directly processing both bone-conducted and noisy air-conducted speech inputs. BiNet employs two independent encoders to map bone-conducted and noisy air-conducted speech into a shared embedding space. A decoder is then utilized to reconstruct the target clean speech from the embedding features of both modalities. Additionally, skip connections are incorporated in BiNet to better capture the long-term and short-term temporal correlations in speech. Considering the significance of spectral components in speech perception, we adopt a multi-scale mel-spectrogram loss function as the training objective, which encourages the network to generate more plausible spectral details of the desired speech. This backbone network design allows us to apply regularization constraints based on contrastive learning. By controlling the similarity between the embedding features of bone-conducted and noisy air-conducted speech, we impose two regularization constraints on BiNet. When the embedding features of these two modalities exhibit higher similarity, the proposed BiNet achieves superior speech enhancement performance. Extensive experiments conducted on a recorded dataset of bone-conducted/air-conducted speech validate our approach. By combining the proposed model with contrastive learning regularization constraints, our method outperforms baseline models and several recent multimodal speech enhancement systems in terms of PESQ and STOI.
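To make the training objective more concrete, below is a minimal PyTorch sketch of a multi-scale mel-spectrogram loss and a cosine-similarity regularizer between the two encoders' embeddings. This is one common formulation written for illustration only; the exact scales, distance, and contrastive constraints used for BiNet may differ, and all names here are placeholders.

```python
import torch
import torchaudio


class MultiScaleMelLoss(torch.nn.Module):
    """L1 distance between log-mel spectrograms at several STFT resolutions."""

    def __init__(self, sample_rate=16000, n_ffts=(512, 1024, 2048), n_mels=80):
        super().__init__()
        self.mels = torch.nn.ModuleList([
            torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=n_fft,
                hop_length=n_fft // 4, n_mels=n_mels)
            for n_fft in n_ffts
        ])

    def forward(self, estimate, target):
        loss = 0.0
        for mel in self.mels:
            est_mel = torch.log(mel(estimate) + 1e-5)
            tgt_mel = torch.log(mel(target) + 1e-5)
            loss = loss + torch.nn.functional.l1_loss(est_mel, tgt_mel)
        return loss / len(self.mels)


def similarity_regularizer(bc_embedding, ac_embedding):
    # Encourage the two modalities' embeddings to be close in the shared space;
    # the paper's actual contrastive constraints may be formulated differently.
    return 1.0 - torch.nn.functional.cosine_similarity(
        bc_embedding.flatten(1), ac_embedding.flatten(1)).mean()
```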

Samples

We encourage our readers to listen to the following audio samples in order to experience the audio quality of our enhanced speech.

-10 dB: air-conducted speech, bone-conducted speech, noisy speech, UNet restoration, UNet denoising, DCCRN, UNet early fusion, involution, BiNet w/o regularization, BiNet w/ regularization
-5 dB: air-conducted speech, bone-conducted speech, noisy speech, UNet restoration, UNet denoising, DCCRN, UNet early fusion, involution, BiNet w/o regularization, BiNet w/ regularization
0 dB: air-conducted speech, bone-conducted speech, noisy speech, UNet restoration, UNet denoising, DCCRN, UNet early fusion, involution, BiNet w/o regularization, BiNet w/ regularization
5 dB: air-conducted speech, bone-conducted speech, noisy speech, UNet restoration, UNet denoising, DCCRN, UNet early fusion, involution, BiNet w/o regularization, BiNet w/ regularization

Author: Changtao Li
Email: lichangtao@mail.ioa.ac.cn

Abstract

Bone-conducted speech is not susceptible to background noise but suffers from poor speech quality and intelligibility due to its limited bandwidth. This paper proposes a two-stage approach to restore the quality of bone-conducted speech, namely, bandwidth extension followed by a speech vocoder. In the first stage, a deep neural network is trained to learn mappings from a low-resolution representation of the bone-conducted speech, i.e., the log Mel-scale spectrogram, to that of the air-conducted speech, which extends the bandwidth of the bone-conducted speech. In the second stage, a speech vocoder is employed to transform the extended log Mel-scale spectrogram of the bone-conducted speech back to time-domain waveforms. Due to the many-to-many correspondence between the air-conducted and bone-conducted speech, supervised learning may not be the best training protocol for the bone-conducted/air-conducted feature mapping. We thus propose to leverage adversarial training to further improve the bandwidth extension performance in the first stage. The two stages are decoupled and can be trained independently. The vocoder is trained on a large multi-speaker dataset and can generalize well to unknown speakers. Also, the vocoder can help to remedy the spectral artifacts introduced in the bandwidth extension stage. Objective and subjective evaluations on the ESMB dataset show that the proposed two-stage system substantially outperforms existing bone-conducted speech enhancement systems.
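As a rough illustration of the two-stage pipeline, the sketch below extracts a log Mel-scale spectrogram from bone-conducted speech, maps it with a stage-1 bandwidth-extension network, and hands the result to a mel-conditioned neural vocoder. The `bwe_net` and `vocoder` arguments are placeholders for trained models, not the specific networks used in the paper, and the STFT settings are illustrative.

```python
import torch
import torchaudio


def enhance_bone_conducted(bc_waveform, bwe_net, vocoder, sample_rate=16000):
    """Two-stage sketch: bandwidth extension in the log-mel domain, then vocoding.

    `bwe_net` and `vocoder` stand in for the trained stage-1 mapping network
    and a stage-2 mel-conditioned neural vocoder, respectively.
    """
    # Stage 1: low-resolution BC log-mel -> AC-like (bandwidth-extended) log-mel.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80)
    log_mel_bc = torch.log(mel(bc_waveform) + 1e-5)
    log_mel_ext = bwe_net(log_mel_bc)

    # Stage 2: the vocoder synthesises the time-domain waveform from the
    # extended spectrogram; it can be trained independently of stage 1.
    return vocoder(log_mel_ext)
```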

Samples

We encourage our readers to listen to the following audio samples in order to experience the audio quality of our enhanced speech. Due to the large number of audio files, loading may take some time; please wait a moment, or close other pages and refresh this one to speed up loading.

Sample1: DCCRN-AF, DCCRN-BC, S2, DPT-EGNet, Proposed, S1, WaveNet, BC, AC
Sample2: DCCRN-AF, DCCRN-BC, S2, DPT-EGNet, Proposed, S1, WaveNet, BC, AC

Author: Changtao Li
Email: lichangtao@mail.ioa.ac.cn

Restoration of Bone-Conducted Speech with U-Net-like Model and Energy Distance Loss

Audio samples

We encourage our readers to listen to the following audio samples in order to experience the audio quality of our enhanced speech.

Signals sample1 sample2 sample3 sample4
Bone-conducted Speech
Air-conducted Speech
DPT-EGNet
EBEN
Ours (l1 loss)
Ours (energy loss)

More samples

For more samples, see https://github.com/changtaoli/model-energy

Author: Taihui Wang
Email: wangtaihui@mail.ioa.ac.cn

Convolutive Transfer Function-Based Multichannel Nonnegative Matrix Factorization for Overdetermined Blind Source Separation

Taihui Wang 1,2, Feiran Yang 3, Member, IEEE, and Jun Yang 1,2, Senior Member, IEEE

1 Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing,
China

2 University of Chinese Academy of Sciences, Beijing, China

3 State Key Laboratory of Acoustics, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China

Abstract

Most multichannel blind source separation (BSS) approaches rely on a spatial model to encode the transfer functions from sources to microphones and a source model to encode the source power spectral density. The rank-1 spatial model has been widely exploited in independent component analysis (ICA), independent vector analysis (IVA), and independent low-rank matrix analysis (ILRMA). The full-rank spatial model is also considered in many BSS approaches, such as full-rank spatial covariance matrix analysis (FCA), multichannel nonnegative matrix factorization (MNMF), and FastMNMF, which can improve the separation performance in the case of long reverberation times. This paper proposes a new MNMF framework based on the convolutive transfer function (CTF) for overdetermined BSS. The time-domain convolutive mixture model is approximated by a frequency-wise convolutive mixture model instead of the widely adopted frequency-wise instantaneous mixture model. The iterative projection algorithm is adopted to estimate the demixing matrix, and the multiplicative update rule is employed to estimate nonnegative matrix factorization (NMF) parameters. Finally, the source image is reconstructed using a multichannel Wiener filter. The advantages of the proposed method are twofold. First, the CTF approximation enables us to use a short window to represent long impulse responses. Second, the full-rank spatial model can be derived based on the CTF approximation and slowly time-variant source variances, and close relationships between the proposed method and ILRMA, FCA, MNMF and FastMNMF are revealed. Extensive experiments show that the proposed algorithm achieves a higher separation performance than ILRMA and FastMNMF in reverberant environments.
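To illustrate the modelling difference, the NumPy sketch below contrasts the frequency-wise instantaneous mixture model with a CTF-based convolutive mixture model in the STFT domain; the dimensions, variable names, and CTF length are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, M, N, L = 257, 100, 6, 2, 4   # freq bins, frames, mics, sources, CTF taps

# Random complex source spectrograms and mixing systems, for illustration only.
S = rng.standard_normal((F, T, N)) + 1j * rng.standard_normal((F, T, N))
A_inst = rng.standard_normal((F, M, N)) + 1j * rng.standard_normal((F, M, N))
A_ctf = rng.standard_normal((F, L, M, N)) + 1j * rng.standard_normal((F, L, M, N))

# Instantaneous (narrowband) model: X[f, t] = A[f] @ S[f, t].
X_inst = np.einsum("fmn,ftn->ftm", A_inst, S)

# CTF model: each frequency is a short convolution along the frame axis,
# X[f, t] = sum_l A[f, l] @ S[f, t - l], which lets a short STFT window
# represent long room impulse responses.
X_ctf = np.zeros((F, T, M), dtype=complex)
for l in range(L):
    X_ctf[:, l:, :] += np.einsum("fmn,ftn->ftm", A_ctf[:, l], S[:, : T - l])
```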

Separated audio samples

The following table shows an example of a 2-music source separation task, where the public NMF model is used and the number of bases is set to 32. The reverberation time is 470 ms, and the number of channels is 6.

Source 1 Source 2
Clean reverberant sources
Mixture at the first microphone
Separated sources using CTF-MNMF

Author: Kelan Kuang
Email: kuangkelan@mail.ioa.ac.cn

Multi-Channel Speech Enhancement results

This repo presents enhanced speech from TriU-Net and other models.

Real recordings

SNR: -5 dB

Spk 1 is a 24-year-old male, and spk 2 is a 24-year-old female; both speak Mandarin.

Signals Spk 1 babble Spk 2 babble Spk 1 f16 Spk 2 f16
noisy
clean
oracle MB-MVDR
EaBNet
FaSNet
Model A
TriU-Net