Moritz Böhle: Computationally efficient neural network architectures for speech separation and their application to hearing aids

BCCN Berlin and TU Berlin


Abstract

In March 2018, the World Health Organisation estimated that around 466 million people worldwide have disabling hearing loss, a number that is expected to rise by almost 30 million people in the next four decades. While the quality of hearing aids keeps improving and many people rely on them to mitigate their hearing loss, these devices are often reported to be of limited use in situations with many interfering sound sources – the hard of hearing seem to have trouble filtering the desired target sound out of the amplified signal. To circumvent this problem, it would be desirable to preprocess the incoming sound before amplification, such that the amplified signal conveys only the target sound, e.g. the speech of the person one is talking to.
For many years, the problem of separating speech from background noise proved very hard to solve. Recent advances in deep learning, however, have enabled substantial progress on this problem, and the quality of the resulting filtered speech signals is already very good.
Nevertheless, deep learning models are notoriously expensive in terms of computation and memory, and they are commonly trained and evaluated on computationally powerful machines. In this project, we therefore investigated two types of neural networks for speech separation with respect to their efficiency, and a detailed analysis of the networks allowed us to distill the influence of different structural elements on the efficiency of a given model. All of our networks process the audio input in the Short-Time Fourier Transform (STFT) domain, as is commonly done in speech processing tasks. While such STFT inputs are usually processed with 2D convolutional networks, we highlight two advantages of not convolving along the frequency dimension: first, we argue that applying convolutions along this dimension implies a translational invariance that is not present in the data and could therefore be detrimental to training. Second, we show that omitting convolutions along the frequency dimension yields a significant gain in efficiency. Specifically, by changing the convolutional model to use convolutions along the time dimension only, we can, in our setting, save up to 97% of memory write operations and reduce the number of required Floating Point Operations (FLOPs) by 95% per convolutional layer when predicting one new STFT output vector, while at the same time increasing the number of filters per layer by a factor of 8. Additionally, this allows us to increase the number of layers while still requiring only 88% of the total memory. Of that total memory, the share used to store trainable model parameters rises from 4% to 96%; the remaining memory stores previous activation values of the individual layers. Despite using far fewer resources than the model that also convolves along the frequency dimension, the more efficient model also achieves better performance on two commonly employed quality measures. Of the three models we analyzed, only one did not use convolutions along the frequency dimension; owing to its significant efficiency gains, it was the only model able to run in real time on a Snapdragon 835 mobile CPU.
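
To make the efficiency argument concrete, the following sketch (in PyTorch, with purely illustrative sizes – the number of frequency bins, kernel sizes and channel counts are assumptions, not the configurations analyzed in the project) contrasts a hidden layer that convolves over the (frequency, time) plane of the STFT with a layer that convolves along time only and treats the frequency bins as channels.

    # Minimal sketch (PyTorch): 2D convolution over the (frequency, time) plane
    # versus a 1D convolution along time only, in which the frequency bins are
    # folded into the channel dimension. All sizes below are illustrative
    # assumptions, not the configurations analyzed in the project.
    import torch.nn as nn

    F_BINS = 257        # STFT frequency bins per frame (assumed)
    K_F, K_T = 3, 3     # kernel sizes along frequency and time (assumed)
    C_2D = 64           # channels of a hidden layer in the 2D model (assumed)
    C_1D = 8 * C_2D     # the time-only model can afford 8x as many filters

    # 2D model: convolves along both frequency and time.
    conv2d = nn.Conv2d(in_channels=C_2D, out_channels=C_2D,
                       kernel_size=(K_F, K_T), padding=(K_F // 2, 0))

    # Time-only model: frequency bins become channels, convolution runs along time.
    conv1d = nn.Conv1d(in_channels=C_1D, out_channels=C_1D, kernel_size=K_T)

    def n_params(module):
        return sum(p.numel() for p in module.parameters())

    # Multiply-accumulates needed to emit ONE new output time step:
    # the 2D layer must compute every frequency position of the new frame,
    # the 1D layer only a single output vector of length C_1D.
    macs_2d = C_2D * C_2D * K_F * K_T * F_BINS
    macs_1d = C_1D * C_1D * K_T

    # Output values written per new time step (a rough proxy for memory writes).
    writes_2d = C_2D * F_BINS
    writes_1d = C_1D

    print(f"2D conv: {n_params(conv2d):>9,d} params, {macs_2d:>10,d} MACs, {writes_2d:>6,d} writes per frame")
    print(f"1D conv: {n_params(conv1d):>9,d} params, {macs_1d:>10,d} MACs, {writes_1d:>6,d} writes per frame")

With these example values, the time-only layer needs roughly twelve times fewer multiply-accumulate operations and about thirty times fewer output writes per new frame, even though it carries eight times as many filters; the exact percentages quoted above depend, of course, on the specific architectures analyzed in the project.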

Organized by

Henning Sprekeler / Robert Martin

Location

BCCN Berlin, lecture hall, Philippstr. 13 Haus 6, 10115 Berlin


