Speech enhancement for non-stationary noise around a machine cabin
Master's thesis in Sound and Vibration
YILIANG ZHOU
Department of Architecture and Civil Engineering, Division of Applied Acoustics
Chalmers University of Technology, Gothenburg, Sweden 2023
www.chalmers.se

© YILIANG ZHOU, 2023.
Supervisor: Nicklas Frenne, Volvo Construction Equipment AB
Examiner: Jens Ahrens, Division of Applied Acoustics
Master's Thesis 2023
Department of Architecture and Civil Engineering, Division of Applied Acoustics
Chalmers University of Technology, SE-412 96 Gothenburg, Telephone +46 31 772 1000
Cover: The various sound sources around the VOLVO L60H wheel loader used for testing. The illustration is from the previous thesis student Tomoya Otsuka. Image URL: https://www.volvoce.com/europe/en/products/wheel-loaders/l60h/
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria. Printed by Chalmers Reproservice, Gothenburg, Sweden 2023

Abstract
This thesis is concerned with speech enhancement in the presence of non-stationary noise around a machine cabin, allowing speech from outside to reach the cabin while unwanted noise is reduced. The application scenario of this work is a signal processing system that uses a microphone array outside the cabin to capture signals and applies different algorithms to enhance the speech signals. With the help of this system, the machine operator is able to receive speech information in a noisy environment.
The noise sources in this situation are complex and non-stationary; examples include engine noise, traffic noise, and other construction activities. Previous work by Tomoya chose the microphone array configuration and developed the beamforming method. From the results of that valuable work, it was found that beamforming is able to increase the signal-to-noise ratio (SNR) in the current situation, but the sound quality and SNR are still limited due to the low input SNR and the non-stationary noise environment. Therefore, modified beamforming and new methods are implemented in this work. Noise cancellation (NC) predicts the transfer path of the noise signal and removes the noise by minimizing the error of the output. Noise suppression (NS) uses a spectral subtraction scheme to subtract the noise spectrum from the noisy speech spectrum. A combination of beamforming and noise cancellation and a combination of beamforming and noise suppression are developed and evaluated. The results show better performance for this low-input-SNR, non-stationary noise case.

Keywords: Speech enhancement, Non-stationary noise environment, Beamforming, Noise cancellation, Noise estimation.

Acknowledgements
I appreciate all the valuable advice and instructions from Nicklas Frenne, my supervisor at Volvo Construction Equipment AB. He was always active and helpful throughout the thesis project, and it was a pleasure to work with him and Volvo Construction Equipment. I would also like to show my appreciation to Jens Ahrens, my supervisor and examiner at Chalmers, who gave me a lot of support and suggestions for my thesis. Thank you, Sophie Poulsen (my dear girlfriend), for patiently pushing me to finish this thesis faster. Finally, I wish to thank my family and friends for their unconditional love and encouragement throughout my life.
Yiliang Zhou, Copenhagen, October 2023

List of Acronyms
Below is the list of acronyms used throughout this thesis:

SNR Signal-to-Noise Ratio
STFT Short-Time Fourier Transform
NC Noise Cancellation
FRF Frequency Response Function
FFT Fast Fourier Transform
IFFT Inverse Fast Fourier Transform
LMS Least Mean Squares
FXLMS Filtered-X Least Mean Squares
NS Noise Suppression
CNN Convolutional Neural Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory
MCRA Minima Controlled Recursive Averaging
PSD Power Spectral Density
LLR Log-Likelihood Ratio
WSS Weighted Spectral Slope
SPP Speech Presence Probability
PESQ Perceptual Evaluation of Speech Quality

Contents
List of Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Purpose
1.2 Related Work
1.3 Previous Work and Setup
1.4 Structure of the thesis
2 Theory
2.1 Stationary noise and Non-Stationary noise
2.1.1 Stationary Noise
2.1.2 Non-Stationary Noise
2.2 Beamforming
2.3 Noise cancellation
2.4 Noise suppression
2.4.1 Overview
2.4.2 General spectral-subtractive speech enhancement configuration
2.4.3 Minima Controlled Recursive Averaging (MCRA) noise estimation
2.5 Musical noise
2.6 Signal-to-Noise Ratio (SNR)
3 Methods
3.1 Beamforming
3.2 Noise cancellation with Beamforming
3.3 Noise suppression with beamforming
4 Results
4.1 Beamforming
4.1.1 Beamforming SNR
4.1.2 Robustness check by moving the source position
4.2 Noise cancellation with beamforming
4.2.1 Spectrogram comparison
4.2.2 Robustness check by changing the input noise percentage
4.3 MCRA noise suppression with beamforming
4.3.1 Spectrogram comparison and Coherence
4.3.2 Robustness check by changing the input noise percentage
4.4 Informal listening for different methods
5 Conclusion
5.1 Discussions
5.2 Limitations
5.3 Real-world applications
Bibliography
A Appendix 1

List of Figures
1.1 Optimal microphone positions from the previous work, drawing, and the real picture. Image: Nicklas Frenne, March 22, 2022
2.1 Delay-sum beamforming diagram. Image from [11]
2.2 Noise Cancellation theory in current case [12]
2.3 General form of the spectral subtraction algorithm. Flow chart from [10]
2.4 Theory of MCRA algorithm. Flow chart from [10]
3.1 Flow chart of beamforming algorithm
3.2 Flow chart of Noise cancellation with Beamforming algorithm
3.3 Flow chart of noise suppression with beamforming algorithm
4.1 Beamforming target point and microphone positions
4.2 Output SNR when the source is moving in the blue square (0.5 m shift from center)
4.3 Output SNR when the source is moving in the blue square (1 m shift from center)
4.4 Output SNR when the source is moving in the blue square (3 m shift from center)
4.5 Spectrogram of noise cancellation with beamforming algorithm
4.6 Robustness check for first Noise suppression then beamforming
4.7 Spectrogram of noise suppression with beamforming algorithm
4.8 Coherence of clean speech and the result of different methods
4.9 Robustness check for first noise suppression then beamforming
5.1 Control Panel for Application

List of Tables
4.1 Comparison of SNR between single microphone and beamforming output
4.2 Subjective evaluation for different methods

1 Introduction
Speech communication is essential in many environments, including vehicle cabins. However, background noise and engine noise can interfere with speech intelligibility and pose challenges to clear communication between the machine operator and other individuals. At Volvo Construction Equipment AB, some machines are equipped with cabins that are acoustically well isolated from the environment. Inside the cabin, good noise reduction is normally achieved, which means that noise and sound from outside are kept out.
However, there is a need to allow specific sound sources, such as speech, to pass through. To address this problem, the key is a system that captures the mixed signals and processes them to extract the clear signals of interest, i.e., the speech signals. Previous work developed a system for speech enhancement and noise reduction using a 10-microphone array and beamforming. However, the results were not optimal in this low-input-SNR, non-stationary noise situation. Therefore, the aim of this thesis is to optimize and complete the signal-processing system, implement new methods for speech enhancement and noise reduction for the vehicle cabin in order to achieve better performance, and investigate the limitations of the different methods.

1.1 Purpose
By improving the existing system, this research will contribute to enhancing speech intelligibility and enabling clear communication in noisy vehicle cabins. This thesis will explore and compare different signal processing techniques and their limits to optimize the performance of the system, with a focus on improving speech quality and reducing noise. The results of this research will be valuable for developing effective communication systems in noisy environments, which can have important applications in various industries, including communication, construction, and transportation.

1.2 Related Work
Ideally, we would like a speech enhancement process to improve both quality and intelligibility. It is possible to reduce the background noise, but at the expense of introducing speech distortion, which in turn may impair speech intelligibility. Hence, the main challenge in designing effective speech enhancement algorithms is to suppress noise without introducing any perceptible distortion in the signal. Thus far, most speech enhancement algorithms have been found to improve only
This is also a limitation of this thesis that we are getting less improvement in speech intelligibility. The solution to the general problem of speech enhancement depends largely on the application at hand, the number of microphones or sensors available, the relationship (if any) of the noise to the clean signal, and the characteristics of the noise source or interference. The interference could be noise-like (e.g., fan noise) or speech-like such as an environment (e.g. a restaurant) with competing speakers [3]. Furthermore, the number of microphones available can influence the performance of speech enhancement algorithms. Typically, the larger the number of microphones, the easier the speech enhancement task becomes. Adaptive cancellation techniques can be used when at least one microphone is near the noise source. The noise may also be statistically correlated or uncorrelated with the clean speech signal, like the minimum mean-square error information[4]. Regarding the on-site situation in this thesis and these three facts, three kinds of corresponding methods are implemented. They are Beamforming, noise cancellation, and noise suppression. 1.3 Previous Work and Setup Previous work was done by Tomoya by exploring the optimal arrangement for beam- forming microphone positions and recording relevant noise and speech signals. Com- parison between different microphone arrangements was evaluated and the final choice had the best performance among all. The optimal setup is shown on the left side of Figure. 1.1, and 6 locations are selected around the cabin for measuring. There are 10 microphones in all in the beamforming setup. The microphone setup is symmetric along the Y-axis. The names are marked on the right side of Figure. 1.1. (Front Right 1 and 2, Rear Right 1 and 2, and Corner West 1, on the other side they are Front Left 1 and 2, Rear Left 1 and 2, and Corner East 1). 
After choosing the optimal positions, three kinds of measurements were conducted: frequency response function measurements, speech recordings, and machine recordings. The source was placed at 6 different locations, and at each location the frequency response functions were measured and noisy speech was recorded. The measurement plan is shown in Appendix A. After all the measurements were collected, the delay-and-sum beamforming method was used to implement the signal processing. The SNR improvement is shown in the results, Chapter 4.

Figure 1.1: Optimal microphone positions from the previous work, drawing, and the real picture. Image: Nicklas Frenne, March 22, 2022

1.4 Structure of the thesis
Following the introduction, Chapter 2 presents the theory needed in this thesis: the theory behind the different methods and the theory for speech quality evaluation. Chapter 3 gives the methodology applied in conducting this thesis; beamforming, noise cancellation, and noise suppression are used individually and in combination. Chapter 4 displays the results of the different methods; both objective and subjective results are shown, and the robustness test results are also included in this chapter. Further discussion of the results and limitations, including elements that could have been handled differently, and suggestions for real applications are included in Chapter 5.

2 Theory
In most applications, the aim of speech enhancement is to improve the quality and intelligibility of degraded speech. The improvement in quality is highly desirable as it can reduce listener fatigue, particularly in situations where the listener is exposed to high levels of noise for long periods of time (e.g., manufacturing) [10].

2.1 Stationary noise and Non-Stationary noise
Noise is an unwanted random variation that can corrupt or degrade a signal. Understanding the characteristics of noise is crucial for developing effective noise reduction techniques.
Noise can be broadly categorized into two types: stationary noise and non-stationary noise.

2.1.1 Stationary Noise
Stationary noise is a type of noise that exhibits constant statistical properties over time. In other words, the statistical parameters of stationary noise, such as mean and variance, remain constant or change very slowly over short time intervals. Stationary noise is often characterized by its spectral properties, which can be represented by a power spectral density (PSD) that remains relatively constant across time. Common examples of stationary noise include white noise, steady machine noise, and certain types of background hum. Mathematically, if we denote the clean speech signal as s[t] and the stationary noise component as v[t], we can model the observed signal x[t] as the sum of the two:

x[t] = s[t] + v[t] (2.1)

The stationary noise v[t] is often assumed to be uncorrelated with the speech signal s[t] and can be represented by a time-invariant noise PSD N(f). This assumption allows for the development of various noise reduction techniques based on spectral subtraction, Wiener filtering, and other linear filtering approaches. Examples of stationary noise include white Gaussian noise, pink noise, and many background noises such as the hum of electrical equipment or the hiss in an audio recording. Because the mean and variance of stationary noise remain constant over time, the average value of the noise signal does not change, and the spread or dispersion of the noise values remains the same.

2.1.2 Non-Stationary Noise
Non-stationary noise, in contrast, is noise whose statistical properties change significantly over time. Non-stationary noise sources are time-varying and can be more challenging to model and suppress than stationary noise.
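The stationary/non-stationary distinction can be illustrated with a short-time statistics check. The sketch below is illustrative only (synthetic Python signals, not the thesis recordings): it compares the frame-to-frame variance of a constant-variance noise with that of a noise whose envelope changes over time.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs * 2) / fs  # two seconds of signal

# Stationary noise: white noise with a fixed variance.
stationary = rng.normal(0.0, 0.1, t.size)
# Non-stationary noise: the same kind of source, but with a time-varying envelope.
envelope = 0.05 + 0.3 * (t / t[-1])
nonstationary = rng.normal(0.0, 1.0, t.size) * envelope

def frame_variances(x, frame_len=1024):
    """Variance of consecutive frames; roughly constant for stationary noise."""
    n = (x.size // frame_len) * frame_len
    return x[:n].reshape(-1, frame_len).var(axis=1)

v_stat = frame_variances(stationary)
v_nonstat = frame_variances(nonstationary)

# Relative spread of the frame variances: small if the noise is stationary.
spread_stat = v_stat.std() / v_stat.mean()
spread_nonstat = v_nonstat.std() / v_nonstat.mean()
```

For the stationary noise the relative spread of the frame variances stays small, while the time-varying envelope produces a much larger spread; tracking such changes over time is exactly what the noise estimators discussed later in this chapter must handle.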
Examples of non-stationary noise include complex engine noise, traffic noise, babble noise, and environmental sounds with time-varying characteristics. Modeling non-stationary noise requires more advanced techniques, as traditional stationary noise reduction methods may not be effective. Adaptive filtering algorithms and time-frequency analysis approaches are commonly employed to track and adapt to the changing properties of non-stationary noise in real time. Mathematically, we can represent the observed signal x′[t] corrupted by non-stationary noise v′[t] as:

x′[t] = s[t] + v′[t] (2.2)

Unlike in the stationary case, the non-stationary noise v′[t] is not assumed to be uncorrelated with the speech signal s[t]. Its characteristics may vary significantly across time and frequency, requiring more sophisticated algorithms for effective noise reduction. Non-stationary noise poses additional challenges for noise reduction algorithms because the statistical properties of the noise change rapidly. As a result, traditional noise reduction techniques that assume stationary noise may not be effective against it. To deal with non-stationary noise, advanced techniques such as time-frequency analysis and adaptive filtering are often employed [13]. These techniques aim to track and adapt to the changing characteristics of the noise over time, allowing for more effective noise reduction.

2.2 Beamforming
Beamforming is a versatile approach to spatial filtering that has found applications in diverse fields such as radar, sonar, wireless communication, and medical imaging. It is a technique that focuses a transmitted or received signal in a specific direction, effectively enhancing the signal-to-noise ratio and improving system performance. The concept of beamforming dates back to the early 20th century, with its roots in antenna design and radar systems. Over the years, it has evolved and adapted to various technological advancements.
Beamforming relies on constructive and destructive interference to steer a beam of electromagnetic (or acoustic) waves in a desired direction. This can be achieved through various algorithms and techniques, including delay-and-sum, minimum variance, and adaptive beamforming [2]. In the microphone-array setting, beamforming enhances the desired signal by selectively combining the signals from an array of microphones: it focuses the array response towards the desired direction while suppressing interference from other directions. The beamformer weights are typically determined based on the spatial properties of the signal and the noise, as well as the geometry of the microphone array.

In Figure 2.1, the upper diagram displays a target location with differing delays (t2, t1, and 0) applied to the microphone signals. These delays are deliberately chosen such that the signal from the target direction interferes constructively when summed, while signals from other directions are not aligned by the delays and remain at their original amplitude, thus increasing the signal-to-noise ratio (SNR). However, depending on the signal frequency and the spacing of the microphones, a signal from an undesired direction may also be amplified. In the process of creating the delays and virtually moving the source location, the propagation time ∆t between two points is calculated as follows:

∆t = √((xA − xB)² + (yA − yB)²) / c (2.3)

where xA, yA and xB, yB are the coordinates of the evaluation points and c is the speed of sound. Since the signals in this thesis are mainly speech signals with noise, a broadband beamforming process was implemented. The method is described in Section 3.1, where frequency-domain beamforming is performed; the inverse frequency response function stores the phase and amplitude information used for the delay-and-sum beamforming.

Figure 2.1: Delay-sum beamforming diagram.
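The delay-and-sum principle and the arrival-time computation of Equation 2.3 can be sketched as follows. The geometry (three microphones and one steering point) is a hypothetical toy example chosen only for illustration; it is not the thesis measurement setup.

```python
import numpy as np

c = 343.0   # speed of sound in air [m/s]
fs = 48000  # sampling rate [Hz]

# Hypothetical 2-D geometry (illustration only, not the thesis setup):
mics = np.array([[0.0, 0.0], [0.3, 0.0], [0.6, 0.0]])  # microphone coordinates [m]
src = np.array([0.3, 2.0])                              # steering point [m]

# Arrival-time differences, as in Equation 2.3 (relative to the first mic).
dists = np.linalg.norm(mics - src, axis=1)
delays = (dists - dists[0]) / c             # seconds
shifts = np.round(delays * fs).astype(int)  # integer-sample approximation

rng = np.random.default_rng(1)
sig = rng.standard_normal(fs // 10)  # broadband stand-in for a speech signal

# Simulate propagation: each mic sees the source delayed by its travel time,
# plus uncorrelated sensor noise.
x = np.stack([np.roll(sig, s) for s in shifts])
x += 0.5 * rng.standard_normal(x.shape)

# Delay-and-sum: undo the per-channel delays, then average the channels.
aligned = np.stack([np.roll(ch, -s) for ch, s in zip(x, shifts)])
output = aligned.mean(axis=0)
```

Averaging M time-aligned channels leaves the coherent signal unchanged while reducing the power of uncorrelated sensor noise by roughly a factor of M, which is the SNR gain exploited by the delay-and-sum beamformer.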
Image from [11]

2.3 Noise cancellation
Noise cancellation algorithms are a class of digital signal processing techniques used to suppress unwanted noise in a corrupted signal, enhancing the quality of the desired signal. These algorithms find extensive applications in speech processing, audio enhancement, and communication systems. This section provides an overview of noise cancellation principles and explains the assumptions made in this thesis in order to perform noise cancellation.

Figure 2.2: Noise Cancellation theory in the current case [12]

Noise cancellation is based on separating the desired signal from the background noise, assuming that the noise can be modeled or estimated. From Figure 2.2, the corrupted signal received by the speech-and-noise microphone can be represented as in Equation 2.1: x[t] = s[t] + v[t], where x[t] is the observed signal, s[t] is the desired signal, and v[t] is the noise component. The goal of noise cancellation algorithms is to estimate or approximate the noise v[t] and then subtract it from x[t] to obtain an enhanced version of the desired signal s[t] [12]. In this case, suppose the speech comes from position 2 and the noise comes mainly from around the cabin. We assume that the five microphones on the left side primarily receive the noise signals and the five microphones on the right side receive both noise and speech signals. u[n] is the noise microphone signal, which mainly contains noise. An adaptive filter is applied to predict the noise signal y[n] at the speech-and-noise microphone position while minimizing the error of the output. If the prediction is close to the actual noise, the minimized error signal is an estimate of the desired speech signal. Adaptive noise cancellation is a widely used technique that employs adaptive filtering to estimate and remove the noise component from the observed signal.
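The adaptive-filtering idea described above can be sketched with a plain LMS noise canceller. This is a simplified stand-in for the FXLMS scheme detailed next: no secondary path is simulated, so the extra "filtered-x" step is not needed here, and the signals and transfer path are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
s = np.sin(2 * np.pi * 0.01 * np.arange(n))  # stand-in "speech" signal
u = rng.standard_normal(n)                    # noise-reference microphone u[n]

# Assumed (unknown to the filter) FIR transfer path from noise source to speech mic.
path = np.array([0.6, -0.3, 0.1])
v = np.convolve(u, path)[:n]  # noise as seen at the speech-and-noise mic
x = s + v                     # speech-and-noise microphone signal

# LMS adaptive canceller: predict v from u, output the residual e = x - y.
L, mu = 8, 0.01   # filter length and adaptation step size
w = np.zeros(L)
e = np.zeros(n)
for k in range(L, n):
    u_vec = u[k - L + 1:k + 1][::-1]  # most recent L reference samples
    y = w @ u_vec                     # noise estimate at the speech mic
    e[k] = x[k] - y                   # residual: enhanced speech estimate
    w += mu * e[k] * u_vec            # LMS coefficient update (Equation 2.4 form)
```

After convergence the residual e[k] approximates the speech s[k], since the filter has learned the noise transfer path; FXLMS adds a model of the secondary path between filter output and error sensor, which matters in active control setups.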
The algorithm uses an adaptive filter to approximate the characteristics of the noise and adaptively updates its coefficients to minimize the error between the desired signal and the filtered signal. The adaptive noise cancellation algorithm used here is the Filtered-X Least Mean Squares (FXLMS) algorithm. In this approach, the adaptive filter's coefficients are updated iteratively based on the error between the reference signal (estimated noise) and the observed signal. It extends the classical Least Mean Squares (LMS) algorithm to efficiently adapt filters for the purpose of reducing unwanted components in signals. The core update equation of the FXLMS algorithm can be expressed as follows:

θ(k + 1) = θ(k) + µ · e(k) · x(k) (2.4)

where:
θ(k + 1): updated filter coefficients at iteration k + 1
θ(k): current filter coefficients at iteration k
µ: adaptation step size (learning rate)
e(k): error signal at iteration k
x(k): reference input at iteration k

The goal of the FXLMS algorithm is to minimize the error signal e(k), which is the difference between the desired output and the actual output. By iteratively adjusting the filter coefficients using this formula, the algorithm seeks to converge to a set of coefficients that effectively cancels or reduces the unwanted components in the signal. To better understand how the FXLMS algorithm works, consider an illustrative example. Suppose you have a reference input x(k) that contains noise, and your goal is to cancel this noise from the output signal. The FXLMS algorithm adapts a filter with coefficients θ(k) to generate an estimate of the noise, denoted as n̂(k). The filtered output ŷ(k) can then be computed as:

ŷ(k) = x(k) ∗ θ(k) (2.5)

where:
ŷ(k): filtered output at iteration k
x(k): reference input at iteration k
θ(k): filter coefficients at iteration k
∗: convolution operation
The error signal e(k) is calculated as the difference between the desired output and the filtered output:

e(k) = d(k) − ŷ(k) (2.6)

where:
e(k): error signal at iteration k
d(k): desired output at iteration k

The FXLMS algorithm uses this error signal to update the filter coefficients according to Equation 2.4. Through this iterative process, the filter adapts to minimize the error, ultimately resulting in an effective reduction of noise in the output signal.

2.4 Noise suppression
2.4.1 Overview
Noise suppression is a crucial aspect of various applications, including audio signal processing, speech recognition, and telecommunications. Over the years, researchers have explored and developed a wide range of techniques to reduce or eliminate unwanted noise from signals. This overview summarizes some key methods and studies related to noise suppression. One thing to mention is that noise suppression is a term with a wide definition; the methods introduced in this section expect no prior knowledge about the noise, which distinguishes them from beamforming and noise cancellation. Popular approaches include spectral subtraction, Wiener filtering, and deep learning-based approaches. One of the earliest methods for noise reduction is the spectral subtraction technique. This approach estimates the noise power spectral density and subtracts it from the noisy signal's spectrum, and it has been widely used in applications such as speech enhancement and audio denoising [4]. Wiener filtering, introduced by Norbert Wiener, is a classical method for signal estimation and noise reduction. It aims to minimize the mean square error between the desired signal and the estimated signal. This approach has found applications in image processing, audio, and speech signal enhancement [6].
Deep learning techniques, particularly Convolutional Neural Networks (CNNs), have shown remarkable success in noise suppression tasks. These networks learn complex features directly from the data and can effectively reduce noise in various applications, including audio and image processing [10]. Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) networks, are employed in speech enhancement tasks to capture temporal dependencies and context. RNN-based models have demonstrated state-of-the-art results in speech denoising [10]. The specific method used in this thesis is Minima Controlled Recursive Averaging (MCRA), a noise-estimation method used within a spectral-subtractive scheme.

2.4.2 General spectral-subtractive speech enhancement configuration

Figure 2.3: General form of the spectral subtraction algorithm. Flow chart from [10]

The general form of the spectral subtraction algorithm follows the process shown in Figure 2.3. Suppose the speech input is represented as y[t] = x[t] + v[t], where the noisy speech y[t] equals the sum of the clean speech x[t] and the noise v[t]. This signal is divided into overlapping frames by the application of a window function and analyzed using the short-time Fourier transform (STFT). Specifically,

Y(k, l) = Σ_{n=0}^{N−1} y(n + lM) h(n) e^{−j(2π/N)nk} (2.7)

where k is the frequency bin index, l is the time frame index, h is an analysis window of size N (e.g., a Hanning window), and M is the framing step (the number of samples separating two successive frames). Let X(k, l) denote the STFT of the clean speech; its estimate is obtained by applying a gain function to each spectral component of the noisy speech signal:

X̂(k, l) = G(k, l) Y(k, l) (2.8)

Using the inverse STFT, with a synthesis window ĥ that is biorthogonal to the analysis window h, the estimate of the clean speech signal is given by
x̂(n) = Σ_l Σ_{k=0}^{N−1} X̂(k, l) ĥ(n − lM) e^{j(2π/N)k(n−lM)} (2.9)

where the inverse STFT is efficiently implemented using the weighted overlap-add method. The crucial step in this process is the noise spectrum estimation. One approach to noise estimation is to assume that the noise is stationary and estimate its statistics from a segment of the audio signal where no speech is present. Another approach exploits the fact that the noise is often present alone during silent intervals and estimates its statistics from those intervals.

2.4.3 Minima Controlled Recursive Averaging (MCRA) noise estimation

Figure 2.4: Theory of MCRA algorithm. Flow chart from [10]

The algorithm works by recursively averaging a given frequency bin across adjacent processing blocks, thus smoothing out any random noise that may be present. The amount of smoothing is controlled by a parameter called the forgetting factor, which determines how much weight is given to older versus newer samples in the averaging process. The flow chart from Loizou, Figure 2.4, shows the process for noise estimation and spectral gain. The basic recursion can be expressed as:

yn = α(xn − min(xn−m, ..., xn+m)) + (1 − α)yn−1 (2.10)

where:
yn is the current output sample at time n
xn is the current input sample at time n
m is the half-length of the smoothing window
α is the forgetting factor, typically chosen close to 1 to favor more recent samples
min(xn−m, ..., xn+m) is the minimum value over the surrounding 2m + 1 samples
yn−1 is the previous output sample

The Minima Controlled Recursive Averaging (MCRA) noise estimation method proposed by Israel Cohen is a widely used approach for estimating the noise power spectral density (PSD) in non-stationary noise environments [9]. The method is based on the observation that the minimum value of the PSD is a good estimate of the noise PSD in the absence of speech or other signals of interest.
The MCRA algorithm estimates the noise PSD by recursively averaging the minimum PSD estimates over time and frequency. Let x(n) be the noisy speech signal at time instant n, and let X(k, n) be the discrete Fourier transform (DFT) of a frame of x(n) at frequency bin k. The noise PSD estimate V̂(k, n) at frequency bin k and time instant n is given by:

V̂(k, n) = αk · min(V(k, n − 1), |X(k, n)|²) + (1 − αk) · |X(k, n)|² (2.11)

where V(k, n − 1) is the estimated noise PSD at frequency bin k and the previous time instant n − 1, and αk is a smoothing factor that controls the rate of adaptation of the noise estimate. The smoothing factor αk is given by:

αk = α_low if V(k, n − 1) ≤ |X(k, n)|², and αk = α_high otherwise (2.12)

where α_low and α_high are the small and large smoothing factors, respectively. The idea behind the choice of αk is to adapt the noise estimate quickly when the input signal contains non-stationary components and to adapt slowly when the input signal is mainly noise. The MCRA algorithm estimates the noise PSD by applying Equation 2.11 recursively over time and frequency. The estimated noise PSD at time instant n and frequency bin k is then used to estimate the speech presence probability (SPP) at the same time and frequency, denoted as P̂(k, n), using the following equation:

P̂(k, n) = |X(k, n)|² / (|X(k, n)|² + V̂(k, n)) (2.13)

The estimated SPP is used in subsequent speech enhancement algorithms to suppress the noise and enhance the speech.

2.5 Musical noise
Apart from stationary noise and non-stationary noise, musical noise is a type of noise that occurs when noise reduction methods are applied. It appeared when some of the methods in this thesis were applied, so it is important to understand it. Musical noise refers to an undesirable artifact that can arise during the process of noise suppression.
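Before discussing artifacts further, the MCRA-style recursion of Section 2.4.3 can be made concrete with a single-bin simulation in the spirit of Equations 2.11-2.13. The smoothing constants and the branch assignment are illustrative choices (MCRA variants differ in how α_low/α_high are assigned), and the simulated bin powers are synthetic, not thesis data.

```python
import numpy as np

rng = np.random.default_rng(3)
frames = 500
# Simulated power trajectory of one STFT bin: noise floor plus a speech burst.
power = rng.exponential(1.0, frames)
power[200:215] += 100.0  # "speech" burst raising the bin power

# Minimum-controlled recursive noise-PSD estimate (Equation 2.11 form;
# constants and branch orientation chosen for illustration only).
a_slow, a_fast = 0.995, 0.9
v_hat = np.empty(frames)
v_hat[0] = power[0]
for n in range(1, frames):
    alpha = a_slow if v_hat[n - 1] <= power[n] else a_fast
    # When the power drops below the estimate, min() pulls the estimate
    # straight down to the new minimum; otherwise it creeps up slowly, so
    # short speech bursts barely contaminate the noise floor estimate.
    v_hat[n] = alpha * min(v_hat[n - 1], power[n]) + (1 - alpha) * power[n]

# Speech presence probability per frame (Equation 2.13 form):
spp = power / (power + v_hat)
```

During the burst the minimum-controlled estimate stays near the noise floor rather than chasing the speech power, so the SPP of Equation 2.13 rises close to one in those frames; this floor-tracking behavior is what makes MCRA usable in non-stationary noise.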
Instead of suppressing noise uniformly, noise suppression algorithms may inadvertently create tonal or musical-like artifacts in the output signal. These artifacts are perceived as unnatural and disruptive to the listening experience. The name "musical noise" derives from the fact that the resulting artifacts often resemble musical tones or whistling sounds. These tones may vary in frequency, intensity, and duration, leading to an unpleasant listening experience. Musical noise is primarily caused by excessive attenuation or over-adaptation in noise suppression algorithms. When noise is estimated and suppressed too aggressively, the adaptive filters used in these algorithms can start modeling the residual noise as part of the desired signal, creating musical-like artifacts. Several factors can contribute to the occurrence of musical noise, including:
• Over-adaptation: When the noise suppression algorithm adapts too quickly or overestimates the noise, it may start to distort the desired signal, resulting in musical noise.
• Insufficient regularization: In some cases, inadequate regularization of the adaptive filters can cause them to "overfit" the noise, generating musical artifacts.
• Non-stationary noise: Musical noise can be more pronounced in the presence of non-stationary noise, as its characteristics change over time and challenge the adaptability of the algorithms.
• Insufficient data: If the algorithm does not have sufficient data to accurately estimate the noise, it may produce inaccurate results and introduce musical noise.

2.6 Signal-to-Noise Ratio (SNR)

SNR is a metric that compares the power of the speech signal to the power of the noise signal. It is defined as:

\mathrm{SNR} = 10 \log_{10} \frac{P_{signal}}{P_{noise}} \quad (2.14)

where P_{signal} is the power of the speech signal and P_{noise} is the power of the noise signal. A higher SNR indicates a higher-quality speech signal.
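Equation (2.14) translates directly into code; as an illustration, power is taken here as the mean squared amplitude of each signal:

```python
import numpy as np

def snr_db(signal, noise):
    """SNR as in Equation (2.14): 10 * log10(P_signal / P_noise),
    with power computed as the mean squared amplitude."""
    p_signal = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)
```

For example, a signal of unit amplitude against noise of amplitude 0.1 gives a power ratio of 100, i.e. 20 dB.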
However, SNR is limited in its ability to accurately reflect speech quality, as it only measures the signal's power and does not consider the perceptual effects of noise on the speech signal.

3 Methods

Speech enhancement under non-stationary noise at low SNR is a complex situation in which traditional methods are of limited use. Therefore, several methods are implemented in this work: beamforming, noise cancellation combined with beamforming, and noise suppression combined with beamforming.

3.1 Beamforming

Figure 3.1: Flow chart of the beamforming algorithm

Delay-and-sum beamforming is used in this method. Supposing the sound comes from location 2, the algorithm flow is as shown in Figure 3.1. The 10 microphone positions follow the previous study, which captured the original signals while noise and speech were playing. The processing is frame-based: each original signal is cut into frames of 25600 samples with 50% overlap, which are then processed as buffers. A Hanning window and FFT are applied, and the buffer spectrum is then multiplied by the inverse frequency response function (FRF) of the current transfer path. This step performs the delay and sum. The inverse FRF is calculated from the measured impulse response from each source position to each microphone position, which contains the delay and phase information of each transfer path. After this step, an IFFT is performed on the filtered buffer spectra, which gives the 10 delayed channel buffers. The final step is to sum all ten channels to obtain the beamforming result.

3.2 Noise cancellation with beamforming

Figure 3.2: Flow chart of the noise cancellation with beamforming algorithm

The noise cancellation method is performed together with beamforming. This requires prior knowledge of the locations of the noise and speech sources. In this case, when the speech source is at position 2, some assumptions are made to apply the noise cancellation method. As Figure 3.2 shows, the 5 microphones on the right-hand side, marked in blue, are closer to the speech source and are considered to receive both speech and noise signals. The 5 microphones on the left-hand side are considered to receive only noise. The noise cancellation algorithm is applied to left-right pairs of microphones, using the FXLMS method. After processing, the 10 microphone channels become 5 cancelled channels. Beamforming is then performed on these cancelled signals, following the procedure of the section above.

3.3 Noise suppression with beamforming

Figure 3.3: Flow chart of the noise suppression with beamforming algorithm

Since the noise suppression method operates on a single channel, the combination of noise suppression and beamforming is implemented in 2 different ways, shown in Figure 3.3. The upper path first applies beamforming to the 10 microphone recordings; after obtaining the 1-channel beamforming result, the noise suppression algorithm is applied to obtain the combined result. The lower path first applies noise suppression to each of the 10 microphone recordings; after obtaining the 10 suppressed signals, beamforming is performed to obtain the combined result. These 2 processes introduce different effects, and the results are discussed in a later chapter.

4 Results

The aim of this chapter is to objectively evaluate the performance of the different speech enhancement methods and their limits. The beamforming method is established based on Tomoya's thesis, and the results include the SNR of the beamforming output for 6 different positions. For the combined methods, position 2 was chosen to present the results to avoid repetition.
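As an illustration of the delay-and-sum idea from Section 3.1, the sketch below aligns channels with known integer-sample propagation delays and averages them. This is a simplification and an assumption: the thesis applies measured inverse FRFs per frame in the frequency domain, which the integer delays here only approximate.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Minimal time-domain delay-and-sum sketch.
    channels       : (n_mics, n_samples) array of recordings
    delays_samples : per-mic arrival delay in samples (assumed known)
    """
    channels = np.asarray(channels, dtype=float)
    n_mics, n = channels.shape
    out = np.zeros(n)
    for ch, d in zip(channels, delays_samples):
        d = int(d)
        aligned = np.roll(ch, -d)   # advance to compensate the propagation delay
        if d > 0:
            aligned[n - d:] = 0.0   # zero the wrapped-around tail
        out += aligned
    return out / n_mics
```

When the assumed delays match the true arrival times, the target components add coherently while uncorrelated noise averages down, which is the SNR-improvement mechanism of the beamformer.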
After the SNR and spectrogram results, a robustness check is performed for each method to examine its limitations and stability.

4.1 Beamforming

The beamforming method is evaluated for 6 different target positions, shown in Figure 4.1. The results in this section present the performance of each beamforming target point under the same level of mixed signals (speech plus noise) played at each speaker position. A robustness check is performed using the results for target position 2.

Figure 4.1: Beamforming target points and microphone positions

4.1.1 Beamforming SNR

This section compares the SNR before and after beamforming. The mixed signals were played at each of the loudspeaker positions from position 1 to 6, and the 10 microphones were recorded for each set of measurements. The beamforming algorithm was applied to each set of recordings, and the SNR before and after beamforming was calculated. In Table 4.1, the first column lists the 6 positions. The second column gives the best single-microphone signal, i.e. the maximum SNR among the microphones, for each position. In this column, positions 1 and 2 have the highest SNR, and position 4 has the worst SNR among the unprocessed signals. All of the SNR values are very low (below -9 dB), which means that in these cases the noise is strong and the speech is badly corrupted. The obvious reason is that from positions 1 and 2 the sound sources are closer to the side microphones. Another reason, found from the recordings, is that it is noisier from the middle to the rear side of the machine, so a better SNR is obtained when the sound source is at position 1 or 2. The third column is the SNR after beamforming. The output SNR increases by around 2 dB to 3 dB at all positions. Among the 6 positions, position 2 reaches the highest SNR (-6.5 dB) after beamforming.
When calculating the SNR difference before and after beamforming, we find that the SNR at positions 2, 3, and 4 increased the most. They show differences of over 30%, with position 2 having the most significant difference at 32.3%, while position 5 has the least improvement at 11.1%. From the SNR results, position 2 shows the best beamforming performance and position 5 the worst.

Table 4.1: Comparison of SNR between the best single microphone and the beamforming output.

Position   Max. SNR (single mic) [dB]   BF SNR [dB]   Difference %
1          -9.6                         -7.4          22.1
2          -9.4                         -6.5          32.3
3          -13.2                        -9.0          31.6
4          -15.4                        -10.8         30.1
5          -15.0                        -13.4         11.1
6          -14.0                        -11.0         21.5

4.1.2 Robustness check by moving the source position

A robustness check of the beamforming method is important: since a fixed beamformer is used, slightly changing the source location will probably influence the beamforming performance. It is important to know the properties of the beamformer and in which cases the result is not stable. This section evaluates the beamformer by moving the source within different areas. From the SNR results in the previous section, position 2 shows the largest difference among all positions, so this beamformer is assumed to have the best performance, and the recording data from position 2 is chosen for the robustness check. To manipulate the source position, the recordings of this measurement are adjusted: the times of arrival from each new position are calculated, and different delays are applied to each of the signals to virtually change the source position according to Equation 2.3. The moving area is shown as the blue area in the plan drawing, and the SNR is plotted as a surface with a resolution of 11 points on both the x-axis and the y-axis, i.e. from -1 m to +1 m there are 11 equally distributed points.

Figure 4.2: Output SNR when the source is moving in the blue square (0.5 m shifting from center).
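The delay-based virtual repositioning described above can be sketched as follows. This is a simplified sketch: delays are rounded to whole samples, and the microphone/source coordinates, `fs`, and `c` are illustrative assumptions rather than the thesis implementation of Equation 2.3:

```python
import numpy as np

def retime_for_virtual_source(recordings, mic_pos, old_src, new_src,
                              fs=51200, c=343.0):
    """Re-delay each channel by the change in time of arrival when the
    source moves from old_src to new_src (integer-sample delays).
    recordings : (n_mics, n_samples) array
    mic_pos    : (n_mics, 3) microphone coordinates in meters
    """
    recordings = np.asarray(recordings, dtype=float)
    mic_pos = np.asarray(mic_pos, dtype=float)
    d_old = np.linalg.norm(mic_pos - old_src, axis=1)
    d_new = np.linalg.norm(mic_pos - new_src, axis=1)
    # Change in propagation time per microphone, in samples.
    shift = np.round((d_new - d_old) / c * fs).astype(int)
    out = np.zeros_like(recordings)
    for i, s in enumerate(shift):
        if s >= 0:
            out[i, s:] = recordings[i, :recordings.shape[1] - s]
        else:
            out[i, :s] = recordings[i, -s:]
    return out
```

Each channel is shifted by the difference between the new and old propagation times, so the array "sees" the source at the new position without re-recording.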
In the first case, the source moves within a 1 m × 1 m square centered on location 2, see Figure 4.2. From the surface plot on the right side, the highest SNR (-7.4 dB) is clearly at the (0,0) position, which is the optimum position for beamforming. Interestingly, when the source is slightly moved in the x-axis direction (within 0.5 m), the SNR decreases quickly to around -9 dB, whereas for a slight move along the y-axis the SNR decreases more slowly, with a minimum around -7.6 dB. This shows that the beamformer tolerates small source movements (within 0.5 m from the center) better in the y direction than in the x direction.

Figure 4.3: Output SNR when the source is moving in the blue square (1 m shifting from center).

In the second case, the source moves within a 2 m × 2 m square centered on location 2, see Figure 4.3. When the source moves in this larger area, the output SNR of the beamforming changes differently: a periodic feature is observed along both the x-axis and the y-axis. When the source moves away from the center, the SNR first decreases and then increases. Movement in the x direction appears to have a stronger influence than in the y direction.

Figure 4.4: Output SNR when the source is moving in the blue square (3 m shifting from center).

In the third case, the source moves within a 6 m × 6 m square centered on location 2, see Figure 4.4. Here the source moves across a large area, even reaching positions 1 and 3. The periodic feature is clear in the surface plot, and more peaks and dips appear. The output SNR fluctuates with a decreasing trend as the source moves away from the center.

4.2 Noise cancellation with beamforming

This section presents the results of the combined noise cancellation and beamforming method. The first part shows the spectrogram results.
The second part performs the robustness check of this method.

4.2.1 Spectrogram comparison

Figure 4.5: Spectrogram of the noise cancellation with beamforming algorithm

Figure 4.5 compares the results for clean speech, beamforming without cancellation, and beamforming with noise cancellation. The top graph is the spectrogram of the clean speech recording; it also contains some low-level background noise. In this case, this signal can be regarded as the desired signal. The middle graph shows the result after beamforming. Clearly, broadband noise remains in the result; especially below 4 kHz the noise level is high and most of the useful speech content is corrupted. The bottom graph is the result of combined noise cancellation and beamforming. The noise is strongly reduced in the middle frequencies, which improves the sound quality. Low-frequency noise remains in the result, as well as some tonal components and modulation noise. The speech is not clear after the processing, which can be seen by comparing the top and bottom graphs in the middle frequencies: the combined noise cancellation and beamforming method reduces both the speech and the noise levels. One reason is that the noise cancellation method requires a noise reference channel whose noise is correlated with the noise in the primary channel but uncorrelated with the speech. In this case, because of the low input SNR, the noisy speech channels still mainly receive noise, which makes it hard for the LMS process to find the optimum solution.

4.2.2 Robustness check by changing the input noise percentage

The robustness check is done by increasing the percentage of the input noise and running the method under different input signal conditions: for example, 10% input noise means a mixed signal composed of 10% of the noise signal amplitude plus 90% of the clean speech signal amplitude. In this way, situations from 5% noise (low noise level) to 95% noise (high noise level) are simulated.
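The mixing scheme used for the robustness check can be sketched as:

```python
import numpy as np

def mix_by_noise_percentage(speech, noise, pct):
    """Mix clean speech and noise by amplitude percentage:
    pct% of the noise amplitude plus (100 - pct)% of the speech amplitude."""
    w = pct / 100.0
    return w * np.asarray(noise, dtype=float) + (1 - w) * np.asarray(speech, dtype=float)

# Sweep from 5% to 95% noise in 5% steps, as in the robustness check.
def noise_sweep(speech, noise):
    return {p: mix_by_noise_percentage(speech, noise, p) for p in range(5, 100, 5)}
```

Running each algorithm on every mixture in the sweep yields the output-SNR-versus-noise-percentage curves discussed below.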
Figure 4.6: Robustness check for noise cancellation followed by beamforming.

In Figure 4.6, the upper graph shows the output SNR for different noise percentages. The x-axis is the percentage of input noise, increasing in 5% steps from 5% to 95%, and the y-axis is the SNR level in dB. The red line is the SNR of the Mic 5 recording. This is the best single-channel recording with the highest SNR and serves as a reference value: if only a microphone were used, without any post-processing, the red line would be the best achievable result. As the input noise percentage rises, the output SNR decreases, simply because the noise ratio is higher. The light blue line is the beamforming-only result. We can see an overall increase in SNR compared with the single-microphone result. More importantly, the higher the noise percentage in the input signal, the larger the SNR improvement that can be achieved; the beamforming method always increases the SNR. The result of the combined method is shown by the dark blue line. The SNR is relatively stable, slowly decreasing from around -1 dB to around -5 dB. Even so, the highest-SNR result does not sound good, because most of the speech information is cancelled. Above around 23% input noise, the combined method starts to give a better SNR than the single Mic 5 result, and above 55% input noise the combined BF and NC method starts to give better results than beamforming alone. In the lower graph of Figure 4.6, the input SNR is used as the x-axis and the y-axis is still the output SNR. Here, the input SNR is calculated from the Mic 5 recording; in other words, the Mic 5 SNR is considered the input SNR before any algorithm is applied. This is another way of presenting the same result as in the upper graph, and it shows the relationship between output SNR and input SNR.
The light blue and dark blue lines represent the beamforming-only SNR and the combined BF and NC SNR, as in the upper graph. From this graph, several facts can be observed. First, when the input SNR increases, the beamforming output SNR increases with a linear trend. Second, the combined method shows a slowly increasing trend with increasing input SNR, and when the input SNR is larger than 0 dB, the combined method settles at a stable SNR below 0 dB.

4.3 MCRA noise suppression with beamforming

This section presents the results of the combined MCRA noise suppression and beamforming method. One subsection shows the results of beamforming followed by noise suppression and of noise suppression followed by beamforming, evaluated with spectrograms and coherence. The other subsection analyses the robustness of the noise-suppression-then-beamforming method.

4.3.1 Spectrogram comparison and coherence

Figure 4.7: Spectrogram of the noise suppression with beamforming algorithm

Figure 4.7 compares the results of beamforming followed by MCRA noise suppression, MCRA alone applied to the single-channel Mic 5 recording, and MCRA noise suppression followed by beamforming. The top graph is the spectrogram of the beamforming-then-MCRA result. The broadband noise in the middle frequencies is reduced and the speech content is preserved, but the low-frequency noise remains. A new problem visible in this graph is musical noise: the spectrogram shows many small dots across a wide frequency range. These tones vary in frequency, intensity, and duration, leading to an unpleasant listening experience; several volunteers who listened to this output described it as "an underwater bubble sound". The middle graph shows the result of the single-channel MCRA noise suppression.
The noise is suppressed considerably, and most of the noise in the middle frequency range is below -40 dB. However, the musical noise problem is even larger and becomes severe when listening to the output signal; the frequency dots are clearer in the spectrogram. The bottom graph is the result of first applying noise suppression to the 10 recordings and then beamforming the 10 suppressed signals. The result is similar to the top one, reducing the broadband noise while maintaining the speech. The important improvement of this ordering is that noise suppression followed by beamforming removes most of the musical noise and improves the speech intelligibility. This can be observed both in the spectrogram and in the listening experience: fewer distinct dots are visible in the spectrogram, replaced by a blurred, low-level noise floor. When listening to this result, most people comment that there is little bubble noise and rate this result the best of all. The musical noise problem appears only with these methods, which means it is introduced by the noise suppression process. Several causes of musical noise were mentioned in the theory chapter: over-adaptation, insufficient regularization, non-stationary noise, and insufficient data. In this scenario, beamforming provides more channel recordings and aligns them, yielding more useful data, which may increase the performance of the combined method.

Figure 4.8: Coherence between clean speech and the results of the different methods

To further evaluate the performance of the different results, the coherence between the processed signals of the different methods and the clean speech signal is plotted, see Figure 4.8. The aim of speech enhancement is to recover clean speech, so the coherence between the clean speech and a processed signal can be seen as an indicator of performance.
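Coherence curves like those in Figure 4.8 can be computed with Welch-based magnitude-squared coherence. The signals below are synthetic stand-ins (a sinusoid plus noise), not the thesis recordings, and the parameter values are illustrative assumptions:

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(0)
fs = 51200                                   # sampling rate used in the measurements
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 1000 * t)         # stand-in for a clean speech signal
processed = clean + 0.3 * rng.standard_normal(fs)  # stand-in for a processed output

# Magnitude-squared coherence between the clean reference and the output;
# values near 1 mean the processing preserved that frequency band.
f, Cxy = coherence(clean, processed, fs=fs, nperseg=4096)
```

In this toy example the coherence is close to 1 at the 1 kHz tone and low elsewhere, mirroring how the dip between 1800 Hz and 2500 Hz in Figure 4.8 flags frequency bands where residual noise dominates.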
From the figure, we can see that the MCRA-then-BF method (green line) is the best of the 3 methods. A large dip between around 1800 Hz and 2500 Hz appears in all 3 coherence results; this may correspond to noise frequencies that remain in all 3 results.

4.3.2 Robustness check by changing the input noise percentage

The robustness check for this method is done in the same way as for the BF and NC methods, by increasing the percentage of input noise and running the method under different input signal conditions. The input SNR is again used for plotting the results.

Figure 4.9: Robustness check for first noise suppression then beamforming.

In Figure 4.9, the upper graph shows the output SNR for different noise percentages. The x-axis is the percentage of input noise, increasing in 5% steps from 5% to 95%, and the y-axis is the SNR level in dB. The red line and the light blue line are the same as in Figure 4.6: the red line is the best single-microphone SNR and the light blue line is the beamforming-only result. The dark blue line is now the result of noise suppression followed by beamforming. The SNR is relatively stable and has an upside-down U shape: it first increases slowly from -4 dB to -2 dB and then decreases from around -2 dB to around -5 dB. When the input SNR is low, the combined result is not good in subjective listening. Above around 27% input noise, the SNR starts to be better than the single Mic 5 result, and above 55% input noise the combined BF and NS method starts to give better results than beamforming alone. In the lower graph of Figure 4.9, the input SNR is used as the x-axis and the y-axis is still the output SNR. This graph shows the relationship between output SNR and input SNR. The light blue and dark blue lines represent the beamforming-only SNR and the combined BF and NS SNR, as in the upper graph.
From this graph, the combined method shows a slowly increasing trend for input SNR below -5 dB; when the input SNR is larger than -5 dB, the combined method yields a lower SNR.

4.4 Informal listening for the different methods

Apart from the objective data and spectrograms, informal listening was performed to describe the perception of the different results. My supervisor Nicklas and I wrote these evaluations after listening to the different results, and the comments were agreed upon by another 3 people in the Volvo CE team. The results are shown in Table 4.2.

Table 4.2: Subjective evaluation of the different methods.

Signal                Quality   Intelligibility   Comment
Mic 4                 Bad       Bad               Low SNR with harsh noise
Only NS for Mic 4     Good      Bad               Noise reduced, output not clear
Only BF for all mics  Bad       Bad               Modulation noise left
NC then BF            Medium    Medium            Good quality but speech is unclear
First BF then NS      Good      Good              Clear but with musical noise
First NS then BF      Good      Good              Clear without musical noise

5 Conclusion

5.1 Discussions

In this thesis, several methods are implemented and analyzed. Depending on the properties of, and prior knowledge about, the noise and speech signals, the methods perform differently. In short, beamforming is a way to generally improve SNR using multiple microphones, and it requires knowledge of the location of the target source. Noise cancellation also requires location information, i.e. the noise source should be nearer to the noise microphone(s) than to the noisy speech source. Noise suppression differs in that it does not require prior knowledge of the noise source; it estimates the noise with strategies such as statistics. Regarding the results of the beamforming method, we can see an overall improvement of the SNR at all positions. Among all positions, position 2 has the best SNR and the largest increase before and after beamforming.
In the robustness check, when moving the source within a small area, the SNR decreases more slowly in the y direction than in the x direction; when moving the source over a larger area, the SNR fluctuates with a periodic pattern. This characterizes the beamforming process: a slight move along the y-axis is acceptable if the distance is within 0.5 m, while displacement along the x-axis influences the result considerably. The combined beamforming and noise cancellation method can remove most of the mid-frequency noise and improve the sound quality. The robustness check shows that its SNR is better than that of beamforming alone when the input SNR is low. The problem is that a lot of speech information is lost during the processing. The combined beamforming and noise suppression method is the best of all. When BF is applied first and then MCRA, the noise is reduced considerably and the speech signal is much clearer than with beamforming alone, but this ordering introduces musical noise, which listeners find uncomfortable. Applying MCRA first to all 10 channel recordings and then performing beamforming gives a better result: although some low-level noise is added, better intelligibility is achieved, with less engine noise and less musical noise. Most listeners also rated this result as the best. The robustness check of this method shows that it, too, is better than beamforming alone when the input SNR is low. A meaningful conclusion is that applying beamforming to multiple noise suppression outputs reduces musical noise.

5.2 Limitations

There are limitations in this thesis that restrain the performance of the different methods. For beamforming, a more concentrated microphone array would be preferable to the current setup, in which the 10 microphones are far away from each other.
This makes beamforming more difficult to perform and reduces the accuracy of the result. The combined beamforming and noise cancellation method requires that the noise microphones are close to the noise source and the noisy speech microphones are close to the speaker. This was not achieved properly in the application: the 2 groups of microphones receive largely the same noise and speech signals, so the output of this method is not good. The combined beamforming and noise suppression method gives the best results of all, but it is hard to estimate the rapidly changing noise, which is why musical noise remains when using MCRA.

5.3 Real-world applications

This thesis is meaningful for some real-world applications. A combined system can be designed for the machine cabin, with microphone arrays and several different methods chosen according to the noise and speech situation. There could be 3 buttons on the control panel: the driver can turn on beamforming when seeing a speaker in a certain direction outside the cabin. For example, as Figure 5.1 shows, if the speaker is at position 2, the operator can turn on beamforming at position 2 to capture the speech. If the input SNR is low, the operator can enable the NC or NS function for further speech enhancement.

Figure 5.1: Control panel for the application, with buttons for beamforming positions 1-6 and for the NC and NS functions.

Bibliography

[1] Gustaver, M. (2020). A Chalmers University of Technology Master's thesis template for LaTeX. Unpublished.
[2] Van Veen, B. D. and Buckley, K. M. (1988). Beamforming: A versatile approach to spatial filtering.
[3] Benesty, J., Chen, J., and Huang, Y. (2009). Microphone Array Signal Processing. Springer Science and Business Media.
[4] Ephraim, Y. and Malah, D. (1984). Speech enhancement using a minimum mean-square error log-spectral amplitude estimator.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109-1121.
[5] Oppenheim, A. V., Willsky, A. S., and Nawab, S. H. Signals and Systems.
[6] Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications. MIT Press.
[7] Hu, Y. and Loizou, P. (2006). Evaluation of objective measures for speech enhancement.
[8] Rix, A. W. and Beerends, J. G. (2002). Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs.
[9] Cohen, I. and Berdugo, B. (2001). Speech enhancement for non-stationary noise environments. Signal Processing, 81, 2403-2418.
[10] Loizou, P. C. (2013). Speech Enhancement: Theory and Practice, Second Edition. CRC Press.
[11] Greensted, A. (2012). Delay Sum Beamforming. URL: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html.
[12] Giron-Sierra, J. M. (2017). Digital Signal Processing with Matlab Examples, Volume 2: Decomposition, Recovery, Data-Based Actions.
[13] Van den Bogaert, T. and Doclo, S. (2017). Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids.

A Appendix 1

FRF and Speech Signal Measurement Setup

Four measurements are conducted: the FRF between white noise (from the monopole source) and the microphones, the time signal between pink noise (GENELEC) and the microphones, the speech time signal, and machine load recordings. There are 6 measurement locations (on one side of the machine) and 10 microphone inputs plus 1 direct signal input.

1.
FRF Measurement Method (perfect case scenario)
Purpose: obtain the best-case transfer function between the microphones on the machine and the monopole source.
Excitation: white noise from the monopole source.
Outputs: FRF and cross correlation.
Number of averages for the FRF: 30.
Machine: OFF.
Post-processing: use the transfer function to create a filter for the mixed signal, and to correct the listening response, since the listening response should be flat.

2. Pink Noise Measurement Method
Purpose: working material for beamforming and other filters as seen fit.
Excitation: pink noise from the GENELEC speaker.
Outputs: coherence, time series, and cross correlation. Only one location is measured at a time.
Machine: OFF.
Post-processing: none.

3. Speech Signal Measurement Method (point of comparison)
Purpose: retrieve a speech signal to be used as a point of comparison between beamformed and un-beamformed signals.
Excitation: prerecorded speech from the GENELEC speaker.
Output: time series.
Machine: OFF.
Post-processing: conduct beamforming on the speech signal and evaluate the performance.

4. Machine Load Measurement Method
Purpose: record noise.
Excitation: machine engine at the various load states below:
• Low idle (lowest RPM)
• High idle (highest RPM)
• 1400 rpm idle
• Low idle + high hydraulic pressure (lowest RPM + highest pressure in the hydraulic pump)
Output: time series.
Machine: various load states.
Post-processing: mix with the speech recording.
Equipment
• 10 microphones + 1 mic for the source
• SCADAS Mobile (MOB 2)
• Power supply for MOB 2
• Extension cables for power to the PC and MOB 2
• LMS monopole source speaker
• Monopole source power supply
• GENELEC speaker 1029A
• Sticking tape
• Spray can
• Measuring tape
• Monopole/speaker stand
• Cables (see below for more detail)
• Computer
• Computer power
• L70H

Cables:
• Short BNC to 3.5 mm (BNC T to PC)
• 8 m XLR to BNC (BNC T to GENELEC)
• BNC T connector (to split the source signal)
• BNC female to female (adapter for the T connector)
• 1 m BNC to BNC (MOB 2 to BNC FF)
• 8 m BNC to BNC (MOB 2 to monopole source)
• 8x 3 or 4 m BNC to BNC (MOB 2 to cab mics)
• 2x 5 m BNC to BNC (MOB 2 to rear mics)
• 5 m BNC to BNC (MOB 2 to monopole amp)
• LAN cable (connecting MOB 2 to the PC)

Excitation
Measurement 1: white noise, 200 Hz - 10000 Hz (enough for 30 averages)
Measurement 2: pink noise at 1.6 m height (enough for 30 averages)
Measurement 3: 30 seconds of the same prerecorded speech
Measurement 4: 30 seconds of machine load states

Location/Date
Outside on the test track, 2022-03-22
Spectral Testing Setup
Frequency range: 0 - 10024 Hz
Frequency resolution: 2 Hz
Sampling rate: 51200 Hz