
HK1159300B - Apparatus and method for processing an audio signal for speech enhancement using a feature extraction - Google Patents

Apparatus and method for processing an audio signal for speech enhancement using a feature extraction

Info

Publication number
HK1159300B
HK1159300B (application number HK11113430.8A)
Authority
HK
Hong Kong
Prior art keywords
feature
spectral
control information
audio signal
band
Prior art date
Application number
HK11113430.8A
Other languages
Chinese (zh)
Other versions
HK1159300A1 (en)
Inventor
Christian Uhle
Oliver Hellmuth
Bernhard Grill
Falko Ridderbusch
Original Assignee
Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.
Priority date
Filing date
Publication date
Priority claimed from EP08017124.2A external-priority patent/EP2151822B8/en
Application filed by Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. filed Critical Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.
Publication of HK1159300A1 publication Critical patent/HK1159300A1/en
Publication of HK1159300B publication Critical patent/HK1159300B/en


Abstract

An apparatus for processing an audio signal to obtain control information for a speech enhancement filter (12) comprises a feature extractor (14) for extracting at least one feature per frequency band of a plurality of frequency bands of a short-time spectral representation of a plurality of short-time spectral representations, where the at least one feature represents a spectral shape of the short-time spectral representation in the frequency band. The apparatus additionally comprises a feature combiner (15) for combining the at least one feature for each frequency band using combination parameters to obtain the control information for the speech enhancement filter for a time portion of the audio signal. The feature combiner can use a neural network regression method, which is based on combination parameters determined in a training phase for the neural network.

Description

Method and apparatus for processing audio signals for speech enhancement using feature extraction
Technical Field
The present invention relates to the field of audio signal processing and, in particular, to the field of speech enhancement of audio signals, so that the processed signal has speech content with improved objective or subjective intelligibility.
Background of the invention and Prior Art
Speech enhancement is used in many applications. An important one is digital signal processing in hearing aids, which provides new and efficient means for the rehabilitation of hearing impairment. Besides a higher acoustic signal quality, digital hearing aids allow for specific speech processing strategies. For some of these strategies, an estimate of the speech-to-noise ratio (SNR) of the acoustic environment is desirable. In particular, consider applications in which complex algorithms for speech processing are optimized for particular acoustic environments, but such algorithms may fail in scenarios that do not meet the specific assumptions. This holds, for example, for noise reduction schemes that may introduce processing artifacts in quiet environments or in scenarios where the SNR is below a certain threshold. The optimal choice of parameters for compression algorithms and amplification may depend on the speech-to-noise ratio, so that an adaptation of the parameter set based on the SNR estimate improves the benefit. Furthermore, the SNR estimate can be directly employed as a control parameter for noise reduction schemes, e.g. Wiener filtering or spectral subtraction.
Further applications are in the field of speech enhancement of movie sound. It has been found that many people have problems understanding the speech content of a movie, for example due to hearing impairment. In order to follow the plot of a movie, it is important to understand the relevant speech of the soundtrack, e.g., monologues, dialogs, announcements and narrations. People with hearing difficulties often experience that background sounds, such as ambient noise and music, are presented at a level that is too high relative to the speech. In this case, it is desirable to increase the level of the speech signal and to attenuate the background sounds or, in general, to increase the level of the speech signal relative to the total level.
A prevalent method of speech enhancement is spectral weighting, also referred to as short-term spectral attenuation, as illustrated in fig. 3. The output signal y[k] is computed by attenuating the sub-band signals X(ω) of the input signal x[k] depending on the noise energy within the sub-band signals.
In the following, it is assumed that the input signal x[k] is an additive mixture of the desired speech signal s[k] and the background noise b[k]:
x[k]=s[k]+b[k] (1)
Speech enhancement is an improvement of the objective intelligibility and/or subjective quality of speech.
The frequency-domain representation of the input signal is computed by means of a short-time Fourier transform (STFT), another time-frequency transform or a filter bank, as indicated by reference numeral 30. The input signal is then filtered in the frequency domain according to equation 2, where the frequency response G(ω) of the filter is computed such that the noise energy is reduced. The output signal is computed by means of the inverse time-frequency transform or the synthesis filter bank, respectively.
Y(ω)=G(ω)X(ω) (2)
A suitable spectral weight G(ω) for each spectral value is calculated at 31 using the input signal spectrum X(ω) and an estimate of the noise spectrum B̂(ω) or, similarly, using an estimate of the linear sub-band SNR R̂(ω).
The weighted spectral values are transformed back into the time domain at reference numeral 32. Prominent examples of noise suppression rules are spectral subtraction [S. Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979] and Wiener filtering. Assuming that the input signal is an additive mixture of the speech and the noise signal and that speech and noise are uncorrelated, the gain values for the spectral subtraction method are given by equation 3:

G(ω) = (1 − |B̂(ω)|² / |X(ω)|²)^(1/2) (3)
Similar weights are obtained from an estimate of the linear sub-band SNR R̂(ω) according to equation 4:

G(ω) = (R̂(ω) / (R̂(ω) + 1))^(1/2) (4)
Various extensions of spectral subtraction have been proposed in the past, namely the use of an oversubtraction factor and a spectral floor parameter [M. Berouti, R. Schwartz, J. Makhoul, "Enhancement of speech corrupted by acoustic noise", Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 1979], generalized forms [J. Lim, A. Oppenheim, "Enhancement and bandwidth compression of noisy speech", Proc. of the IEEE, vol. 67, no. 12, pp. 1586-1604, 1979], the use of perceptual criteria (e.g. N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system", IEEE Trans. on Speech and Audio Processing, vol. 7, no. 2, pp. 126-137, 1999) and multi-band spectral subtraction (e.g. S. Kamath, P. Loizou, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise", Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 2002). However, the crucial part of a spectral weighting method is the estimation of the instantaneous noise spectrum or of the sub-band SNR, which is prone to errors if the noise is non-stationary. Errors in the noise estimation lead to residual noise, distortions of the speech components or musical noise (an artifact described as "warbling with tonal quality" [P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007]).
A simple way of performing the noise estimation is to measure and average the noise spectrum during speech pauses. This approach does not yield satisfactory results if the noise spectrum varies over time during speech activity or if the detection of the speech pauses fails. Methods for estimating the noise spectrum even during speech activity have been proposed in the past and can, according to P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007, be divided into:
● minima tracking algorithms
● time-recursive averaging algorithms
● histogram-based algorithms
Noise spectrum estimation based on minimum statistics has been proposed in R. Martin, "Spectral subtraction based on minimum statistics", Proc. of EUSIPCO, Edinburgh, UK, 1994. The method is based on tracking the local minima of the signal energy in each sub-band. A nonlinear update rule for the noise estimate and faster updating are proposed in G. Doblinger, "Computationally Efficient Speech Enhancement By Spectral Minima Tracking In Subbands", Proc. of Eurospeech, Madrid, Spain, 1995.
Time-recursive averaging algorithms estimate and update the noise spectrum whenever the estimated SNR in a particular frequency band is low. This is done by recursively computing a weighted average of past noise estimates and the current spectrum. The weights are determined as a function of the probability of speech presence or as a function of the estimated SNR in the particular frequency band, for example in I. Cohen, "Noise estimation by minima controlled recursive averaging for robust speech enhancement", IEEE Signal Proc. Letters, vol. 9, no. 1, pp. 12-15, 2002, and in L. Lin, W. Holmes, E. Ambikairajah, "Adaptive noise estimation algorithm for speech enhancement", Electronics Letters, vol. 39, no. 9, pp. 754-755, 2003.
Histogram-based approaches rely on the assumption that the histogram of the sub-band energy is often bimodal. The low-energy mode accumulates energy values of segments without speech or with low speech energy. The high-energy mode accumulates energy values of segments with voiced speech and noise. The noise energy of a particular sub-band is determined from the low-energy mode [H. Hirsch, C. Ehrlicher, "Noise estimation techniques for robust speech recognition", Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Detroit, USA, 1995]. For a recent comprehensive overview, reference is made to P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
Methods for sub-band SNR estimation based on supervised learning using amplitude modulation features are described in J. Tchorz, B. Kollmeier, "SNR estimation based on amplitude modulation analysis with applications to noise suppression", IEEE Trans. on Speech and Audio Processing, vol. 11, no. 3, pp. 184-192, 2003, and in M. Kleinschmidt, V. Hohmann, "Sub-band SNR estimation using auditory feature processing", Speech Communication: Special Issue on Speech Processing for Hearing Aids, vol. 39, pp. 47-64, 2003.
Other approaches to speech enhancement are pitch-synchronous filtering (e.g. R. Frazier, S. Samsam, L. Braida, A. Oppenheim, "Enhancement of speech by adaptive filtering", Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Philadelphia, USA, 1976), filtering of the spectro-temporal modulation (STM) (e.g. N. Mesgarani, S. Shamma, "Speech enhancement based on filtering the spectro-temporal modulations", Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Philadelphia, USA, 2005), and filtering based on a sinusoidal model representation of the input signal (e.g. J. Jensen, J. Hansen, "Speech enhancement using a constrained iterative sinusoidal model", IEEE Trans. on Speech and Audio Processing, vol. 9, no. 7, pp. 731-740, 2001).
The methods for sub-band SNR estimation based on supervised learning with amplitude modulation features reported in J. Tchorz, B. Kollmeier, "SNR estimation based on amplitude modulation analysis with applications to noise suppression", IEEE Trans. on Speech and Audio Processing, vol. 11, no. 3, pp. 184-192, 2003, and in M. Kleinschmidt, V. Hohmann, "Sub-band SNR estimation using auditory feature processing", Speech Communication: Special Issue on Speech Processing for Hearing Aids, vol. 39, pp. 47-64, 2003, have the disadvantage of requiring two spectrogram processing steps. The first spectrogram processing step generates a time/frequency spectrogram of the time-domain audio signal. Then, in order to generate the modulation spectrogram, another "time/frequency" transform is required, which transforms the spectral information from the spectral domain into the modulation domain. This additional transform operation introduces problems due to the inherent system delay and the time/frequency resolution trade-off inherent to any transform algorithm.
A further consequence of this procedure is that the noise estimate becomes inaccurate in scenarios where the noise is non-stationary and varying noise signals occur.
Disclosure of Invention
It is an object of the invention to provide an improved concept for speech enhancement.
According to a first aspect, this object is achieved by an apparatus for processing an audio signal to obtain control information for a speech enhancement filter, the apparatus comprising: a feature extractor for obtaining a time sequence of short-time spectral representations of the audio signal and for extracting at least one feature in each of a plurality of frequency bands for a plurality of short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; and a feature combiner for combining the at least one feature for each frequency band using combination parameters to obtain the control information for the speech enhancement filter for a time portion of the audio signal.
According to a second aspect, this object is achieved by a method of processing an audio signal to obtain control information for a speech enhancement filter, the method comprising: obtaining a time sequence of short-time spectral representations of the audio signal; extracting at least one feature in each of a plurality of frequency bands for a plurality of short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; and combining the at least one feature for each frequency band using combination parameters to obtain the control information for the speech enhancement filter for a time portion of the audio signal.
According to a third aspect, the object is achieved by an apparatus for speech enhancement in an audio signal, comprising: means for processing the audio signal to obtain filter control information for a plurality of bands representing time portions of the audio signal; and a controllable filter, the filter being controllable so that a band of the audio signal is variably attenuated with regard to a different band based on the control information.
According to a fourth aspect, the object is achieved by a method of speech enhancement in an audio signal, the method comprising: processing the audio signal to obtain filter control information for a plurality of bands representing time portions of the audio signal; and controlling a filter so that a band of the audio signal is variably attenuated with regard to a different band based on the control information.
According to a fifth aspect, this object is achieved by an apparatus for training a feature combiner for determining combination parameters of the feature combiner, the apparatus comprising: a feature extractor for obtaining a time sequence of short-time spectral representations of a training audio signal, for which the control information for the speech enhancement filter per frequency band is known, and for extracting at least one feature in each of a plurality of frequency bands for a plurality of short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; and an optimization controller for providing the at least one feature for each frequency band to the feature combiner, for calculating the control information using intermediate combination parameters, for varying the intermediate combination parameters, for comparing the varied control information with the known control information, and for updating the intermediate combination parameters when the varied intermediate combination parameters result in control information better matching the known control information.
According to a sixth aspect, this object is achieved by a method of training a feature combiner for determining combination parameters of the feature combiner, the method comprising: obtaining a time sequence of short-time spectral representations of a training audio signal, for which the control information for the speech enhancement filter per frequency band is known; extracting at least one feature in each of a plurality of frequency bands for a plurality of short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; providing the at least one feature for each frequency band to the feature combiner; calculating the control information using intermediate combination parameters; varying the intermediate combination parameters; comparing the varied control information with the known control information; and updating the intermediate combination parameters when the varied intermediate combination parameters result in control information better matching the known control information.
According to a seventh aspect, this object is achieved by a computer program for performing any of the inventive methods when run on a computer.
The present invention is based on the finding that band-wise information on the spectral shape of an audio signal, i.e., the spectral shape within a particular frequency band, is a very useful parameter for determining the control information for a speech enhancement filter. In particular, spectral shape features determined band-wise for a plurality of frequency bands and for a plurality of subsequent short-time spectral representations provide a useful characterization of the audio signal for a speech enhancement processing. Specifically, a set of spectral shape features, where each spectral shape feature is associated with one of a plurality of spectral bands, such as Bark bands or, in general, bands with a bandwidth that varies over the frequency range, already provides a useful feature set for determining the signal-to-noise ratio per band. To this end, the spectral shape features for the plurality of bands are processed by a feature combiner, which combines the features using combination parameters to obtain, per band, the control information for the speech enhancement filter for a time portion of the audio signal. Preferably, the feature combiner comprises a neural network, which is controlled by the combination parameters, where the combination parameters are determined in a training phase performed before the actual speech enhancement filtering. Specifically, the neural network performs a neural network regression method. A particular advantage is that the combination parameters can be determined in the training phase using audio material that may differ from the audio material actually to be speech-enhanced, so that the training phase has to be performed only once; after this training phase, the combination parameters are fixed and can be applied to any unknown audio signal containing speech whose characteristics are comparable to those of the training signals. Such speech characteristics may, for example, be a language or a group of languages, such as European versus Asian languages.
Preferably, the inventive concept estimates the noise by learning characteristics of speech using feature extraction and a neural network. Since the inventively extracted features are straightforward low-level spectral features, which can be extracted in an efficient and simple manner and, importantly, without a large system-inherent delay, the inventive concept is particularly useful for providing an accurate noise or SNR estimate even in scenarios where the noise is non-stationary and varying noise signals occur.
Drawings
The following detailed description of preferred embodiments of the invention refers to the accompanying drawings, in which:
FIG. 1 is a block diagram of a preferred apparatus or method for processing an audio signal;
FIG. 2 is a block diagram of an apparatus or method for training a feature combiner in accordance with a preferred embodiment of the present invention;
FIG. 3 is a block diagram illustrating a speech enhancement apparatus and method in accordance with a preferred embodiment of the present invention;
FIG. 4 shows an overview of the steps for training the feature combiner and for the neural network regression using the optimized combination parameters;
FIG. 5 is a plot showing the gain factor as a function of the SNR, where the applied gain (solid line) is compared to the spectral subtraction gain (dotted line) and the Wiener filter (dashed line);
FIG. 6 is an overview of the per-band features and of preferred additional full-bandwidth features;
FIG. 7 is a flow diagram illustrating a preferred embodiment of the feature extractor;
FIG. 8 is a flow chart illustrating the calculation of the gain factor for each frequency value and the subsequent calculation of the speech-enhanced signal portion;
FIG. 9 shows an example of the spectral weighting, showing the input time signal, the estimated sub-band SNR, the estimated SNR in the frequency bins after interpolation, the spectral weights and the processed time signal; and
FIG. 10 is a schematic block diagram of a preferred embodiment of the feature combiner implemented as a multi-layer neural network.
Detailed description of the preferred embodiments
Fig. 1 shows a preferred apparatus for processing an audio signal 10 to obtain control information 11 for a speech enhancement filter 12. The speech enhancement filter can be implemented in several ways, for example as a controllable filter for filtering the audio signal 10, using the control information for each of a plurality of frequency bands, to obtain a speech-enhanced audio output signal 13. As shown subsequently, the controllable filter can also be implemented as a time/frequency conversion, where individually calculated gain factors are applied to the spectral values or spectral bands, followed by a frequency/time conversion.
The apparatus in fig. 1 comprises a feature extractor 14 for obtaining a time sequence of short-time spectral representations of the audio signal and for extracting at least one feature in each of a plurality of frequency bands for a plurality of short-time spectral representations, where the at least one feature represents the spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands. In addition, the feature extractor 14 may be implemented to extract further features apart from the spectral shape features. At the output of the feature extractor 14, several features per audio short-time spectrum are available, comprising at least one spectral shape feature for each of a plurality of frequency bands, of which there are at least 10 and preferably more, e.g., 20 to 30. These features can be used as they are, or they can be processed by averaging or another operation, such as the geometric mean, the arithmetic mean, the median or another statistical moment (e.g., variance or skewness), in order to obtain, per band, raw or averaged features, all of which are input into the feature combiner 15. The feature combiner 15 combines the plurality of spectral shape features and, preferably, additional features using combination parameters, which may be provided via a combination parameter input 16 or which may be hardwired or hard-programmed into the feature combiner 15, in which case the combination parameter input 16 is not required. At the output of the feature combiner, the control information for the speech enhancement filter for each band or "sub-band" for a time portion of the audio signal is obtained.
Preferably, the feature combiner 15 is implemented as a neural network regression circuit, but the feature combiner can also be implemented as any other numerically or statistically controlled feature combiner that applies a combination operation to the features output by the feature extractor 14, so that, in the end, the required control information, such as a band-wise SNR value or a band-wise gain factor, is obtained. In the preferred neural network implementation, a training phase is required ("training phase" means the phase in which learning from examples is performed). In this training phase, the apparatus for training the feature combiner 15 shown in fig. 2 is used. In particular, fig. 2 shows this apparatus for training the feature combiner in order to determine the combination parameters of the feature combiner. To this end, the apparatus in fig. 2 comprises a feature extractor 14, which is preferably implemented in the same way as the feature extractor 14 of fig. 1. Likewise, the feature combiner 15 is implemented in the same way as the feature combiner 15 of fig. 1.
In addition to the elements of fig. 1, the apparatus of fig. 2 comprises an optimization controller 20, which receives, as an input, the known control information for the training audio signal, as indicated by reference numeral 21. The training phase is performed on known training audio signals having a known speech/noise ratio in each band. For example, the speech portion and the noise portion are provided separately from each other, and the actual SNR per band is measured on the fly, i.e., during the learning operation. In particular, the optimization controller 20 is operative to provide the features from the feature extractor 14 to the feature combiner 15. Then, based on these features and on intermediate combination parameters from a previous iteration run, the feature combiner 15 calculates the control information 11. This control information 11 is forwarded to the optimization controller 20 and compared there with the control information 21 for the training audio signal. The intermediate combination parameters are varied in response to an instruction from the optimization controller 20, and a further set of control information is calculated by the feature combiner 15 using the varied combination parameters. When the further control information better matches the control information 21 for the training audio signal, the optimization controller 20 updates the combination parameters and sends these updated combination parameters 16 to the feature combiner, where they are used as the intermediate combination parameters in the next run. Alternatively or additionally, the updated combination parameters can be stored in a memory for later use.
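The following minimal sketch illustrates this iterate-compare-update loop in Python. It is not the patent's implementation: a plain linear combiner stands in for the neural network, the parameters are perturbed randomly instead of being trained with a gradient method, and all array shapes and helper names are assumptions chosen for illustration.

```python
import numpy as np

def train_combiner(features, reference_snr, n_iter=100, step=0.01, seed=0):
    """features: (n_frames, n_features); reference_snr: (n_frames, n_bands).

    Returns combination parameters that map features to per-band control
    information (here: sub-band SNR values)."""
    rng = np.random.default_rng(seed)
    n_feat, n_bands = features.shape[1], reference_snr.shape[1]
    w = rng.normal(scale=0.1, size=(n_feat, n_bands))  # intermediate combination parameters

    def control_info(params):
        return features @ params      # linear combiner as stand-in for the neural network

    best_err = np.mean((control_info(w) - reference_snr) ** 2)
    for _ in range(n_iter):
        w_new = w + step * rng.normal(size=w.shape)    # vary the intermediate parameters
        err = np.mean((control_info(w_new) - reference_snr) ** 2)
        if err < best_err:                             # update only on a better match
            w, best_err = w_new, err
    return w
```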
Fig. 4 shows an overview of the spectral weighting processing using feature extraction and a neural network regression method. During the training phase, indicated on the left-hand side of fig. 4, the parameters w of the neural network are computed from the features of the training items x_t[k] using the reference sub-band SNR values R_t. The noise estimation and the speech enhancement filtering are indicated on the right-hand side of fig. 4.
The proposed concept follows the spectral weighting approach and uses a novel method for the computation of the spectral weights. The noise estimation is based on a supervised learning method and employs the feature set of the present invention. The features aim at the discrimination of tonal versus noisy signal components. Additionally, the proposed features capture the evolution of the signal characteristics on a larger time scale.
The noise estimation method presented here is able to deal with a large variety of non-stationary background sounds. A robust SNR estimation in non-stationary background noise is obtained by means of the feature extraction and a neural network regression method, as shown in fig. 4. The actual spectral weights are computed from estimates of the SNR in frequency bands whose spacing approximates the Bark scale. The spectral resolution of the SNR estimate is coarse, which enables the measurement of the spectral shape within a band.
The left-hand side of fig. 4 corresponds to the training phase, which basically has to be performed only once. The processing on the left-hand side of fig. 4, indicated as training 41, includes the reference SNR calculation block 21, which generates the control information 21 for the training audio signal input into the optimization controller 20 of fig. 2. The feature extractor 14 on the training side of fig. 4 corresponds to the feature extractor 14 of fig. 2. In particular, fig. 2 has been described as receiving a training audio signal consisting of a speech portion and a background portion. To enable a useful reference, the background signal b_t and the speech signal s_t are each available separately from each other, and both are added by an adder 43 before being input into the feature extractor 14. Thus, the output of the adder 43 corresponds to the training audio signal input into the feature extractor 14 of fig. 2.
The neural network trainer indicated at 15, 20 corresponds to blocks 15 and 20 connected as indicated in fig. 2, or to a similar implementation, and results in a set of combination parameters w, which can be stored in the memory 40. These combination parameters are then used in the neural network regressor 15, which corresponds to the feature combiner 15 of fig. 1, when the inventive concept is applied as shown at 42 in fig. 4. The spectral weighter 12 of fig. 4 corresponds to the controllable filter 12 of fig. 1, and the feature extractor 14 on the right-hand side of fig. 4 corresponds to the feature extractor 14 of fig. 1.
A simple implementation of the proposed concept will be discussed in detail below. The feature extractor 14 in fig. 4 operates as follows.
A set of different features has been investigated in order to identify the best feature set for the sub-band SNR estimation. The features were combined in various configurations and evaluated by means of objective measures and informal listening. The feature selection process resulted in a set comprising the spectral energy, the spectral flux, the spectral flatness, the spectral bias (skewness), LPC and RASTA-PLP coefficients. The spectral energy, spectral flux, spectral flatness and spectral bias features are computed from spectral coefficients grouped according to a critical-band scale.
These features are explained in detail with reference to fig. 6. Additional features are the Delta feature of the spectral energy and the Delta-Delta feature of the low pass filtered spectral energy.
The structure of the neural network used in blocks 15, 20 of fig. 2, in block 15 of fig. 4, or preferably in the feature combiner of fig. 1, is discussed in connection with fig. 10. The preferred neural network comprises a layer 100 of input neurons. In general, n input neurons can be employed, i.e., one neuron per input feature. Preferably, the neural network has 220 input neurons, corresponding to the number of features. The neural network further comprises a hidden layer 102 with p hidden-layer neurons, where p is typically smaller than n; in the preferred embodiment, the hidden layer has 50 neurons. On the output side, the neural network comprises an output layer 104 with q output neurons. The number of output neurons equals the number of frequency bands, so that each output neuron provides the control information for one frequency band, e.g., the SNR (speech-to-noise ratio) of that band. If there are, for example, 25 different frequency bands, preferably with a bandwidth increasing from low to high frequencies, the number q of output neurons equals 25. The neural network is thus applied to estimate the sub-band SNRs from the computed low-level features. The activation function of the hidden neurons is the hyperbolic tangent, and the activation function of the output neurons is the identity.
Typically, each neuron of layer 102 or 104 receives all corresponding inputs; for layer 102, these are the outputs of all input neurons. Each input of a neuron of layer 102 or 104 is weighted, and the weighted inputs are summed, where the weighting parameters correspond to the combination parameters. In addition to the weights, a neuron of the hidden layer may have a bias value, and the bias values also belong to the combination parameters. In particular, the individual inputs are weighted by their respective combination parameters, as shown in the exemplary box 106 of fig. 10, and the outputs of the weighting operations are input into the adder 108 of each neuron. A non-linear function 110 may be applied to the output of the adder, i.e., at the output and/or the input of a neuron of, e.g., the hidden layer, as the case may be.
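A forward pass through such a network can be sketched in a few lines of Python. The weight matrices, their random initialization and the input values below are placeholders; only the layer sizes (220 inputs, 50 tanh hidden units, one linear output per band) and the activation functions follow the description above.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """220 input features -> 50 tanh hidden neurons -> q linear outputs,
    one SNR estimate per frequency band."""
    h = np.tanh(x @ W1 + b1)   # hidden layer, hyperbolic-tangent activation
    return h @ W2 + b2         # output layer, identity activation

rng = np.random.default_rng(0)
n_in, n_hidden, n_bands = 220, 50, 25
W1, b1 = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_bands)), np.zeros(n_bands)

features = rng.normal(size=n_in)                        # one frame of extracted features
snr_estimates = mlp_forward(features, W1, b1, W2, b2)   # one SNR value per band
```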
The weights of the neural network are trained on mixtures of clean speech signals and background noise, while the reference SNRs are computed from the separate clean speech and background noise signals. The training process is shown on the left-hand side of fig. 4. Speech and noise are mixed with an SNR of 3 dB per item and fed into the feature extraction. This SNR is constant over time and is a broadband SNR value. The data set comprises 2304 combinations of 48 speech signals and 48 noise signals, each of 2.5 seconds length. The speech signals stem from 7 different speakers. The noise signals are recordings of traffic noise, crowd noise and various natural atmospheres.
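Since speech and noise are available separately during training, the reference sub-band SNR can be computed directly from the two signals. The sketch below shows one plausible way to do this; the FFT size, hop size, window and the band_edges structure (a list of FFT-bin index pairs for the Bark-like bands) are illustrative assumptions, not values given in the text.

```python
import numpy as np

def reference_subband_snr(speech, noise, band_edges, n_fft=1024, hop=512):
    """Reference SNR per band and frame from the *separate* speech and noise
    signals; band_edges is a list of (lo, hi) FFT-bin index pairs."""
    win = np.hanning(n_fft)
    snr = []
    for i in range(0, min(len(speech), len(noise)) - n_fft, hop):
        S = np.abs(np.fft.rfft(win * speech[i:i + n_fft])) ** 2   # speech power spectrum
        B = np.abs(np.fft.rfft(win * noise[i:i + n_fft])) ** 2    # noise power spectrum
        s_band = np.array([S[lo:hi].sum() for lo, hi in band_edges])
        b_band = np.array([B[lo:hi].sum() for lo, hi in band_edges])
        snr.append(10 * np.log10(s_band / (b_band + 1e-12)))      # per-band SNR in dB
    return np.vstack(snr)   # shape: (n_frames, n_bands)
```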
For a given spectral weighting rule, two definitions of the neural network output are sensible: the neural network can be trained using reference values of the time-varying sub-band SNR R(ω) or using the spectral weights G(ω) derived from these SNR values. Simulations with the sub-band SNR as reference value yield better objective results and better ratings in informal listening than a network trained on the spectral weights. The neural network is trained using 100 iteration cycles. The training algorithm employed in this work is based on scaled conjugate gradients.
A preferred embodiment of the spectral weighting operation 12 will be discussed next.
The estimated sub-band SNRs are linearly interpolated to the frequency resolution of the input spectrum and transformed to linear ratios. The linear sub-band SNRs are smoothed along time and along frequency using IIR low-pass filtering in order to reduce artifacts caused by estimation errors. The low-pass filtering along frequency is further needed to reduce the effect of circular convolution, which occurs if the impulse response of the spectral weighting exceeds the length of the DFT frame. It is applied twice, and the second filtering pass (starting with the last sample) is performed in reverse order, so that the resulting filter has zero phase.
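The forward-backward filtering described here can be sketched as follows. The first-order low-pass and its coefficient a = 0.9 are assumptions for illustration, since the text does not specify the order or the coefficients of the IIR low-pass.

```python
import numpy as np

def zero_phase_smooth(x, a=0.9):
    """First-order IIR low-pass applied forward and then backward (starting
    with the last sample), so that the cascade has zero phase."""
    x = np.asarray(x, dtype=float)

    def onepole(v):
        y = np.empty_like(v)
        acc = v[0]
        for n, s in enumerate(v):
            acc = a * acc + (1.0 - a) * s
            y[n] = acc
        return y

    return onepole(onepole(x)[::-1])[::-1]   # forward pass, then reversed pass
```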
Fig. 5 shows the gain factor as a function of the SNR. The applied gain (solid line) is compared to the spectral subtraction gain (dotted line) and the Wiener filter (dashed line).
The spectral weights are calculated according to the modified spectral subtraction rule in equation 5 and are limited to −18 dB.
The parameters α = 3.5 and β = 1 were determined experimentally. The particular attenuation at 0 dB SNR is chosen so as to avoid distortions of the speech signal at the expense of residual noise. The attenuation curve as a function of the SNR is shown in fig. 5.
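Equation 5 itself is not reproduced in the text above, so the sketch below uses a generic Wiener-like attenuation rule of the form (R/(R+α))^β with the stated parameters and the −18 dB floor purely as a hypothetical stand-in; the actual rule of equation 5 may differ.

```python
import numpy as np

ALPHA, BETA = 3.5, 1.0       # parameters given in the text
FLOOR = 10 ** (-18 / 20)     # spectral weights are limited to -18 dB

def spectral_weight(snr_lin):
    """Hypothetical stand-in for equation 5: attenuation as a function of the
    smoothed linear sub-band SNR, floored at -18 dB."""
    g = (snr_lin / (snr_lin + ALPHA)) ** BETA
    return np.maximum(g, FLOOR)
```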
Fig. 9 shows an example of the spectral weighting: the input time signal, the estimated sub-band SNRs, the estimated SNR per frequency bin after interpolation, the spectral weights and the processed time signal.
Fig. 6 shows an overview of the preferred features extracted by the feature extractor 14. For each low-resolution frequency band, i.e., for each of the, e.g., 25 frequency bands for which an SNR and a gain value are required, the feature extractor extracts at least one feature representing the spectral shape of the short-time spectral representation in that band. The spectral shape within a band describes the distribution of the energy within the band and can be implemented by a number of different calculation rules.
A preferred spectral shape feature is the spectral flatness measure (SFM), which is the geometric mean of the spectral values divided by the arithmetic mean of the spectral values. In the definition of the geometric/arithmetic mean, a power can be applied to each spectral value in a band before the n-th root or the averaging operation is performed.
In general, a spectral flatness measure can also be calculated with different powers applied to the spectral values in the numerator and in the denominator of the SFM calculation formula. Then, both the numerator and the denominator may include arithmetic mean calculations. Exemplarily, the power in the numerator is 2 and the power in the denominator is 1; generally, the power for the numerator only needs to be greater than the power for the denominator in order to obtain a generalized spectral flatness measure.
From this calculation formula it becomes clear that the SFM is equal to 1 for a band in which the energy is evenly distributed over all frequency lines of the band, whereas the SFM approaches a small value close to 0 when the energy is concentrated in, for example, a single spectral value within the band. Thus, a high SFM value indicates an even distribution of the energy over the band, and a small SFM value indicates that the energy is concentrated at a certain position within the band.
A further spectral shape feature is the spectral bias (skewness), which measures the asymmetry of the distribution around its centroid. Other features relating to the spectral shape of the short-time frequency representation within a frequency band exist as well.
Apart from the spectral shape feature calculated per frequency band, there are further features calculated per frequency band, as shown in fig. 6 and discussed in detail below. Additionally, there are features that are not calculated per frequency band, but for the full bandwidth.
Spectral energy
The spectral energy is calculated for each time frame and frequency band and is normalized by the total energy of the frame. Additionally, the spectral energy is low-pass filtered over time using a second-order IIR filter.
Spectral flux
Spectral flux is defined as the dissimilarity between the spectra of successive frames and is typically implemented by means of a distance function. Here, the spectral flux is calculated using the Euclidean distance according to equation 6, with the spectral coefficients X(m, k), the time frame index m, the sub-band index r, and the lower and upper limits l_r and u_r of the frequency band, respectively:

SF(m, r) = ( Σ_{k=l_r}^{u_r} (|X(m, k)| − |X(m−1, k)|)² )^(1/2) (6)
Spectral flatness measurement
There are various definitions for the calculation of the flatness of a vector or of the tonality of a spectrum (which is inversely related to the flatness). The spectral flatness measure SFM used here is calculated as the ratio of the geometric mean to the arithmetic mean of the L spectral coefficients of the sub-band signal, as shown in equation 7:

SFM(m, r) = ( Π_{k=l_r}^{u_r} |X(m, k)| )^(1/L) / ( (1/L) Σ_{k=l_r}^{u_r} |X(m, k)| ) (7)
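The three band-wise features defined so far (normalized spectral energy, spectral flux per equation 6, spectral flatness per equation 7) can be computed for one frame as sketched below. The band_edges bin indices and the small regularization constants are assumptions for illustration.

```python
import numpy as np

def band_features(X_prev, X_cur, band_edges):
    """Per-band spectral energy, flux (eq. 6) and flatness (eq. 7) for one
    frame; X_prev/X_cur are magnitude spectra, band_edges are (lo, hi) bins."""
    total = np.sum(X_cur ** 2) + 1e-12
    energy, flux, sfm = [], [], []
    for lo, hi in band_edges:
        band = X_cur[lo:hi]
        energy.append(np.sum(band ** 2) / total)                   # energy, frame-normalized
        flux.append(np.sqrt(np.sum((band - X_prev[lo:hi]) ** 2)))  # Euclidean distance
        geo = np.exp(np.mean(np.log(band + 1e-12)))                # geometric mean
        sfm.append(geo / (np.mean(band) + 1e-12))                  # geometric / arithmetic mean
    return np.array(energy), np.array(flux), np.array(sfm)
```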
Spectral bias
The spectral bias (skewness) of a distribution measures the asymmetry around its centroid and is defined as the third central moment of the random variable divided by the cube of its standard deviation.
Linear prediction coefficient
The LPCs are the coefficients of an all-pole filter that predicts the current value x(k) of a time series from the preceding values such that the squared prediction error is minimized.
The LPC is calculated by means of an autocorrelation method.
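A compact sketch of the autocorrelation method, implemented via the Levinson-Durbin recursion, is shown below; the prediction order and the sign convention of the coefficients are assumptions, as the text does not fix them.

```python
import numpy as np

def lpc_autocorr(x, order):
    """LPC via the autocorrelation method (Levinson-Durbin recursion).
    Returns the prediction coefficients a_1..a_p and the residual energy."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coefficient
        if i:
            a[:i] -= k * a[i - 1::-1]                     # update previous coefficients
        a[i] = k
        err *= 1.0 - k * k                                # remaining prediction error
    return a, err
```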
Mel-frequency cepstral coefficients (MFCC)
The power spectrum is warped according to the mel scale using triangular weighting functions with unit weight per frequency band. The MFCCs are calculated by taking the logarithm and computing a discrete cosine transform.
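As a sketch, assuming a precomputed triangular mel filter-bank matrix mel_fb (its design is not specified in the text) and an illustrative choice of 13 coefficients:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(power_spectrum, mel_fb, n_coeffs=13):
    """mel_fb: (n_mel_bands, n_bins) matrix of unit-weight triangular filters.
    Warps the power spectrum to the mel scale, takes the log, applies a DCT."""
    mel_energies = mel_fb @ power_spectrum                          # mel-scale warping
    return dct(np.log(mel_energies + 1e-12), norm="ortho")[:n_coeffs]
```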
Relative spectral perceptual linear prediction (RASTA-PLP) coefficients
The RASTA-PLP coefficients are calculated from the power spectrum according to the following steps [H. Hermansky, N. Morgan, "RASTA Processing of Speech", IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994]:
1. Amplitude compression of the spectral coefficients
2. Band-pass filtering of the sub-band energies along time
3. Amplitude expansion, i.e., the inverse of the compression of step 1
4. Multiplication with weights corresponding to an equal-loudness curve
5. Simulation of the power law of loudness perception by raising the coefficients to the power of 0.33
6. Calculation of an all-pole model of the resulting spectrum by means of the autocorrelation method
Perceptual Linear Prediction (PLP) coefficients
The PLP coefficients are calculated similarly to the RASTA-PLP coefficients, but without applying steps 1-3 [H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, 1990].
Delta feature
Delta features have been successfully applied in the past in automatic speech recognition and audio content classification. There are various ways to compute them. Here, the time series of a feature is convolved with a linear slope of length 9 samples (the sampling rate of the feature time series equals the frame rate of the STFT). The delta-delta features are obtained by applying the delta operation to the delta features.
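A sketch of this operation is given below; the zero-mean ramp and its normalization are assumptions, as the text only fixes the slope length of 9 samples.

```python
import numpy as np

def delta(feature_series, slope_len=9):
    """Delta feature: convolve the time series of a feature with a linear
    slope of length 9 samples."""
    ramp = np.arange(slope_len) - (slope_len - 1) / 2.0   # linear slope, zero mean
    ramp /= np.sum(ramp ** 2)                             # hypothetical normalization
    return np.convolve(feature_series, ramp[::-1], mode="same")

def delta_delta(feature_series):
    """Delta-delta feature: the delta operation applied to the delta feature."""
    return delta(delta(feature_series))
```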
As mentioned above, a band division with low-resolution frequency bands is preferred, which resembles the perceptual situation of the human auditory system; therefore, a logarithmic or Bark-like band division is preferred. This means that the bands with a low center frequency are narrower than the bands with a high center frequency. In the calculation of the spectral flatness measure, for example, the summation extends from the value l_r, which is usually the lowest frequency value of a band, to the value u_r, which is the highest spectral value of the predefined band. In order to obtain a better spectral flatness measure, it is preferred, for the low bands, to additionally use some or all spectral values of the lower and/or higher adjacent frequency band. This means that, for example, the spectral flatness measure for the second band is calculated using the spectral values of the second band and, additionally, the spectral values of the first and/or the third band. In a preferred embodiment, not only the spectral values of either the first or the third band are used, but the spectral values of both the first and the third band. This means that, when calculating the SFM for the second band, the summation in equation (7) extends from the lowest spectral value l_r of the first band to the highest spectral value u_r of the third band. Thus, a spectral shape feature based on a higher number of spectral values can be calculated, up to a certain bandwidth at which the number of spectral values within a band itself is sufficient, so that l_r and u_r then denote spectral values of the same low-resolution frequency band.
As to the linear prediction coefficients extracted by the feature extractor, it is preferred to use the LPCs a_j of equation (8), the residual/error value remaining after the optimization, or any combination thereof, such as a multiplication or an addition with a normalization factor, so that both the coefficients and the squared error value influence the LPC feature extracted by the feature extractor.
An advantage of the spectral shape feature is that it is a low-dimensional feature. Considering, for example, a frequency band with 10 complex or real spectral values, the use of all 10 spectral values would be of little benefit and a waste of computational resources. Therefore, a spectral shape feature is extracted whose dimension is smaller than the dimension of the raw data. When, for example, the energy is considered, the raw data has a dimension of 10, i.e., 10 squared spectral values. In order to extract a spectral shape feature that can be used efficiently, a feature with a dimension smaller than the dimension of the raw data, preferably with a dimension of 1 or 2, is extracted. A similar dimension reduction of the raw data can be obtained, for example, by fitting a low-order polynomial to the spectral envelope of a frequency band. When, for example, only two or three parameters are fitted, the spectral shape feature comprises these two or three parameters of the polynomial or of any other parameterized function. Generally, all parameters that describe the distribution of the energy within a band and that have a low dimension, i.e., less than 50% of the dimension of the raw data, preferably less than 30% or even less than 5%, are useful.
It has been found that an advantageous behavior of the apparatus for processing an audio signal can already be obtained using only the spectral shape features, but preferably at least one additional band-wise feature is used. An additional band-wise feature that has been shown to improve the results is the spectral energy per band, calculated for each time frame and frequency band and normalized by the total energy of the frame; this feature may or may not be low-pass filtered. Furthermore, it has been found that adding the spectral flux feature per band beneficially enhances the performance, so that an efficient setup with good performance is obtained when, in addition to the spectral shape feature of each band, the spectral energy feature and the spectral flux feature of each band are used. Additional features can enhance the performance of the inventive apparatus even further.
As described for the spectral energy feature, a low-pass filtering over time or a moving-average normalization over time can, but does not have to, be applied to any of these features. In the above example, the average of, for example, the five preceding spectral shape features of a band is calculated, and the result of this calculation is used as the spectral shape feature of that band for the current frame. The averaging can also be performed bi-directionally, so that not only features from the past but also features from the "future" are used for calculating the current feature.
Figs. 7 and 8 are discussed subsequently in order to provide a preferred embodiment of the feature extractor 14 of fig. 1, fig. 2 or fig. 4. In a first step 70, the audio signal is windowed to provide blocks of audio sample values. Preferably, an overlap is applied. This means that, due to the overlap, one and the same audio sample occurs in two successive frames, where an overlap of 50% of the audio sample values is preferred. In step 71, a time/frequency transform of a block of windowed audio sample values is performed in order to obtain a frequency representation with a first, high resolution. To this end, a short-time Fourier transform, implemented with an efficient FFT, is applied. When step 71 is performed several times with temporally subsequent blocks of audio sample values, a spectrogram is obtained. In step 72, the high-resolution spectral information, i.e., the high-resolution spectral values, are grouped into low-resolution frequency bands. When, for example, an FFT with 1024 or 2048 input values is performed, 1024 or 2048 spectral values are available, but such a high resolution is neither required nor desired. Instead, the grouping step 72 divides the high-resolution spectrum into a small number of bands, e.g., bands with a varying bandwidth, as known, for example, from the Bark bands or from a logarithmic band division. Then, subsequent to the grouping step 72, the calculation 73 of the spectral shape feature, and preferably of the other features, is performed for each of the low-resolution bands. Although not shown in fig. 7, the additional features relating to the full bandwidth can be calculated from the data obtained in step 70, since these full-bandwidth features do not require the spectral separation obtained by step 71 or step 72.
Step 73 produces, per frequency band, a spectral shape feature with m dimensions, where m is smaller than the number n of spectral values in the band, and preferably equal to 1 or 2. This means that the information available for the current frequency band after step 72 is compressed into low-dimensional information by the feature extraction operation of step 73.
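Steps 70 to 72 described above can be sketched as follows; the FFT size, the Hann window and the band_edges bin indices are illustrative assumptions.

```python
import numpy as np

def stft_band_spectra(x, band_edges, n_fft=1024, hop=512):
    """Steps 70-72: windowed frames with 50% overlap, high-resolution STFT,
    then grouping of the bins into the low-resolution (Bark-like) bands
    given by band_edges (a list of (lo, hi) bin index pairs)."""
    win = np.hanning(n_fft)
    spectra, band_energies = [], []
    for i in range(0, len(x) - n_fft, hop):
        X = np.abs(np.fft.rfft(win * x[i:i + n_fft]))      # step 71: high resolution
        spectra.append(X)
        band_energies.append([np.sum(X[lo:hi] ** 2) for lo, hi in band_edges])  # step 72
    return np.array(spectra), np.array(band_energies)
```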
As indicated in fig. 7 next to steps 71 and 72, the order of the time/frequency transform and the grouping can also be interchanged. The output of step 70 may be filtered with a low-resolution filter bank, implemented, for example, such that 25 sub-band signals are obtained at its output. A high-resolution analysis of each sub-band can then be performed in order to obtain the raw data for the spectral shape feature calculation. This can be done, for example, by an FFT analysis of each sub-band signal or by any other analysis of the sub-band signals, e.g., by a further, cascaded filter bank.
Fig. 8 shows preferred steps for implementing the controllable filter 12 of fig. 1 or the spectral weighter 12 shown in fig. 3 or fig. 4. After the determination, in step 80, of the low-resolution band-wise control information (e.g., the sub-band SNR values) output by the neural network regression block 15 of fig. 4, an interpolation to the high resolution is performed in step 81.
The goal is to obtain a weighting factor for each spectral value obtained by the short-time Fourier transform performed in step 30 of fig. 3, in step 71, or in the alternative steps shown to the right of steps 71 and 72. After step 81, an SNR value is available for each spectral value. However, this SNR value is still in the logarithmic domain, and step 82 therefore transforms each high-resolution SNR value from the logarithmic domain into the linear domain.
In step 83, the linear SNR values for each spectral value (i.e., at the high resolution) are smoothed over time and frequency, for example using an IIR low-pass filter or, alternatively, an FIR low-pass filter; in general, any moving-average operation can be applied. In step 84, the spectral weight for each high-resolution frequency value is calculated based on the smoothed linear SNR values. This calculation relies on the function shown in fig. 5; however, while the function in this figure is specified over logarithmic quantities, the spectral weights in step 84 are calculated in the linear domain.
In step 85, each spectral value is multiplied by its spectral weight in order to obtain a set of high-resolution spectral values multiplied by the set of spectral weights. In step 86, a frequency-time transform of this processed spectrum is performed. Depending on the application scenario and on the overlap employed in step 70, a cross-fading operation between two blocks of time-domain audio sample values obtained by two subsequent frequency-time transforms can be performed in order to avoid blocking artifacts.
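Steps 85 and 86 can be sketched as follows; the Hann synthesis window, the 50% overlap-add and the absence of an explicit window-normalization term are simplifying assumptions rather than the exact cross-fade of the text.

```python
import numpy as np

def apply_spectral_weights(X, gains, win):
    """Steps 85-86: multiply each high-resolution spectral value by its weight
    and transform the processed spectrum back into the time domain; a
    synthesis window smooths the block transitions."""
    return np.fft.irfft(X * gains) * win

def overlap_add(blocks, hop):
    """Overlap-add of successive processed blocks (50% overlap assumed)."""
    n_fft = blocks[0].size
    out = np.zeros(hop * (len(blocks) - 1) + n_fft)
    for m, b in enumerate(blocks):
        out[m * hop:m * hop + n_fft] += b
    return out
```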
Additional windows may be applied to reduce the artifacts of the circular convolution.
The result of step 86 is a block of audio sample values with improved speech performance, i.e., speech that is perceived better than in the corresponding input signal without speech enhancement.
Depending on certain implementation requirements of the inventive methods, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a disc, a DVD or a CD having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed. Generally, the present invention is, therefore, also a computer program product with a program code stored on a machine-readable carrier for performing the inventive methods when the computer program product runs on a computer. In other words, the invention can thus also be implemented as a computer program having a program code for performing the methods when the computer program runs on a computer.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that: there are many alternative ways of implementing the methods and compositions of the present invention. Therefore, it is desirable that: it is intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.

Claims (15)

1. An apparatus for processing an audio signal to obtain control information for a speech enhancement filter, comprising:
a feature extractor for obtaining a time sequence of short-time spectral representations of the audio signal and for extracting at least one feature in each of a plurality of frequency bands for the plurality of short-time spectral representations, the at least one feature representing a spectral shape of the short-time spectral representation in a frequency band of the plurality of frequency bands, wherein the feature extractor is operable to extract at least one additional feature representing a characteristic of the short-time spectral representation that is different from the spectral shape; and
a feature combiner for combining at least one feature for each frequency band using a combining parameter to obtain control information for a speech enhancement filter for a time portion of the audio signal, wherein the feature combiner is operable to combine the at least one additional feature with the at least one feature for each frequency band using the combining parameter.
2. Apparatus in accordance with claim 1, in which the feature extractor is operative to apply a frequency transform operation in which a sequence of spectral representations is obtained for a sequence of time instants, the spectral representations having frequency bands with non-uniform bandwidths, the bandwidths being larger as a center frequency of the frequency bands increases.
3. Apparatus in accordance with claim 1, in which the feature extractor is operative to calculate as the first feature a spectral flatness measure for each band representing an energy distribution in the band, or to calculate as the second feature a normalized energy measure for each band, the normalization being based on a total energy of a signal frame from which the spectral representation is derived, and
wherein the feature extractor is operable to employ the spectral flatness measure for a band or a normalized energy for each band.
4. Apparatus in accordance with claim 1, in which the feature extractor is operative to additionally extract a spectral flux measure for each band representing a similarity or dissimilarity between temporally successive spectral representations or to extract a spectral bias measure representing an asymmetry around a centroid.
5. Apparatus in accordance with claim 1, in which the feature extractor is operative to additionally extract LPC features comprising an LPC error signal, linear prediction coefficients up to a predetermined order or a combination of the LPC error signal and the linear prediction coefficients, or in which the feature extractor is operative to additionally extract PLP coefficients or RASTA-PLP coefficients or mel-frequency cepstral coefficients or Delta features.
6. Apparatus in accordance with claim 5, in which the feature extractor is operative to calculate linear prediction coefficient features for a block of time-domain audio samples, the block comprising audio samples for extracting the at least one feature representing a spectral shape for each frequency band.
7. Apparatus in accordance with claim 1, in which the feature extractor is operative to calculate a shape of a spectrum in a frequency band using only spectral information of the frequency band and spectral information of one or two directly adjacent frequency bands of the frequency band.
8. Apparatus in accordance with claim 1, in which the feature extractor is operative to extract raw feature information for each feature of each block of audio samples and to merge a sequence of raw feature information in a frequency band to obtain the at least one feature for the frequency band.
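As a small illustration of the merging step of claim 8, the raw per-block values of one feature in one band can be merged over a short temporal context, here by a moving average; both the context length and the averaging rule are illustrative assumptions rather than requirements of the claim.

```python
import numpy as np

def merge_raw_features(raw_values, context=5):
    """Merge a sequence of raw feature values of one frequency band by
    averaging over a short temporal context (one merging rule among
    many possible ones)."""
    kernel = np.ones(context) / context
    return np.convolve(raw_values, kernel, mode="same")
```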
9. Apparatus in accordance with claim 1, in which the feature extractor is operative to calculate a plurality of spectral values for each frequency band and to combine the plurality of spectral values to obtain the at least one feature representing the spectral shape such that the at least one feature has a number of dimensions smaller than the number of spectral values in the frequency band.
10. A method of processing an audio signal to obtain control information for a speech enhancement filter, comprising:
obtaining a time sequence of short-time spectral representations of the audio signal;
extracting at least one feature in each frequency band of a plurality of frequency bands for a plurality of the short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands, wherein at least one additional feature is extracted, the at least one additional feature representing a characteristic of a short-time spectral representation different from the spectral shape; and
combining the at least one feature for each frequency band using combination parameters to obtain the control information for the speech enhancement filter for a time portion of the audio signal, wherein the at least one additional feature is combined with the at least one feature for each frequency band using the combination parameters.
11. An apparatus for speech enhancement in an audio signal, comprising:
means for processing the audio signal to obtain filter control information for a plurality of bands representing a time portion of the audio signal, the means for processing comprising:
a feature extractor for obtaining a time sequence of short-time spectral representations of the audio signal and for extracting at least one feature in each frequency band of a plurality of frequency bands for a plurality of the short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; and
a feature combiner for combining the at least one feature for each frequency band using combination parameters to obtain the filter control information for a controllable filter for a time portion of the audio signal; and
the controllable filter, the controllable filter being controllable so that a band of the audio signal is variably attenuated with respect to a different band based on the filter control information.
12. The apparatus of claim 11, wherein the means for processing the audio signal comprises a time-frequency transformer providing spectral information at a resolution higher than the spectral resolution for which the filter control information is provided; and
wherein the apparatus additionally comprises a control information post-processor for interpolating the filter control information to the resolution higher than the spectral resolution and for smoothing the interpolated control information to obtain post-processed control information, controllable filter parameters of the controllable filter being set based on the post-processed control information.
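A minimal sketch of the post-processing of claim 12, assuming linear interpolation and a moving-average smoother, neither of which is mandated by the claim: the per-band control values are interpolated to the finer bin resolution at which the controllable filter operates and then smoothed.

```python
import numpy as np

def postprocess_control_info(band_values, edges, n_bins, smooth_len=5):
    """Interpolate per-band control values to the finer bin resolution
    of the controllable filter and smooth the result. Linear
    interpolation and a short moving average are illustrative choices."""
    centers = np.array([(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])])
    interpolated = np.interp(np.arange(n_bins), centers, band_values)
    kernel = np.ones(smooth_len) / smooth_len
    return np.convolve(interpolated, kernel, mode="same")
```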
13. A method of speech enhancement in an audio signal, comprising:
processing the audio signal to obtain filter control information for a plurality of bands representing a time portion of the audio signal, the processing comprising:
obtaining a time sequence of short-time spectral representations of the audio signal;
extracting at least one feature in each frequency band of a plurality of frequency bands for a plurality of the short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; and
combining the at least one feature for each frequency band using combination parameters to obtain the filter control information for the time portion of the audio signal; and
controlling a controllable filter such that a band of the audio signal is variably attenuated with respect to a different band based on the filter control information.
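The controllable filter of claims 11 and 13 can be realized as a spectral weighting, as the following fragment illustrates: each bin of a windowed frame is scaled by its post-processed control gain, so that bands are attenuated variably with respect to one another. Overlap-add bookkeeping across successive frames is omitted for brevity, and the names are illustrative.

```python
import numpy as np

def filter_frame(frame, bin_gains):
    """Spectral-weighting sketch of the controllable filter.
    bin_gains must have len(frame) // 2 + 1 entries, one per rfft bin,
    e.g. the output of postprocess_control_info above."""
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)
    return np.fft.irfft(spectrum * bin_gains, n=len(frame))
```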
14. An apparatus for training a feature combiner for determining combination parameters of the feature combiner, comprising:
a feature extractor for obtaining a time sequence of short-time spectral representations of a training audio signal, for which control information for a speech enhancement filter is known for each frequency band, and for extracting at least one feature in each frequency band of a plurality of frequency bands for a plurality of the short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands; and
an optimization controller for providing the at least one feature for each frequency band to the feature combiner, for calculating the control information using intermediate combination parameters, for changing the intermediate combination parameters to obtain changed control information, for comparing the changed control information with the known control information, and for updating the intermediate combination parameters when the changed intermediate combination parameters yield control information that better matches the known control information.
15. A method for training a feature combiner for determining combination parameters of the feature combiner, comprising:
obtaining a time sequence of short-time spectral representations of a training audio signal for which control information for a speech enhancement filter is known for each frequency band;
extracting at least one feature in each frequency band of a plurality of frequency bands for a plurality of the short-time spectral representations, the at least one feature representing a spectral shape of a short-time spectral representation in a frequency band of the plurality of frequency bands;
providing the at least one feature for each frequency band to the feature combiner;
calculating the control information using intermediate combination parameters;
changing the intermediate combination parameters to obtain changed control information;
comparing the changed control information with the known control information; and
updating the intermediate combination parameters when the changed intermediate combination parameters result in control information that better matches the known control information.
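The training loop of claims 14 and 15 can be traced with the following hill-climbing sketch over the parameters of a simple linear combiner: intermediate combination parameters are perturbed, the resulting control information is compared with the known reference, and the parameters are updated whenever the match improves. The patent's preferred combiner is a neural network trained in a dedicated training phase; this sketch only illustrates the claimed steps and is not the preferred embodiment.

```python
import numpy as np

def train_combiner(features, known_control, n_iters=2000, step=0.01, seed=0):
    """Hill-climbing sketch of the claimed training procedure for a
    linear feature combiner. features: (n_frames, n_features) matrix;
    known_control: (n_frames,) reference control values per frame."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(features.shape[1])   # intermediate combination parameters

    def mismatch(w):
        # mean squared error between produced and known control information
        return np.mean((features @ w - known_control) ** 2)

    best = mismatch(weights)
    for _ in range(n_iters):
        candidate = weights + step * rng.standard_normal(weights.shape)
        err = mismatch(candidate)
        if err < best:   # changed parameters match the known control info better
            weights, best = candidate, err
    return weights
```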
HK11113430.8A 2008-08-05 2009-08-03 Apparatus and method for processing an audio signal for speech enhancement using a feature extraction HK1159300B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US8636108P 2008-08-05 2008-08-05
US61/086,361 2008-08-05
US10082608P 2008-09-29 2008-09-29
EP08017124.2A EP2151822B8 (en) 2008-08-05 2008-09-29 Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
EP08017124.2 2008-09-29
US61/100,826 2008-09-29
PCT/EP2009/005607 WO2010015371A1 (en) 2008-08-05 2009-08-03 Apparatus and method for processing an audio signal for speech enhancement using a feature extraction

Publications (2)

Publication Number Publication Date
HK1159300A1 (en) 2012-07-27
HK1159300B (en) 2014-04-25
