HK1184589B - Systems, methods, and apparatus for wideband encoding and decoding of inactive frames - Google Patents
- Publication number
- HK1184589B (application HK13111834.2A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- frame
- encoded
- description
- frequency band
- speech signal
Abstract
Systems, methods, and apparatus for wideband encoding and decoding of inactive frames. Speech encoders and methods of speech encoding are disclosed that encode inactive frames at different rates. Apparatus and methods for processing an encoded speech signal are disclosed that calculate a decoded frame based on a description of a spectral envelope over a first frequency band and the description of a spectral envelope over a second frequency band, in which the description for the first frequency band is based on information from a corresponding encoded frame and the description for the second frequency band is based on information from at least one preceding encoded frame. Calculation of the decoded frame may also be based on a description of temporal information for the second frequency band that is based on information from at least one preceding encoded frame.
Description
Related information of divisional application
The present application is a divisional application of the original Chinese invention patent application entitled "Systems, methods, and apparatus for wideband encoding and decoding of inactive frames," having application number 200780027806.8 and a filing date of July 31, 2007.
Related application
The present application claims priority from United States Provisional Patent Application No. 60/834,688, entitled "UPPERBAND DTX SCHEME" (upper band discontinuous transmission scheme), filed July 31, 2006.
Technical Field
The present invention relates to the processing of speech signals.
Background
Voice transmission by digital techniques has become widespread, particularly in long-distance telephony, packet-switched telephony such as Voice over IP (also called VoIP, where IP denotes Internet Protocol), and digital radio telephony such as cellular telephony. This rapid spread has created interest in reducing the amount of information used to carry a voice communication over a transmission channel while maintaining the perceived quality of the reconstructed speech.
A device configured to compress speech by extracting parameters related to a model of human speech generation is called a "speech coder." A speech coder typically comprises an encoder and a decoder. The encoder typically divides the incoming speech signal (a digital signal representing audio information) into segments of time called "frames," analyzes each frame to extract certain relevant parameters, and quantizes the parameters into an encoded frame. The encoded frames are transmitted over a transmission channel (e.g., a wired or wireless network connection) to a receiver that includes a decoder. The decoder receives and processes the encoded frames, dequantizes them to produce the parameters, and recreates the speech frames using the dequantized parameters.
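As a rough illustration of this encode path (not part of the patent disclosure), the following Python sketch divides a signal into frames and produces one encoded frame per input frame. The frame length, the sampling rate, and the `analyze` and `quantize` helpers are assumptions for illustration only.

```python
import numpy as np

FRAME_MS = 20          # typical frame length noted below
SAMPLE_RATE_HZ = 8000  # narrowband example: 160 samples per 20-ms frame

def split_into_frames(signal, samples_per_frame):
    """Divide the incoming signal into consecutive non-overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // samples_per_frame
    return signal[:n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)

def analyze(frame):
    """Placeholder analysis: extract a trivial parameter set from the frame."""
    return {"energy": float(np.mean(frame ** 2))}

def quantize(params):
    """Placeholder quantizer standing in for codebook lookup."""
    return params

def encode(signal):
    samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
    return [quantize(analyze(f)) for f in split_into_frames(signal, samples_per_frame)]
```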
In a typical conversation, each speaker is silent about sixty percent of the time. Speech encoders are typically configured to distinguish frames of the speech signal that contain speech ("active frames") from frames that contain only silence or background noise ("inactive frames"). Such an encoder may be configured to use different coding modes and/or rates to encode active and inactive frames. For example, speech encoders are typically configured to use fewer bits to encode an inactive frame than to encode an active frame. A speech coder may use lower bit rates for inactive frames to support transfer of the speech signal at a lower average bit rate with little to no perceived loss of quality.
Fig. 1 illustrates the result of encoding a region of a speech signal that includes a transition between an active frame and an inactive frame. Each bar in the drawing indicates a corresponding frame, with the height of the bar indicating the bit rate at which the frame is encoded and the horizontal axis indicating time. In this case, active frames are encoded at a higher bit rate rH and inactive frames are encoded at a lower bit rate rL.
Examples of the bit rate rH include 171 bits per frame, 80 bits per frame, and 40 bits per frame; and an example of the bit rate rL is 16 bits per frame. In the context of cellular telephony systems (especially systems compliant with Interim Standard (IS)-95 as promulgated by the Telecommunications Industry Association, Arlington, VA, or a similar industry standard), these four bit rates are also referred to as "full rate," "half rate," "quarter rate," and "eighth rate," respectively. In one particular example of the results shown in FIG. 1, the rate rH is full rate and the rate rL is eighth rate.
Voice communications over the public switched telephone network (PSTN) have traditionally been limited in bandwidth to the frequency range of 300 to 3400 hertz (Hz). More recent networks for voice communications, such as networks that use cellular telephony and/or VoIP, may not have the same bandwidth limits, and it may be desirable for apparatus using such networks to have the ability to transmit and receive voice communications that include a wideband frequency range. For example, it may be desirable for such an apparatus to support an audio frequency range that extends down to 50 Hz and/or up to 7 or 8 kHz. It may also be desirable for such apparatus to support other applications, such as high-quality audio or audio/video conferencing and delivery of multimedia services such as music and/or television, that may have audio speech content in ranges outside the traditional PSTN limits.
Extension of the range supported by a speech coding device into higher frequencies may improve intelligibility. For example, information in a speech signal that distinguishes fricatives such as "s" and "f" is primarily in higher frequencies. Highband extension may also improve other qualities of the decoded speech signal, such as realism. For example, even voiced vowels may have spectral energy well above the PSTN frequency range.
While it may be desirable for a speech coding device to support a wideband frequency range, it is also desirable to limit the amount of information used to communicate voice communications over a transmission channel. A speech coder may be configured to perform, for example, Discontinuous Transmission (DTX) such that a description is not transmitted for all inactive frames of a speech signal.
Disclosure of Invention
A method of encoding frames of a speech signal according to a configuration comprises: generating a first encoded frame that is based on a first frame of a speech signal and has a length of p bits, where p is a non-zero positive integer; generating a second encoded frame that is based on a second frame of the speech signal and has a length of q bits, where q is a non-zero positive integer other than p; and generating a third encoded frame that is based on a third frame of the speech signal and has a length of r bits, where r is a non-zero positive integer less than q. In this method, the second frame is an inactive frame following the first frame in the speech signal, the third frame is an inactive frame following the second frame in the speech signal, and all frames of the speech signal between the first and third frames are inactive.
A method of encoding frames of a speech signal according to another configuration includes generating a first encoded frame that is based on a first frame of the speech signal and has a length of q bits, where q is a non-zero positive integer. This method also includes generating a second encoded frame that is based on a second frame of the speech signal and has a length of r bits, where r is a non-zero positive integer less than q. In this method, the first and second frames are inactive frames. In this method, the first encoded frame comprises (A) a description of a spectral envelope, over a first frequency band, of a portion of the speech signal that includes the first frame and (B) a description of a spectral envelope, over a second frequency band different from the first frequency band, of a portion of the speech signal that includes the first frame; and the second encoded frame comprises (A) a description of a spectral envelope, over the first frequency band, of a portion of the speech signal that includes the second frame and (B) no description of a spectral envelope over the second frequency band. Devices for performing such operations are also expressly contemplated and disclosed herein. Computer program products comprising a computer-readable medium that comprises code for causing at least one computer to perform such operations are also expressly contemplated and disclosed herein. Apparatus comprising a voice activity detector, a coding scheme selector, and a speech encoder configured to perform such operations are also expressly contemplated and disclosed herein.
An apparatus for encoding frames of a speech signal according to another configuration includes: means for generating a first encoded frame that is based on a first frame of the speech signal and has a length of p bits, where p is a non-zero positive integer; means for generating a second encoded frame that is based on a second frame of the speech signal and has a length of q bits, where q is a non-zero positive integer different from p; and means for generating a third encoded frame that is based on a third frame of the speech signal and has a length of r bits, where r is a non-zero positive integer less than q. In this apparatus, the second frame is an inactive frame that follows the first frame in the speech signal, the third frame is an inactive frame that follows the second frame in the speech signal, and all frames of the speech signal between the first and third frames are inactive.
A computer program product according to another configuration includes a computer-readable medium. The medium includes: code for causing at least one computer to generate a first encoded frame that is based on a first frame of a speech signal and has a length of p bits, where p is a non-zero positive integer; code for causing at least one computer to generate a second encoded frame that is based on a second frame of the speech signal and has a length of q bits, where q is a non-zero positive integer other than p; and code for causing at least one computer to generate a third encoded frame that is based on a third frame of the speech signal and has a length of r bits, where r is a non-zero positive integer less than q. In this product, the second frame is an inactive frame that follows the first frame in the speech signal, the third frame is an inactive frame that follows the second frame in the speech signal, and all frames of the speech signal between the first and third frames are inactive.
An apparatus for encoding frames of a speech signal according to another configuration includes: a voice activity detector configured to indicate, for each of a plurality of frames of the speech signal, whether the frame is active or inactive; a coding scheme selector; and a speech encoder. The coding scheme selector is configured (A) to select a first coding scheme in response to an indication by the voice activity detector for a first frame of the speech signal; (B) to select a second coding scheme for a second frame, which is one of a consecutive series of inactive frames that follows the first frame in the speech signal, in response to an indication by the voice activity detector that the second frame is inactive; and (C) to select a third coding scheme for a third frame, which follows the second frame in the speech signal and is another of the consecutive series of inactive frames, in response to an indication by the voice activity detector that the third frame is inactive. The speech encoder is configured (D) to generate, according to the first coding scheme, a first encoded frame that is based on the first frame and has a length of p bits, where p is a non-zero positive integer; (E) to generate, according to the second coding scheme, a second encoded frame that is based on the second frame and has a length of q bits, where q is a non-zero positive integer other than p; and (F) to generate, according to the third coding scheme, a third encoded frame that is based on the third frame and has a length of r bits, where r is a non-zero positive integer less than q.
A method of processing an encoded speech signal according to one configuration includes obtaining, based on information from a first encoded frame of the encoded speech signal, a description of a spectral envelope of a first frame of the speech signal over (A) a first frequency band and (B) a second frequency band different from the first frequency band. This method also includes obtaining, based on information from a second encoded frame of the encoded speech signal, a description of a spectral envelope of a second frame of the speech signal over the first frequency band. This method also includes obtaining, based on information from the first encoded frame, a description of a spectral envelope of the second frame over the second frequency band.
An apparatus for processing an encoded speech signal according to another configuration includes means for obtaining, based on information from a first encoded frame of the encoded speech signal, a description of a spectral envelope of the first frame of the speech signal over (A) a first frequency band and (B) a second frequency band different from the first frequency band. This apparatus also includes means for obtaining, based on information from a second encoded frame of the encoded speech signal, a description of a spectral envelope of the second frame of the speech signal over the first frequency band. This apparatus also includes means for obtaining, based on information from the first encoded frame, a description of a spectral envelope of the second frame over the second frequency band.
A computer program product according to another configuration includes a computer-readable medium. The medium includes code for causing at least one computer to obtain, based on information from a first encoded frame of an encoded speech signal, a description of a spectral envelope of the first frame of the speech signal over (A) a first frequency band and (B) a second frequency band different from the first frequency band. This medium also includes code for causing at least one computer to obtain, based on information from a second encoded frame of the encoded speech signal, a description of a spectral envelope of the second frame of the speech signal over the first frequency band. This medium also includes code for causing at least one computer to obtain, based on information from the first encoded frame, a description of a spectral envelope of the second frame over the second frequency band.
An apparatus for processing an encoded speech signal according to another configuration includes control logic configured to generate a control signal comprising a sequence of values that is based on coding indices of encoded frames of the encoded speech signal, each value in the sequence corresponding to an encoded frame of the encoded speech signal. This apparatus also includes a speech decoder configured to calculate, in response to a value of the control signal that has a first state, a decoded frame based on a description of a spectral envelope over the first and second frequency bands, the description being based on information from a corresponding encoded frame. The speech decoder is also configured to calculate, in response to a value of the control signal that has a second state different from the first state, a decoded frame based on (1) a description of a spectral envelope over the first frequency band that is based on information from a corresponding encoded frame and (2) a description of a spectral envelope over the second frequency band that is based on information from at least one encoded frame that occurs in the encoded speech signal before the corresponding encoded frame.
Drawings
Fig. 1 illustrates the result of encoding a region of a speech signal that includes a transition between an active frame and an inactive frame.
FIG. 2 shows one example of a decision tree that a speech encoder or speech encoding method may use to select a bit rate.
FIG. 3 illustrates the result of encoding a region of a speech signal using a hangover of four frames.
FIG. 4A shows a graph of a trapezoidal windowing function that may be used to calculate gain shape values.
Fig. 4B shows the application of the windowing function of fig. 4A to each of the five sub-frames of a frame.
FIG. 5A shows one example of a non-overlapping frequency band scheme that may be used by a split-band encoder to encode wideband speech content.
FIG. 5B shows one example of an overlapping frequency band scheme that may be used by a split-band encoder to encode wideband speech content.
FIGS. 6A, 6B, 7A, 7B, 8A, and 8B illustrate the results of encoding a transition from active frames to inactive frames in a speech signal using several different methods.
FIG. 9 illustrates an operation of encoding three successive frames of a speech signal using a method M100 according to a general configuration.
FIGS. 10A, 10B, 11A, 11B, 12A, and 12B illustrate the results of encoding a transition from an active frame to an inactive frame using different implementations of method M100.
Fig. 13A shows the result of encoding a sequence of frames according to another implementation of method M100.
FIG. 13B illustrates the result of encoding a series of inactive frames using yet another implementation of method M100.
FIG. 14 shows an application of an implementation M110 of method M100.
FIG. 15 shows an application of an implementation M120 of method M110.
FIG. 16 shows an application of an implementation M130 of method M120.
FIG. 17A illustrates the result of encoding a transition from an active frame to an inactive frame using an implementation of method M130.
Fig. 17B illustrates the result of encoding a transition from an active frame to an inactive frame using another implementation of method M130.
FIG. 18A is a table showing a set of three different encoding schemes that a speech encoder may use to produce the results shown in FIG. 17B.
FIG. 18B illustrates an operation of encoding two successive frames of a speech signal using a method M300 according to a general configuration.
FIG. 18C shows an application of an implementation M310 of method M300.
FIG. 19A shows a block diagram of an apparatus 100 according to a general configuration.
FIG. 19B shows a block diagram of an implementation 132 of the speech encoder 130.
Fig. 19C shows a block diagram of an implementation 142 of the spectral envelope description calculator 140.
FIG. 20A shows a flowchart of a test that may be performed by an implementation of coding scheme selector 120.
Fig. 20B shows a state diagram according to which another implementation of the coding scheme selector 120 may be configured to operate.
FIGS. 21A, 21B, and 21C show state diagrams according to which other implementations of coding scheme selector 120 may be configured to operate.
FIG. 22A shows a block diagram of an implementation 134 of speech encoder 132.
Fig. 22B shows a block diagram of an implementation 154 of the temporal information description calculator 152.
FIG. 23A shows a block diagram of an implementation 102 of apparatus 100, the implementation 102 being configured to encode a wideband speech signal according to a split-band coding scheme.
FIG. 23B shows a block diagram of an implementation 138 of the speech encoder 136.
FIG. 24A shows a block diagram of an implementation 139 of wideband speech encoder 136.
FIG. 24B shows a block diagram of an implementation 158 of the temporal description calculator 156.
FIG. 25A shows a flowchart of a method M200 of processing an encoded speech signal according to a general configuration.
FIG. 25B shows a flowchart of an implementation M210 of method M200.
FIG. 25C shows a flowchart of an implementation M220 of method M210.
FIG. 26 shows an application of method M200.
FIG. 27A illustrates the relationship between methods M100 and M200.
FIG. 27B illustrates the relationship between methods M300 and M200.
FIG. 28 shows an application of method M210.
FIG. 29 shows an application of method M220.
FIG. 30A illustrates the results of one implementation of iterative task T230.
FIG. 30B illustrates the results of another implementation of iterative task T230.
FIG. 30C illustrates the results of yet another implementation of iterative task T230.
FIG. 31 shows a portion of a state diagram for a speech decoder configured to perform an implementation of method M200.
FIG. 32A shows a block diagram of an apparatus 200 for processing an encoded speech signal according to a general configuration.
FIG. 32B shows a block diagram of an implementation 202 of apparatus 200.
FIG. 32C shows a block diagram of an implementation 204 of apparatus 200.
FIG. 33A shows a block diagram of an implementation 232 of the first module 230.
Fig. 33B shows a block diagram of an implementation 272 of a spectral envelope description decoder 270.
FIG. 34A shows a block diagram of an implementation 242 of the second module 240.
FIG. 34B shows a block diagram of an implementation 244 of the second module 240.
FIG. 34C shows a block diagram of an implementation 246 of the second module 242.
FIG. 35A shows a state diagram according to which an implementation of control logic 210 may be configured to operate.
Fig. 35B shows the results of one example of combining method M100 with DTX.
In the drawings and accompanying description, the same reference numbers refer to the same or similar elements or signals.
Detailed Description
The configurations described herein may be applied in a wideband speech coding system to support the use of a lower bit rate for inactive frames than for active frames and/or to improve the perceptual quality of the transmitted speech signal. It is expressly contemplated and hereby disclosed that such configurations may be adapted for use in packet-switched networks (e.g., wired and/or wireless networks arranged to carry voice transmissions according to protocols such as VoIP) and/or circuit-switched networks.
Unless expressly limited by its context, the term "calculating" is used herein to indicate any of its ordinary meanings, such as computing, evaluating, generating, and/or selecting from a set of values. Unless expressly limited by its context, the term "obtaining" is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Where the term "comprising" is used in the present description and claims, it does not exclude other elements or operations. The term "A is based on B" is used to indicate any of its ordinary meanings, including the cases (i) "A is based on at least B" and (ii) "A is equal to B" (if appropriate in the particular context).
Unless otherwise indicated, any disclosure of a speech encoder having a particular feature is also expressly intended to disclose a method of speech encoding having a similar feature (and vice versa), and any disclosure of a speech encoder according to a particular configuration is also expressly intended to disclose a method of speech encoding according to a similar configuration (and vice versa). Unless otherwise indicated, any disclosure of a speech decoder having a particular feature is also expressly intended to disclose a speech decoding method having a similar feature (and vice versa), and any disclosure of a speech decoder according to a particular configuration is also expressly intended to disclose a speech decoding method according to a similar configuration (and vice versa).
The frames of a speech signal are typically short enough that the spectral envelope of the signal can be expected to remain relatively stationary throughout the frame. One typical frame length is 20 milliseconds, but any frame length deemed suitable for the particular application may be used. A frame length of 20 milliseconds corresponds to 140 samples at a sampling rate of 7 kilohertz (kHz), 160 samples at a sampling rate of 8kHz, and 320 samples at a sampling rate of 16kHz, although any sampling rate deemed suitable for a particular application may be used. Another example of a sampling rate that may be used for speech encoding is 12.8kHz, and other examples include other rates within the range of 12.8kHz to 38.4 kHz.
Typically, all frames have the same length, and a consistent frame length is assumed in the specific examples described herein. However, it is also expressly contemplated and hereby disclosed that inconsistent frame lengths may be used. For example, implementations of methods M100 and M200 may also be used in applications that employ different frame lengths for active and inactive frames and/or for voiced and unvoiced frames.
In some applications, the frames are non-overlapping, while in other applications, an overlapping frame scheme is used. For example, speech coding devices typically use an overlapping frame scheme at the encoder and a non-overlapping frame scheme at the decoder. It is also possible for the encoder to use different frame schemes for different tasks. For example, a speech encoder or speech encoding method may encode a description of a spectral envelope of a frame using one overlapping frame scheme and encode a description of temporal information of a frame using a different overlapping frame scheme.
As mentioned above, it may be desirable to configure a speech encoder to use different coding modes and/or rates to encode active frames and inactive frames. To distinguish active frames from inactive frames, a speech encoder typically includes a voice activity detector or otherwise performs a method of detecting voice activity. Such a detector or method may be configured to classify a frame as active or inactive based on one or more factors such as frame energy, signal-to-noise ratio, periodicity, and zero-crossing rate. Such classification may include comparing a value or magnitude of such a factor to a threshold value and/or comparing the magnitude of a change in such a factor to a threshold value.
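A minimal sketch of such a classification, using two of the factors named above (frame energy and zero-crossing rate); the threshold values are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

def classify_frame(frame, energy_thresh=1e-4, zcr_thresh=0.5):
    """Classify a frame as 'active' or 'inactive' from energy and zero-crossing rate."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.mean(frame ** 2))
    if energy < energy_thresh:
        return "inactive"              # too little energy to be speech
    if zero_crossing_rate(frame) > zcr_thresh and energy < 10 * energy_thresh:
        return "inactive"              # noise-like: high ZCR at modest energy
    return "active"
```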
A voice activity detector or a method of detecting voice activity may also be configured to classify an active frame as one of two or more different types, such as voiced (e.g., representing a vowel sound), unvoiced (e.g., representing a fricative sound), or transitional (e.g., representing the beginning or end of a word). It may be desirable for a speech encoder to use different bit rates to encode different types of active frames. Although the particular example of FIG. 1 shows a series of active frames that are all encoded at the same bit rate, one of skill in the art will understand that the methods and apparatus described herein may also be used with speech encoders and speech encoding methods that are configured to encode active frames at different bit rates.
FIG. 2 shows one example of a decision tree that a speech encoder or speech encoding method may use to select the bit rate at which a particular frame is encoded according to the type of speech contained in the frame. In other cases, the bit rate selected for a particular frame may also depend on criteria such as a desired average bit rate, a desired pattern of bit rates over a series of frames that may be used to support the desired average bit rate, and/or a bit rate selected for a previous frame.
It may be desirable to use different coding modes to encode different types of speech frames. Frames of voiced speech tend to have a periodic structure that is long-term (i.e., persists for more than one frame period) and related to pitch, and it is typically more efficient to encode a voiced frame (or a sequence of voiced frames) using a coding mode that encodes a description of this long-term spectral feature. Examples of such coding modes include code-excited linear prediction (CELP) and prototype pitch period (PPP). Unvoiced frames and inactive frames, on the other hand, usually lack any significant long-term spectral feature, and a speech encoder may be configured to encode such frames using a coding mode that does not attempt to describe such a feature. Noise-excited linear prediction (NELP) is one example of such a coding mode.
A speech encoder or speech encoding method may be configured to select among different combinations of bit rates and encoding modes (also referred to as "encoding schemes"). For example, a speech encoder configured to perform an implementation of method M100 may use a full-rate CELP scheme for frames containing voiced speech and transition frames, a half-rate NELP scheme for frames containing unvoiced speech, and an eighth-rate NELP scheme for inactive frames. Other examples of such a speech encoder support multiple coding rates for one or more coding schemes, such as full-rate and half-rate CELP schemes and/or full-rate and quarter-rate PPP schemes.
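The selection among coding schemes described in this example might be tabulated as follows; the bit allocations reuse the per-frame sizes given above (171, 80, and 16 bits for full, half, and eighth rate), and the table itself is an illustrative sketch rather than a normative mapping.

```python
from enum import Enum

class FrameType(Enum):
    VOICED = "voiced"
    UNVOICED = "unvoiced"
    TRANSITIONAL = "transitional"
    INACTIVE = "inactive"

# Full-rate CELP for voiced and transition frames, half-rate NELP for
# unvoiced frames, and eighth-rate NELP for inactive frames (per the example above).
SCHEME_TABLE = {
    FrameType.VOICED:       ("CELP", 171),  # full rate
    FrameType.TRANSITIONAL: ("CELP", 171),  # full rate
    FrameType.UNVOICED:     ("NELP", 80),   # half rate
    FrameType.INACTIVE:     ("NELP", 16),   # eighth rate
}

def select_coding_scheme(frame_type):
    """Return (coding mode, bits per frame) for the given frame type."""
    return SCHEME_TABLE[frame_type]
```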
The transition from active speech to inactive speech typically occurs over a period of several frames, such that the first few frames of the speech signal after the transition may include remnants of the active speech, such as a voicing remnant. If a speech encoder encodes a frame having such remnants using a coding scheme that is intended for inactive frames, the encoded result may not accurately represent the original frame. Therefore it may be desirable to continue using a higher bit rate and/or an active coding mode for one or more of the frames that follow a transition from active frames to inactive frames.
FIG. 3 illustrates the result of encoding a region of a speech signal in which the higher bit rate rH continues to be used for several frames after the transition from active frames to inactive frames. The length of this continuation (also called a "hangover") may be selected according to an expected length of the transition and may be fixed or variable. For example, the length of the hangover may be based on one or more characteristics, such as signal-to-noise ratio, of one or more of the active frames that precede the transition. FIG. 3 illustrates a hangover of four frames.
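One way to realize such a hangover is a simple countdown that keeps the higher rate alive for a fixed number of frames after the last active frame; this sketch assumes a fixed four-frame hangover as in FIG. 3.

```python
def assign_rates(frame_is_active, hangover_frames=4):
    """Label each frame 'rH' or 'rL', extending rH through the hangover."""
    rates, countdown = [], 0
    for active in frame_is_active:
        if active:
            countdown = hangover_frames
            rates.append("rH")
        elif countdown > 0:
            countdown -= 1
            rates.append("rH")  # inactive, but still within the hangover
        else:
            rates.append("rL")
    return rates

# Two active frames followed by six inactive frames, with a four-frame hangover:
# ['rH', 'rH', 'rH', 'rH', 'rH', 'rH', 'rL', 'rL']
print(assign_rates([True, True, False, False, False, False, False, False]))
```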
The encoded frames typically contain a set of speech parameters from which corresponding frames of the speech signal can be reconstructed. This set of speech parameters typically includes spectral information, such as a description of the energy distribution over a spectrum within the frame. This energy distribution is also referred to as the "frequency envelope" or "spectral envelope" of the frame. Speech coders are typically configured to compute a description of a spectral envelope of a frame as an ordered sequence of values. In some cases, the speech encoder is configured to calculate the ordered sequence such that each value is indicative of an amplitude or magnitude of the signal at a corresponding frequency or over a corresponding spectral region. One example of this description is an ordered sequence of fourier transform coefficients.
In other cases, the speech encoder is configured to calculate the description of the spectral envelope as an ordered sequence of parameter values of a coding model (e.g., a set of coefficient values of a linear predictive coding (LPC) analysis). The ordered sequence of LPC coefficient values is typically arranged as one or more vectors, and a speech encoder may be implemented to calculate these values as filter coefficients or as reflection coefficients. The number of coefficient values in the set is also called the "order" of the LPC analysis, and examples of typical orders of an LPC analysis as performed by a speech encoder of a communications device such as a cellular telephone include 4, 6, 8, 10, 12, 16, 20, 24, 28, and 32.
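For illustration, one common way to obtain such a set of LPC coefficient values is the Levinson-Durbin recursion over the frame's autocorrelation sequence; this sketch is a generic textbook procedure and is not taken from this disclosure.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Compute LPC coefficients of the given order by Levinson-Durbin recursion.

    Returns the prediction-error filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    Assumes the frame has nonzero energy.
    """
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a
```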
Speech coding devices are typically configured to transmit a description of a spectral envelope in quantized form over a transmission channel (e.g., as one or more indices into a corresponding look-up table or "codebook"). Thus, the speech encoder may be required to compute a set of LPC coefficient values in a form that can be efficiently quantized, such as a set of values for Line Spectral Pairs (LSP), Line Spectral Frequencies (LSF), Immittance Spectral Pairs (ISP), Immittance Spectral Frequencies (ISF), cepstral coefficients, or log-area ratios. The speech encoder may also be configured to perform other operations on the ordered sequence of values prior to conversion and/or quantization, such as perceptual weighting.
In some cases, the description of the spectral envelope of a frame also includes a description of temporal information for the frame (e.g., as in an ordered sequence of Fourier transform coefficients). In other cases, the set of speech parameters of an encoded frame may also include a description of temporal information for the frame. The form of the description of temporal information may depend on the particular coding mode used to encode the frame. For some coding modes (e.g., a CELP coding mode), the description of temporal information may include a description of an excitation signal to be used by the speech decoder to excite an LPC model (e.g., as defined by the description of the spectral envelope). A description of the excitation signal typically appears in the encoded frame in quantized form (e.g., as one or more indices into corresponding codebooks). The description of temporal information may also include information relating to a pitch component of the excitation signal. For a PPP coding mode, for example, the encoded temporal information may include a description of a prototype of the pitch component to be used by the speech decoder to reproduce the excitation signal. A description of information relating to a pitch component typically appears in the encoded frame in quantized form (e.g., as one or more indices into corresponding codebooks).
For other coding modes (e.g., a NELP coding mode), the description of temporal information may include a description of the temporal envelope of the frame (also called the "energy envelope" or "gain envelope" of the frame). The description of the temporal envelope may include a value that is based on the average energy of the frame. Such a value is typically presented as a gain value to be applied to the frame during decoding, and is also called a "gain frame." In some cases, the gain frame is a normalization factor based on the ratio between (A) the energy E_orig of the original frame and (B) the energy E_synth of a frame synthesized from other parameters of the encoded frame (e.g., including the description of the spectral envelope). For example, the gain frame may be expressed as E_orig/E_synth or as the square root of E_orig/E_synth. Other aspects of gain frames and temporal envelopes are described in more detail in, for example, U.S. Patent Application Publication No. 2006/0282262, entitled "Systems, Methods, and Apparatus for Gain Factor Attenuation" (Vos et al.), published December 14, 2006.
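A minimal sketch of the two forms of gain frame just described (the ratio E_orig/E_synth and its square root); the function name and signature are assumptions for illustration.

```python
import numpy as np

def gain_frame(original, synthesized, use_sqrt=True):
    """Normalization factor based on the ratio E_orig / E_synth for one frame."""
    e_orig = float(np.sum(np.asarray(original, dtype=float) ** 2))
    e_synth = float(np.sum(np.asarray(synthesized, dtype=float) ** 2))
    ratio = e_orig / e_synth  # assumes the synthesized frame has nonzero energy
    return float(np.sqrt(ratio)) if use_sqrt else ratio
```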
Alternatively or additionally, the description of the temporal envelope may include relative energy values for each of a number of subframes of the frame. Such values are typically presented as gain values to be applied to the respective subframes during decoding, and are collectively called a "gain profile" or "gain shape." In some cases, the gain shape values are normalization factors, each based on the ratio between (A) the energy E_orig,i of original subframe i and (B) the energy E_synth,i of the corresponding subframe i of a frame synthesized from other parameters of the encoded frame (e.g., including the description of the spectral envelope). In such cases, the energy E_synth,i may be used to normalize the energy E_orig,i. For example, a gain shape value may be expressed as E_orig,i/E_synth,i or as the square root of E_orig,i/E_synth,i. One example of a description of a temporal envelope includes a gain frame and a gain shape, where the gain shape includes a value for each of five 4-millisecond subframes of a 20-millisecond frame. The gain values may be expressed on a linear scale or on a logarithmic (e.g., decibel) scale. Such features are described in more detail in, for example, U.S. Patent Application Publication No. 2006/0282262, referenced above.
In calculating the value of the gain frame (or a value of the gain shape), it may be desirable to apply a windowing function that overlaps adjacent frames (or subframes). Gain values produced in this manner are typically applied in an overlap-add fashion at the speech decoder, which may help to reduce or avoid discontinuities between frames or subframes. FIG. 4A shows a plot of a trapezoidal windowing function that may be used to calculate each of the gain shape values. In this example, the window overlaps each of the two adjacent subframes by one millisecond. FIG. 4B shows an application of this windowing function to each of the five subframes of a 20-millisecond frame. Other examples of windowing functions include functions having different overlap periods and/or different window shapes (e.g., rectangular or Hamming) that may be symmetrical or asymmetrical. It is also possible to calculate the values of a gain shape by applying different windowing functions to different subframes and/or by calculating different values of the gain shape over subframes of different lengths.
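The following sketch computes five gain shape values under a trapezoidal window of the kind shown in FIG. 4A. The 16-sample overlap corresponds to one millisecond at a 16 kHz sampling rate; that value, the zero-padding at the frame edges, and the square-root form are illustrative assumptions.

```python
import numpy as np

def trapezoidal_window(subframe_len, overlap):
    """Trapezoid with linear ramps of `overlap` samples around a flat top."""
    ramp = np.linspace(0.0, 1.0, overlap, endpoint=False)
    return np.concatenate([ramp, np.ones(subframe_len), ramp[::-1]])

def gain_shape(original, synthesized, n_subframes=5, overlap=16):
    """Per-subframe factors sqrt(E_orig,i / E_synth,i) under overlapping windows."""
    sub_len = len(original) // n_subframes
    win = trapezoidal_window(sub_len, overlap)
    pad_o = np.pad(np.asarray(original, dtype=float), overlap)   # zero-pad the edges
    pad_s = np.pad(np.asarray(synthesized, dtype=float), overlap)
    gains = np.empty(n_subframes)
    for i in range(n_subframes):
        lo, hi = i * sub_len, i * sub_len + sub_len + 2 * overlap
        e_orig = np.sum((pad_o[lo:hi] * win) ** 2)
        e_synth = np.sum((pad_s[lo:hi] * win) ** 2)
        gains[i] = np.sqrt(e_orig / e_synth)
    return gains
```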
A coded frame that includes a description of a temporal envelope typically includes this description in quantized form as one or more indices into the corresponding codebook, but in some cases an algorithm may be used to quantize and/or dequantize the gain frame and/or gain shape without using the codebook. One example of a description of a temporal envelope includes a quantization index having eight to twelve bits that specifies five gain shape values for a frame (e.g., one gain shape value for each of five consecutive subframes). This description may also include another quantization index that specifies the gain frame value for the frame.
As mentioned above, it may be desirable to transmit and receive voice signals having a frequency range that exceeds the PSTN frequency range of 300 to 3400 Hz. One approach to encoding such a signal is to encode the entire extended frequency range as a single frequency band. Such an approach may be implemented by scaling a narrowband speech coding technique (e.g., one configured to encode a PSTN-quality frequency range such as 0-4 kHz or 300-3400 Hz) to cover a wideband frequency range such as 0-8 kHz. For example, such an approach may include (A) sampling the speech signal at a higher rate to include components at high frequencies and (B) reconfiguring the narrowband coding technique to represent this wideband signal to a desired degree of accuracy. One such method of reconfiguring a narrowband coding technique is to use a higher-order LPC analysis (i.e., to produce coefficient vectors having more values). A wideband speech coder that encodes a wideband signal as a single frequency band is also called a "full-band" coder.
It may be desirable to implement a wideband speech coder such that at least the narrowband portion of the encoded signal may be sent through a narrowband channel (such as a PSTN channel) without the need to transcode or otherwise significantly modify the encoded signal. Such a feature may facilitate backward compatibility with networks and/or apparatus that only recognize narrowband signals. It may also be desirable to implement a wideband speech coder that uses different coding modes and/or rates for different frequency bands of the speech signal. Such a feature may be used to support increased coding efficiency and/or perceptual quality. A wideband speech coder that is configured to produce encoded frames having portions that represent different frequency bands of the wideband speech signal (e.g., separate sets of speech parameters, each set representing a different frequency band of the wideband speech signal) is also called a "split-band" coder.
FIG. 5A shows one example of a non-overlapping frequency band scheme that a split-band encoder may use to encode wideband speech content across a range of 0 Hz to 8 kHz. This scheme includes a first frequency band that extends from 0 Hz to 4 kHz (also called the narrowband range) and a second frequency band that extends from 4 kHz to 8 kHz (also called the extended, upper, or highband range). FIG. 5B shows one example of an overlapping frequency band scheme that a split-band encoder may use to encode wideband speech content across a range of 0 Hz to 7 kHz. This scheme includes a first frequency band that extends from 0 Hz to 4 kHz (the narrowband range) and a second frequency band that extends from 3.5 kHz to 7 kHz (the extended, upper, or highband range).
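The two band schemes of FIGS. 5A and 5B can be stated as plain data; the dictionary layout below is an illustrative convention, not a structure defined by this disclosure.

```python
# Band edges in Hz for the schemes of FIG. 5A (non-overlapping) and FIG. 5B (overlapping).
NON_OVERLAPPING_SCHEME = {"narrowband": (0, 4000), "highband": (4000, 8000)}
OVERLAPPING_SCHEME = {"narrowband": (0, 4000), "highband": (3500, 7000)}

def bands_covering(freq_hz, scheme):
    """Names of the bands in the scheme that contain the given frequency."""
    return [name for name, (lo, hi) in scheme.items() if lo <= freq_hz < hi]

# In the overlapping scheme, 3.8 kHz falls within both bands:
print(bands_covering(3800, OVERLAPPING_SCHEME))  # ['narrowband', 'highband']
```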
One particular example of a split-band encoder is configured to perform a tenth-order LPC analysis for the narrowband range and a sixth-order LPC analysis for the highband range. Other examples of frequency band schemes include those in which the narrowband range extends down only to about 300 Hz. Such a scheme may also include another frequency band that covers a lowband range from about 0 or 50 Hz up to about 300 or 350 Hz.
It may be desirable to reduce the average bit rate used to encode a wideband speech signal. For example, reducing the average bit rate required to support a particular service may allow for an increase in the number of users that the network can simultaneously serve. However, it is also desirable to accomplish this reduction without unduly degrading the perceptual quality of the corresponding decoded speech signal.
One possible approach to reducing the average bit rate of a wideband speech signal is to use a full-band wideband coding scheme to encode the inactive frames at a low bit rate. FIG. 6A illustrates the result of encoding a transition from active frames to inactive frames in this manner, where the active frames are encoded at a higher bit rate rH and the inactive frames are encoded at a lower bit rate rL. The label F indicates frames encoded using a full-band wideband coding scheme.
To achieve a sufficient reduction in the average bit rate, it may be desirable to encode the inactive frames using a very low bit rate. For example, it may be desirable to use a bit rate comparable to rates used to encode inactive frames in narrowband coders, such as 16 bits per frame ("eighth rate"). Unfortunately, such a small number of bits is typically insufficient to encode even an inactive frame of a wideband signal across the wideband range at an acceptable level of perceptual quality, and a full-band wideband coder that encodes inactive frames at such a rate is likely to produce a decoded signal having poor sound quality during the inactive frames. Such a signal may lack smoothness during inactive frames, for example, because the perceived loudness and/or spectral distribution of the decoded signal may change too much from one frame to the next. Smoothness of the decoded background noise is typically important to its perceived quality.
FIG. 6B illustrates another result of encoding a transition from active frames to inactive frames. In this case, a split-band wideband coding scheme is used to encode the active frames at a higher bit rate, and a full-band wideband coding scheme is used to encode the inactive frames at a lower bit rate. The labels H and N indicate the portions of each split-band-encoded frame that are encoded using a highband coding scheme and a narrowband coding scheme, respectively. As noted above, encoding inactive frames using a full-band wideband coding scheme and a low bit rate is likely to produce a decoded signal having poor sound quality during the inactive frames. Mixing split-band and full-band coding schemes may also increase the complexity of the coder, although such complexity may or may not affect the utility of the resulting implementation. Furthermore, while historical information from past frames is sometimes used to significantly increase coding efficiency (especially for coding of voiced frames), it may not be practical to apply historical information produced during operation of a split-band coding scheme within operation of a full-band coding scheme, and vice versa.
Another possible approach to reducing the average bit rate of a wideband speech signal is to use a split-band wideband coding scheme to encode the inactive frames at a low bit rate. FIG. 7A illustrates the result of encoding a transition from active frames to inactive frames in this manner, where the active frames are encoded at a higher bit rate rH using a full-band wideband coding scheme and the inactive frames are encoded at a lower bit rate rL using a split-band wideband coding scheme. FIG. 7B illustrates a related example in which the active frames are encoded using a split-band wideband coding scheme. As noted above with reference to FIGS. 6A and 6B, it may be desirable to encode the inactive frames using a bit rate comparable to the rates used to encode inactive frames in narrowband coders, such as 16 bits per frame ("eighth rate"). Unfortunately, such a small number of bits is typically insufficient for a split-band coding scheme to apportion among the different frequency bands such that a decoded wideband signal of acceptable quality may be achieved.
Yet another possible approach to reducing the average bit rate of a wideband speech signal is to encode the inactive frames as narrowband at a low bit rate. FIGS. 8A and 8B illustrate the results of encoding a transition from active frames to inactive frames in this manner, where the active frames are encoded at a higher bit rate rH using a wideband coding scheme and the inactive frames are encoded at a lower bit rate rL using a narrowband coding scheme. In the example of FIG. 8A, the active frames are encoded using a full-band wideband coding scheme; in the example of FIG. 8B, the active frames are encoded using a split-band wideband coding scheme.
Encoding the active frames using a high-bit-rate wideband coding scheme typically produces encoded frames in which the wideband background noise is well encoded. However, as in the examples of FIGS. 8A and 8B, encoding the inactive frames using only a narrowband coding scheme produces encoded frames that lack the extended frequencies. Thus the transition from a decoded wideband active frame to a decoded narrowband inactive frame may be quite audible and unpleasant, and this third possible approach may also produce suboptimal results.
FIG. 9 illustrates an operation of encoding three successive frames of a speech signal using a method M100 according to a general configuration. Task T110 encodes the first of the three frames (which may be active or inactive) at a first bit rate r1 (p bits per frame). Task T120 encodes the second frame, which follows the first frame and is inactive, at a second bit rate r2 (q bits per frame) that is different from r1. Task T130 encodes the third frame, which follows the second frame and is also inactive, at a third bit rate r3 (r bits per frame) that is less than r2. Method M100 is typically performed as part of a larger method of speech encoding, and speech encoders and methods of speech encoding that are configured to perform method M100 are expressly contemplated and hereby disclosed.
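As a sketch of one rate pattern that satisfies method M100 (the pattern of FIG. 10A below, where the first inactive frame after an active frame is encoded at r2 and later inactive frames at r3), the following assumes hypothetical `classify` and `encode_at` helpers:

```python
def encode_m100(frames, classify, encode_at):
    """Encode active frames at r1; encode the first inactive frame after an
    active frame at r2 (q bits) and subsequent inactive frames at r3 (r bits,
    with r < q), as in one implementation of method M100."""
    out, prev_active = [], True
    for frame in frames:
        if classify(frame):              # active frame
            out.append(encode_at(frame, "r1"))
            prev_active = True
        elif prev_active:                # first inactive frame: second encoded frame
            out.append(encode_at(frame, "r2"))
            prev_active = False
        else:                            # later inactive frames: third encoded frame
            out.append(encode_at(frame, "r3"))
    return out
```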
A corresponding speech decoder may be configured to use information from the second encoded frame to supplement its decoding of inactive frames from the third encoded frame. Elsewhere in this description, speech decoders and methods of decoding frames of a speech signal are disclosed that use information from the second encoded frame in decoding one or more subsequent inactive frames.
In the particular example shown in FIG. 9, the second frame immediately follows the first frame in the speech signal, and the third frame immediately follows the second frame in the speech signal. In other applications of method M100, the first and second frames may be separated in the speech signal by one or more inactive frames, and the second and third frames may likewise be separated in the speech signal by one or more inactive frames. In the particular example shown in FIG. 9, p is greater than q. Method M100 may also be implemented such that p is less than q. In the particular examples shown in FIGS. 10A-12B, the bit rates rH, rM, and rL correspond to the bit rates r1, r2, and r3, respectively.
FIG. 10A illustrates the result of encoding a transition from an active frame to an inactive frame using an implementation of method M100 as described above. In this example, the last active frame before the transition is encoded at the higher bit rate rH to generate the first of the three encoded frames, the first inactive frame after the transition is encoded at the intermediate bit rate rM to generate the second of the three encoded frames, and the next inactive frame is encoded at the lower bit rate rL to generate the last of the three encoded frames. In one particular case of this example, the bit rates rH, rM, and rL are full rate, half rate, and eighth rate, respectively.
As noted above, the transition from active speech to inactive speech typically occurs over a period of several frames, and the first several frames after the transition may include remnants of the active speech, such as a voicing remnant. If a speech encoder encodes a frame having such remnants using a coding scheme that is intended for inactive frames, the encoded result may not accurately represent the original frame. Thus it may be desirable to implement method M100 to avoid encoding a frame having such remnants as the second encoded frame.
FIG. 10B illustrates the result of encoding a transition from active frames to inactive frames using an implementation of method M100 that includes a hangover. This particular example of method M100 continues to use the bit rate rH for the first three inactive frames after the transition. In general, a hangover of any desired length may be used (e.g., in a range of from one or two frames to five or ten frames). The length of the hangover may be selected according to an expected length of the transition and may be fixed or variable. For example, the length of the hangover may be based on one or more characteristics, such as signal-to-noise ratio, of one or more of the active frames that precede the transition and/or of one or more of the frames within the hangover. In general, the label "first encoded frame" may be applied to the last active frame before the transition or to any of the inactive frames within the hangover.
It may be desirable to implement method M100 such that the bit rate r2 is used over a series of two or more consecutive inactive frames. FIG. 11A illustrates the result of encoding a transition from active frames to inactive frames using one such implementation of method M100. In this example, the first and last of the three encoded frames are separated by more than one frame that is encoded using the bit rate rM, such that the second encoded frame does not immediately follow the first encoded frame. A corresponding speech decoder may be configured to use information from the second encoded frame to decode the third encoded frame (and possibly one or more subsequent inactive frames).
It may be desirable for a speech decoder to use information from more than one encoded frame in decoding a subsequent inactive frame. With reference to the series shown in FIG. 11A, for example, a corresponding speech decoder may be configured to use information from both of the inactive frames encoded at the bit rate rM to decode the third encoded frame (and possibly one or more subsequent inactive frames).
It may be desirable in general for the second encoded frame to be representative of the inactive frames. Accordingly, method M100 may be implemented to produce the second encoded frame based on spectral information from more than one inactive frame of the speech signal. FIG. 11B illustrates the result of encoding a transition from active frames to inactive frames using such an implementation of method M100. In this example, the second encoded frame contains information that is averaged over a two-frame window of the speech signal. In other cases, the averaging window may have a length in a range of from two to about six or eight frames. The second encoded frame may include a description of a spectral envelope that is an average of descriptions of the spectral envelopes of the frames within the window (in this case, the corresponding inactive frame of the speech signal and the inactive frame that precedes it). The second encoded frame may include a description of temporal information that is based primarily or entirely on the corresponding frame of the speech signal. Alternatively, method M100 may be configured such that the second encoded frame includes a description of temporal information that is an average of descriptions of the temporal information of the frames within the window.
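A sketch of one plausible way to form such an averaged description, here an element-wise mean of the LSF vectors (descriptions of the spectral envelope) of the frames within the window; the representation and helper names are assumptions.

```python
import numpy as np

def averaged_envelope_description(lsf_history, window_len=2):
    """Average the most recent `window_len` LSF vectors to form the description
    of the spectral envelope carried by the second encoded frame."""
    window = np.asarray(lsf_history[-window_len:], dtype=float)
    return window.mean(axis=0)  # element-wise mean across the window
```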
FIG. 12A illustrates the result of encoding a transition from active frames to inactive frames using another implementation of method M100. In this example, the second encoded frame contains information that is averaged over a three-frame window, where the second encoded frame is encoded at the bit rate rM and the two preceding inactive frames are encoded at a different bit rate rH. In this particular example, the averaging window follows a three-frame hangover after the transition. In another example, method M100 may be implemented without such a hangover, or alternatively with a hangover that overlaps the averaging window. In general, the label "first encoded frame" may be applied to the last active frame before the transition, to any of the inactive frames within the hangover, or to any frame within the window that is encoded at a bit rate different from that of the second encoded frame.
In some cases, it may be desirable for an implementation of method M100 to use the bit rate r2 to encode an inactive frame only if the frame follows a series of consecutive active frames (also called a "talk spurt") having at least a minimum length. FIG. 12B illustrates the result of encoding a region of a speech signal using such an implementation of method M100. In this example, method M100 is implemented to use the bit rate rM to encode the first inactive frame after a transition from active frames to inactive frames, but only if the preceding talk spurt has a length of at least three frames. In such cases, the minimum talk spurt length may be fixed or variable. For example, it may be based on one or more characteristics, such as signal-to-noise ratio, of one or more of the active frames that precede the transition. Other such implementations of method M100 may also be configured to apply a hangover and/or an averaging window as described above.
FIGS. 10A-12B show applications of implementations of method M100 in which the bit rate r1 used to produce the first encoded frame is greater than the bit rate r2 used to produce the second encoded frame. However, the range of implementations of method M100 also includes ones in which the bit rate r1 is less than the bit rate r2. For example, in some cases an active frame such as a voiced frame may be largely redundant with respect to the preceding active frame, and it may be desirable to encode such a frame using a bit rate that is less than r2. FIG. 13A shows the result of encoding a sequence of frames according to such an implementation of method M100, where an active frame is encoded at a lower bit rate to produce the first of the set of three encoded frames.
Potential applications of method M100 are not limited to regions of a speech signal that include a transition from active frames to inactive frames. In some cases, it may be desirable to perform method M100 at some regular interval. For example, every n-th frame in a series of consecutive inactive frames may be encoded at the higher bit rate r2, where typical values of n include 8, 16, and 32. In other cases, method M100 may be initiated by an event. One example of such an event is a change in the quality of the background noise, which may be indicated by a change in a parameter that relates to spectral tilt, such as the value of the first reflection coefficient. FIG. 13B illustrates the result of encoding a series of inactive frames using such an implementation of method M100.
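A sketch combining the two triggers just described: a periodic refresh every n-th inactive frame, and an event trigger based on a change in spectral tilt as indicated by the first reflection coefficient k1 = r(1)/r(0). Sign conventions for k1 vary, and the threshold value here is an illustrative assumption.

```python
import numpy as np

def needs_refresh(frame_index, frame, prev_k1, n=16, k1_delta=0.2):
    """Return (refresh?, k1): refresh at every n-th inactive frame, or when the
    first reflection coefficient (a spectral-tilt indicator) changes enough."""
    frame = np.asarray(frame, dtype=float)
    r0 = float(np.dot(frame, frame))        # autocorrelation at lag 0
    r1 = float(np.dot(frame[:-1], frame[1:]))  # autocorrelation at lag 1
    k1 = r1 / r0 if r0 > 0.0 else 0.0
    return (frame_index % n == 0) or (abs(k1 - prev_k1) > k1_delta), k1
```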
As mentioned above, a wideband frame may be encoded using either a full-band coding scheme or a split-band coding scheme. A frame encoded as full-band contains a description of a single spectral envelope that extends over the entire wideband frequency range, while a frame encoded as split-band has two or more separate portions that represent information in different frequency bands (e.g., a narrowband range and a highband range) of the wideband speech signal. For example, typically each of these separate portions of a split-band-encoded frame contains a description of the spectral envelope of the speech signal over the corresponding frequency band. A split-band-encoded frame may contain one description of the frame's temporal information for the entire wideband frequency range, or each of the separate portions of the encoded frame may contain a description of the temporal information of the speech signal for the corresponding frequency band.
FIG. 14 shows an application of an implementation M110 of method M100. Method M110 includes an implementation T112 of task T110 that generates a first encoded frame based on a first of three frames of the speech signal. The first frame may be active or inactive, and the first encoded frame has a length of p bits. As shown in FIG. 14, task T112 is configured to generate the first encoded frame to contain a description of a spectral envelope over the first and second frequency bands. This description may be a single description that extends over both frequency bands, or it may include separate descriptions that each extend over a respective one of the frequency bands. Task T112 may also be configured to generate the first encoded frame to contain a description of temporal information (e.g., a temporal envelope) for the first and second frequency bands. This description may be a single description that extends over both frequency bands, or it may include separate descriptions that each extend over a respective one of the frequency bands.
Method M110 also includes an implementation T122 of task T120 that generates a second encoded frame based on a second of the three frames. The second frame is an inactive frame, and the second encoded frame has a length of q bits (where p and q are not equal). As shown in FIG. 14, task T122 is configured to generate the second encoded frame to contain a description of a spectral envelope over the first and second frequency bands. This description may be a single description that extends over both frequency bands, or it may include separate descriptions that each extend over a respective one of the frequency bands. In this particular example, the length in bits of the description of the spectral envelope contained in the second encoded frame is less than the length in bits of the description of the spectral envelope contained in the first encoded frame. Task T122 may also be configured to generate the second encoded frame to contain a description of temporal information (e.g., a temporal envelope) for the first and second frequency bands. This description may be a single description that extends over both frequency bands, or it may include separate descriptions that each extend over a respective one of the frequency bands.
Method M110 also includes an implementation T132 of task T130 that generates a third encoded frame based on the last of the three frames. The third frame is an inactive frame, and the third encoded frame has a length of r bits (where r is less than q). As shown in FIG. 14, task T132 is configured to generate the third encoded frame to contain a description of a spectral envelope over the first frequency band. In this particular example, the length in bits of the description of the spectral envelope contained in the third encoded frame is less than the length in bits of the description of the spectral envelope contained in the second encoded frame. Task T132 may also be configured to generate the third encoded frame to contain a description of temporal information (e.g., a temporal envelope) for the first frequency band.
The second frequency band is different from the first frequency band, but method M110 may be configured such that the two frequency bands overlap. Examples of the lower limit of the first frequency band include 0, 50, 100, 300, and 500 Hz, and examples of the upper limit of the first frequency band include 3, 3.5, 4, 4.5, and 5 kHz. Examples of the lower limit of the second frequency band include 2.5, 3, 3.5, 4, and 4.5 kHz, and examples of the upper limit of the second frequency band include 7, 7.5, 8, and 8.5 kHz. All 500 possible combinations of the above limits are expressly contemplated and hereby disclosed, and application of any such combination to any implementation of method M110 is also expressly contemplated and hereby disclosed. In one particular example, the first frequency band includes the range of about 50 Hz to about 4 kHz, and the second frequency band includes the range of about 4 kHz to about 7 kHz. In another particular example, the first frequency band includes the range of about 100 Hz to about 4 kHz, and the second frequency band includes the range of about 3.5 kHz to about 7 kHz. In a further particular example, the first frequency band includes the range of about 300 Hz to about 4 kHz, and the second frequency band includes the range of about 3.5 kHz to about 7 kHz. In these examples, the term "about" indicates plus or minus five percent, with the limits of the various frequency bands being indicated by the respective 3-dB points.
As mentioned above, for wideband applications a split-band coding scheme may have advantages over a full-band coding scheme, such as improved coding efficiency and support for backward compatibility. FIG. 15 shows an application of an implementation M120 of method M110 that uses a split-band coding scheme to generate the second encoded frame. Method M120 includes an implementation T124 of task T122 that has two subtasks T126a and T126b. Task T126a is configured to calculate a description of a spectral envelope over the first frequency band, and task T126b is configured to calculate a separate description of the spectral envelope over the second frequency band. A corresponding speech decoder (e.g., as described below) may be configured to calculate a decoded wideband frame based on information from the descriptions of the spectral envelope calculated by tasks T126b and T132.
Tasks T126a and T132 may be configured to calculate descriptions of the spectral envelope over the first frequency band that have the same length, or one of tasks T126a and T132 may be configured to calculate a description that is longer than the description calculated by the other task. Tasks T126a and T126b may also be configured to calculate separate descriptions of temporal information for the two frequency bands.
Task T132 may be configured such that the third encoded frame does not contain any description of the spectral envelope over the second frequency band. Alternatively, task T132 may be configured such that the third encoded frame contains a brief description of the spectral envelope over the second frequency band. For example, task T132 may be configured such that the third encoded frame contains a description of the spectral envelope over the second frequency band that has significantly fewer bits (e.g., no more than half of its length) than the description of the spectral envelope of the third frame over the first frequency band. In another example, task T132 is configured such that the third encoded frame contains a description of the spectral envelope over the second frequency band that has significantly fewer bits (e.g., no more than half of its length) than the description of the spectral envelope over the second frequency band that is computed by task T126 b. In one such example, task T132 is configured to generate the third encoded frame to contain a description of the spectral envelope over the second frequency band that includes only the spectral tilt value (e.g., the normalized first reflection coefficient).
It may be desirable to implement method M110 to generate the first encoded frame using a split-band coding scheme rather than a full-band coding scheme. FIG. 16 shows an application of an implementation M130 of method M120 that uses a split-band coding scheme to generate the first encoded frame. Method M130 includes an implementation T114 of task T110 that includes two subtasks T116a and T116b. Task T116a is configured to calculate a description of a spectral envelope over the first frequency band, and task T116b is configured to calculate a separate description of the spectral envelope over the second frequency band.
Tasks T116a and T126a may be configured to calculate descriptions of the spectral envelope over the first frequency band that have the same length, or one of tasks T116a and T126a may be configured to calculate a description that is longer than the description calculated by the other task. Tasks T116b and T126b may be configured to calculate descriptions of the spectral envelope over the second frequency band that have the same length, or one of tasks T116b and T126b may be configured to calculate a description that is longer than the description calculated by the other task. Tasks T116a and T116b may also be configured to calculate separate descriptions of temporal information for the two frequency bands.
FIG. 17A illustrates the result of encoding a transition from an active frame to an inactive frame using an implementation of method M130. In this particular example, the portions of the first and second encoded frames representing the second frequency band have the same length, and the portions of the second and third encoded frames representing the first frequency band have the same length.
It may be desirable for the portion of the second encoded frame representing the second frequency band to have a greater length than the corresponding portion of the first encoded frame. The low and high frequency ranges of the active frame are more likely to be correlated with each other (especially if the active frame is voiced) than the low and high frequency ranges of the inactive frame, which contains background noise. Thus, the high frequency range of the inactive frame may convey relatively more information of the frame than the high frequency range of the active frame, and may require a greater number of bits to be used to encode the high frequency range of the inactive frame.
Fig. 17B illustrates the result of encoding a transition from an active frame to an inactive frame using another implementation of method M130. In this case, the portion of the second encoded frame representing the second frequency band is longer (i.e., has more bits) than the corresponding portion of the first encoded frame. This particular example also shows the case where the portion of the second encoded frame representing the first frequency band is longer than the corresponding portion of the third encoded frame, but another implementation of method M130 may be configured to encode the frame such that these two portions have the same length (e.g., as shown in fig. 17A).
A typical example of method M100 is configured to encode the second frame using a wideband NELP mode (which may be full-band as shown in FIG. 14, or split-band as shown in FIGS. 15 and 16) and to encode the third frame using a narrowband NELP mode. The table of FIG. 18 shows a set of three different coding schemes that a speech encoder may use to produce the results shown in FIG. 17B. In this example, voiced frames are encoded using a full-rate wideband CELP coding scheme ("coding scheme 1"). This coding scheme uses 153 bits to encode the narrowband portion of the frame and 16 bits to encode the highband portion. For the narrowband, coding scheme 1 uses 28 bits to encode a description of the spectral envelope (e.g., as one or more quantized LSP vectors) and 125 bits to encode a description of the excitation signal. For the highband, coding scheme 1 uses 8 bits to encode a description of the spectral envelope (e.g., as one or more quantized LSP vectors) and 8 bits to encode a description of the temporal envelope.
It may be desirable to configure coding scheme 1 to derive the highband excitation signal from the narrowband excitation signal, such that no bits of the encoded frame are needed to carry the highband excitation signal. It may also be desirable to configure coding scheme 1 to calculate the highband temporal envelope relative to the temporal envelope of the highband signal as synthesized from other parameters of the encoded frame (e.g., including a description of the spectral envelope over the second frequency band). Such features are described in more detail in, for example, U.S. Patent Application Publication No. 2006/0282262, referenced above.
Compared to voiced speech signals, unvoiced speech signals generally contain more information in the highband that is important to speech understanding. Thus it may be desirable to use more bits to encode the highband portion of an unvoiced frame than to encode the highband portion of a voiced frame, even for a case in which the voiced frame is encoded using a higher overall bit rate. In the example according to the table of FIG. 18, unvoiced frames are encoded using a half-rate wideband NELP coding scheme ("coding scheme 2"). Whereas coding scheme 1 uses 16 bits to encode the highband portion of a voiced frame, this coding scheme uses 27 bits to encode the highband portion of the frame: 12 bits to encode a description of the spectral envelope (e.g., as one or more quantized LSP vectors) and 15 bits to encode a description of the temporal envelope (e.g., as a quantized gain frame and/or gain shape). To encode the narrowband portion, coding scheme 2 uses 47 bits: 28 bits to encode a description of the spectral envelope (e.g., as one or more quantized LSP vectors) and 19 bits to encode a description of the temporal envelope (e.g., as a quantized gain frame and/or gain shape).
The scheme set of FIG. 18 uses an eighth-rate narrowband NELP coding scheme ("coding scheme 3") to encode inactive frames at a rate of sixteen bits per frame, with 10 bits used to encode a description of the spectral envelope (e.g., as one or more quantized LSP vectors) and 5 bits used to encode a description of the temporal envelope (e.g., as a quantized gain frame and/or gain shape). Another example of coding scheme 3 uses 8 bits to encode the description of the spectral envelope and 6 bits to encode the description of the temporal envelope.
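The bit allocations described above for the table of FIG. 18 may be summarized as follows; this is a plain data sketch in Python, the dictionary layout is illustrative, and the values are taken from the description above.

```python
# Bit allocations of the three coding schemes described with reference
# to FIG. 18 (values from the text above; layout is illustrative).
CODING_SCHEMES = {
    1: {  # full-rate wideband CELP, for voiced frames
        "narrowband": {"spectral_envelope": 28, "excitation": 125},
        "highband":   {"spectral_envelope": 8,  "temporal_envelope": 8},
    },
    2: {  # half-rate wideband NELP, for unvoiced frames
        "narrowband": {"spectral_envelope": 28, "temporal_envelope": 19},
        "highband":   {"spectral_envelope": 12, "temporal_envelope": 15},
    },
    3: {  # eighth-rate narrowband NELP, for inactive frames (16 bits/frame)
        "narrowband": {"spectral_envelope": 10, "temporal_envelope": 5},
        "highband":   None,  # no highband description
    },
}
```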
A speech encoder or method of speech encoding may be configured to perform an implementation of method M130 using a set of coding schemes as shown in FIG. 18. For example, such an encoder or method may be configured to use coding scheme 2, rather than coding scheme 3, to generate the second encoded frame. Various implementations of such an encoder or method may be configured to produce results as shown in FIGS. 10A-13B, with coding scheme 1 corresponding to bit rate rH, coding scheme 2 to bit rate rM, and coding scheme 3 to bit rate rL.
For a case in which an implementation of method M130 is performed using a set of coding schemes as shown in FIG. 18, the encoder or method is configured to use the same coding scheme (scheme 2) to generate the second encoded frame as it uses to generate encoded unvoiced frames. In other cases, an encoder or method configured to perform an implementation of method M100 may be configured to encode the second frame using a dedicated coding scheme (i.e., a coding scheme that the encoder or method does not also use to encode active frames).
An implementation of method M130 that uses a set of coding schemes as shown in FIG. 18 is configured to use the same coding mode (i.e., NELP) to generate the second and third encoded frames, although it is possible to use different versions of that coding mode (e.g., versions that differ in how the gains are calculated) to generate the two encoded frames. Configurations of method M100 that use different coding modes to generate the second and third encoded frames (e.g., using a CELP mode to generate the second encoded frame) are also expressly contemplated and hereby disclosed. Further configurations of method M100 that generate the second encoded frame using a split-band wideband mode with different coding modes for different frequency bands (e.g., CELP for the lower frequency band and NELP for the higher frequency band, or vice versa) are also expressly contemplated and hereby disclosed. Speech encoders and methods of speech encoding configured to perform such implementations of method M100 are also expressly contemplated and hereby disclosed.
In a typical application of an implementation of method M100, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions) embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other non-volatile memory cards, semiconductor memory chips, etc.) that is readable and/or executable by a machine (e.g., a computer) that includes an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of method M100 may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communication, such as a cellular telephone or other device having such communication capabilities. Such a device may be configured to communicate with a circuit-switched and/or packet-switched network (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to transmit encoded frames.
FIG. 18B illustrates an operation of encoding two consecutive frames of a speech signal using a method M300 according to a general configuration that includes tasks T120 and T130 as described herein. (Although this implementation of method M300 processes only two frames, the labels "second frame" and "third frame" continue to be used for convenience.) In the particular example shown in FIG. 18B, the third frame immediately follows the second frame. In other applications of method M300, the second and third frames may be separated in the speech signal by one inactive frame or by a consecutive series of two or more inactive frames. In a further application of method M300, the third frame may be any inactive frame of the speech signal other than the second frame. In another general application of method M300, the second frame may be active or inactive. In a further general application of method M300, the second frame may be active or inactive, and the third frame may be active or inactive. FIG. 18C shows an application of an implementation M310 of method M300 in which tasks T120 and T130 are implemented as tasks T122 and T132, respectively, as described herein. In a further implementation of method M300, task T120 is implemented as task T124 as described herein. In such cases, it may be desirable for task T132 to be configured such that the third encoded frame does not contain any description of the spectral envelope over the second frequency band.
FIG. 19A shows a block diagram of an apparatus 100 configured to perform a method of speech encoding that includes an implementation of method M100 as described herein and/or an implementation of method M300 as described herein. Apparatus 100 includes a voice activity detector 110, a coding scheme selector 120, and a speech encoder 130. Voice activity detector 110 is configured to receive frames of a speech signal and to indicate, for each frame to be encoded, whether the frame is active or inactive. Coding scheme selector 120 is configured to select a coding scheme for each frame to be encoded in response to the indications of voice activity detector 110. Speech encoder 130 is configured to generate an encoded frame that is based on a frame of the speech signal according to the selected coding scheme. A communications device that includes apparatus 100, such as a cellular telephone, may be configured to perform further processing operations on the encoded frames, such as error-correction and/or redundancy coding, before transmitting them into a wired, wireless, or optical transmission channel.
Voice activity detector 110 is configured to indicate whether each frame to be encoded is active or inactive. This indication may be a binary signal, such that one state of the signal indicates that the frame is active and the other state indicates that the frame is inactive. Alternatively, the indication may be a signal that has more than two states, such that it may indicate more than one type of active and/or inactive frame. For example, it may be desirable to configure detector 110 to indicate whether an active frame is voiced or unvoiced; or to classify active frames as transitional, voiced, or unvoiced; and possibly even to classify a transitional frame as an upward transition or a downward transition. A corresponding implementation of coding scheme selector 120 is configured to select a coding scheme for each frame to be encoded in response to these indications.
Voice activity detector 110 may be configured to indicate whether a frame is active or inactive based on one or more characteristics of the frame, such as energy, signal-to-noise ratio, periodicity, zero-crossing rate, spectral distribution (as evaluated using, for example, one or more LSFs, LSPs, and/or reflection coefficients), and the like. To generate the indication, detector 110 may be configured to perform an operation on each of one or more of such characteristics, such as comparing a value or magnitude of the characteristic to a threshold and/or comparing the magnitude of a change in a value or magnitude of the characteristic to a threshold, where the threshold may be fixed or adaptive.
An implementation of voice activity detector 110 may be configured to evaluate the energy of the current frame and to indicate that the frame is inactive if the energy value is less than (or, alternatively, not greater than) a threshold. Such a detector may be configured to calculate the frame energy as a sum of squares of the frame samples. Another implementation of voice activity detector 110 is configured to evaluate the energy of the current frame in each of a low frequency band and a high frequency band, and to indicate that the frame is inactive if the energy value in each band is less than (or, alternatively, not greater than) a respective threshold. Such a detector may be configured to calculate the frame energy in a band by applying a passband filter to the frame and calculating a sum of squares of the samples of the filtered frame.
As mentioned above, implementations of voice activity detector 110 may be configured to use one or more thresholds. Each of these values may be fixed or adaptive. The adaptive threshold may be based on one or more factors, such as a noise level of the frame or band, a signal-to-noise ratio of the frame or band, a desired encoding rate, and so forth. In one example, the threshold for each of the low band (e.g., 300Hz to 2kHz) and the high band (e.g., 2kHz to 4kHz) is based on an estimate of the background noise level of the previous frame in that band, the signal-to-noise ratio of the previous frame in that band, and the required average data rate.
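As a minimal sketch of such a detector (assuming per-band energies estimated in the DFT domain rather than by the time-domain passband filtering described above, and with illustrative band edges and threshold factors):

```python
import numpy as np

def band_energy(frame, fs, band):
    """Estimate the energy of `frame` within `band` (in Hz) from its DFT."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return float(np.sum(np.abs(spectrum[mask]) ** 2)) / len(frame)

def frame_is_inactive(frame, fs=8000, noise_estimates=(1.0, 1.0),
                      snr_factor=4.0):
    """Declare the frame inactive only if the energy in BOTH the low
    band (300 Hz to 2 kHz) and the high band (2 kHz to 4 kHz) is below
    an adaptive threshold derived from the running background-noise
    estimate for that band."""
    bands = ((300.0, 2000.0), (2000.0, 4000.0))
    for band, noise in zip(bands, noise_estimates):
        if band_energy(frame, fs, band) >= snr_factor * noise:
            return False  # sufficient energy in this band: frame is active
    return True
```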
Coding scheme selector 120 is configured to select a coding scheme for each frame to be encoded in response to an indication by voice activity detector 110. The coding scheme selection may be based on an indication from voice activity detector 110 for the current frame and/or based on an indication from voice activity detector 110 for each of one or more previous frames. In some cases, coding scheme selection is also based on an indication from voice activity detector 110 for each of one or more subsequent frames.
FIG. 20A shows a flowchart of a test that may be performed by an implementation of coding scheme selector 120 to obtain the results shown in FIG. 10A. In this example, selector 120 is configured to select higher-rate coding scheme 1 for voiced frames, lower-rate coding scheme 3 for inactive frames, and intermediate-rate coding scheme 2 for unvoiced frames and for the first inactive frame after a transition from active frames to inactive frames. In such an application, coding schemes 1 to 3 may be the three schemes shown in FIG. 18.
An alternative implementation of coding scheme selector 120 may be configured to operate according to the state diagram of FIG. 20B to obtain equivalent results. In this figure, the label "A" indicates a state transition in response to an active frame, the label "I" indicates a state transition in response to an inactive frame, and the label of each state indicates the coding scheme selected for the current frame. In this case, the state label "scheme 1/2" indicates that either coding scheme 1 or coding scheme 2 is selected for the current active frame, depending on whether that frame is voiced or unvoiced. Those of ordinary skill in the art will appreciate that in an alternative implementation, this state may be configured such that the coding scheme selector supports only one coding scheme for active frames (e.g., coding scheme 1). In another alternative implementation, this state may be configured such that the coding scheme selector selects from among more than two different coding schemes for active frames (e.g., selects a different coding scheme for each of voiced, unvoiced, and transitional frames).
As mentioned above with reference to FIG. 12B, it may be desirable for a speech encoder to encode an inactive frame at the higher bit rate r2 only if the most recent active frame is part of a talk spurt having at least a minimum length. An implementation of coding scheme selector 120 may be configured to operate according to the state diagram of FIG. 21A to obtain the results shown in FIG. 12B. In this particular example, the selector is configured to select coding scheme 2 for an inactive frame only if the inactive frame immediately follows a string of consecutive active frames having a length of at least three frames. In this case, the state label "scheme 1/2" indicates that either coding scheme 1 or coding scheme 2 is selected for the current active frame, depending on whether that frame is voiced or unvoiced. Those of ordinary skill in the art will appreciate that in an alternative implementation, these states may be configured such that the coding scheme selector supports only one coding scheme for active frames (e.g., coding scheme 1). In another alternative implementation, these states may be configured such that the coding scheme selector selects from among more than two different coding schemes for active frames (e.g., selects a different scheme for each of voiced, unvoiced, and transitional frames).
As mentioned above with reference to FIGS. 10B and 12A, it may be desirable for a speech encoder to apply a deferral (i.e., to continue to use the higher bit rate for one or more inactive frames after a transition from active frames to inactive frames). An implementation of coding scheme selector 120 may be configured to operate according to the state diagram of FIG. 21B to apply a deferral having a length of three frames. In this figure, the deferral states are labeled "scheme 1(2)" to indicate that either coding scheme 1 or coding scheme 2 is indicated for the current inactive frame, depending on the scheme selected for the most recent active frame. Those of ordinary skill in the art will appreciate that in an alternative implementation, the coding scheme selector may support only one coding scheme for active frames (e.g., coding scheme 1). In another alternative implementation, the deferral states may be configured to continue to indicate one of more than two different coding schemes (e.g., for a case in which different schemes are supported for voiced, unvoiced, and transitional frames). In a further alternative implementation, one or more of the deferral states may be configured to indicate a fixed scheme (e.g., scheme 1), even if a different scheme (e.g., scheme 2) was selected for the most recent active frame.
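The following generator is a minimal sketch of a selector with such a deferral, along the lines of the state diagram of FIG. 21B; the representation of frames as (active, voiced) flags and the three-frame deferral length are assumptions made for illustration.

```python
def select_coding_schemes(frames, deferral=3):
    """Yield a coding scheme (1, 2, or 3) for each frame. The first
    `deferral` inactive frames after a talk spurt keep the scheme of
    the most recent active frame (the "scheme 1(2)" deferral states);
    subsequent inactive frames drop to the eighth-rate scheme 3.

    frames: iterable of (is_active, is_voiced) flag pairs.
    """
    remaining = 0            # deferral frames left to emit
    last_active_scheme = 1
    for is_active, is_voiced in frames:
        if is_active:
            last_active_scheme = 1 if is_voiced else 2
            remaining = deferral
            yield last_active_scheme
        elif remaining > 0:
            remaining -= 1
            yield last_active_scheme
        else:
            yield 3

# A voiced talk spurt of three frames followed by five inactive frames
# yields [1, 1, 1, 1, 1, 1, 3, 3].
frames = [(True, True)] * 3 + [(False, False)] * 5
schemes = list(select_coding_schemes(frames))
```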
As mentioned above with reference to FIGS. 11B and 12A, it may be desirable for a speech encoder to generate the second encoded frame based on information averaged over more than one inactive frame of the speech signal. An implementation of coding scheme selector 120 may be configured to operate according to the state diagram of FIG. 21C to support such a result. In this particular example, the selector is configured to direct the encoder to generate the second encoded frame based on information averaged over three inactive frames. The state labeled "scheme 2 (start average)" indicates to the encoder that the current frame is to be encoded using scheme 2 and is also to be used to begin calculating a new average (e.g., an average of descriptions of the spectral envelope). The state labeled "scheme 2 (for average)" indicates to the encoder that the current frame is to be encoded using scheme 2 and is also to be used to continue calculating the average. The state labeled "send average, scheme 2" indicates to the encoder that the current frame is to be used to complete the average, which is then to be sent using scheme 2. Those of ordinary skill in the art will appreciate that alternative implementations of coding scheme selector 120 may be configured to indicate averaging of information over different numbers of inactive frames and/or to indicate the use of different schemes.
FIG. 19B shows a block diagram of an implementation 132 of speech encoder 130 that includes a spectral envelope description calculator 140, a temporal information description calculator 150, and a formatter 160. Spectral envelope description calculator 140 is configured to calculate a description of a spectral envelope for each frame to be encoded. Temporal information description calculator 150 is configured to calculate a description of temporal information for each frame to be encoded. Formatter 160 is configured to generate an encoded frame that includes the calculated description of the spectral envelope and the calculated description of the temporal information. Formatter 160 may be configured to generate the encoded frame according to a desired packet format, possibly using different formats for different coding schemes. Formatter 160 may be configured to generate the encoded frame to include additional information, such as a set of one or more bits that identifies the coding scheme, or the rate or mode, according to which the frame is encoded (also referred to as a "coding index").
The spectral envelope description calculator 140 is configured to calculate a description of the spectral envelope for each frame to be encoded according to the encoding scheme indicated by the encoding scheme selector 120. The description is based on the current frame and may also be based on at least a portion of one or more other frames. For example, calculator 140 may be configured to apply a window extending into one or more adjacent frames and/or calculate an average of the descriptions of two or more frames (e.g., an average of the LSP vectors).
The calculator 140 may be configured to calculate a description of a spectral envelope of a frame by performing a spectral analysis, such as an LPC analysis. Fig. 19C shows a block diagram of an implementation 142 of the spectral envelope description calculator 140, the implementation 142 comprising an LPC analysis module 170, a transform block 180, and a quantizer 190. The analysis module 170 is configured to perform an LPC analysis of the frame and generate a corresponding set of model parameters. For example, the analysis module 170 may be configured to generate a vector of LPC coefficients, such as filter coefficients or reflection coefficients. The analysis module 170 may be configured to perform analysis over a window comprising portions of one or more neighboring frames. In some cases, analysis module 170 is configured to select an order of analysis (e.g., a number of elements in a coefficient vector) according to the coding scheme indicated by coding scheme selector 120.
Transform block 180 is configured to convert the set of model parameters into a form that is more efficient for quantization. For example, transform block 180 may be configured to convert the LPC coefficient vector to a set of LSPs. In some cases, transform block 180 is configured to convert the set of LPC coefficients to a particular form according to the coding scheme indicated by coding scheme selector 120.
The quantizer 190 is configured to generate a description of the spectral envelope in quantized form by quantizing the transformed set of model parameters. Quantizer 190 may be configured to quantize the converted set by truncating elements of the converted set and/or by selecting one or more quantization table indices to represent the converted set. In some cases, quantizer 190 is configured to quantize the converted set to a particular form and/or length according to the coding scheme indicated by coding scheme selector 120 (e.g., as discussed above with reference to fig. 18).
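A compact sketch of this analysis, transform, and quantization chain follows. The Levinson-Durbin recursion and the sum/difference-polynomial LSF conversion are standard techniques; the toy scalar quantizer merely stands in for the vector quantization that quantizer 190 would typically perform, and all names and bit counts are illustrative.

```python
import numpy as np

def lpc_analysis(frame, order=10):
    """LPC analysis module 170: autocorrelation method, solved by the
    Levinson-Durbin recursion. Returns [1, a1, ..., a_order]."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_to_lsf(a):
    """Transform block 180: convert LPC coefficients to line spectral
    frequencies via the roots of the sum and difference polynomials."""
    p = np.append(a, 0.0) + np.append(0.0, a[::-1])
    q = np.append(a, 0.0) - np.append(0.0, a[::-1])
    angles = np.angle(np.concatenate([np.roots(p), np.roots(q)]))
    eps = 1e-6  # drop the trivial roots at z = 1 and z = -1
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])

def quantize_lsf(lsf, bits_per_value=3):
    """Quantizer 190 (toy scalar version): round each LSF to a uniform
    grid and return the grid indices."""
    levels = 2 ** bits_per_value
    step = np.pi / levels
    return np.clip(np.round(lsf / step).astype(int), 0, levels - 1)
```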
The temporal information description calculator 150 is configured to calculate a description of temporal information for a frame. The description may likewise be based on temporal information of at least a portion of one or more other frames. For example, calculator 150 may be configured to calculate descriptions over windows extending into one or more adjacent frames and/or calculate an average of the descriptions of two or more frames.
The temporal information description calculator 150 may be configured to calculate a description of the temporal information having a particular form and/or length according to the coding scheme indicated by the coding scheme selector 120. For example, calculator 150 may be configured to calculate a description of the temporal information according to the selected coding scheme, the description including one or both of: (A) a temporal envelope of the frame; and (B) an excitation signal for the frame, which may include a description of the pitch component (e.g., pitch lag (also referred to as delay), pitch gain, and/or a description of the prototype).
The calculator 150 may be configured to calculate a description of the temporal information, which includes a temporal envelope (e.g., gain frame values and/or gain shape values) of the frame. For example, the calculator 150 may be configured to output such a description in response to an indication of a NELP encoding scheme. As described herein, computing this description may include computing the signal energy over a frame or subframe as a sum of squares of signal samples, computing the signal energy over a window that includes portions of other frames and/or subframes, and/or quantizing the computed temporal envelope.
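For example, a gain-frame/gain-shape description of this kind might be computed as follows; this is a sketch, and the subframe count and the normalization are illustrative assumptions.

```python
import numpy as np

def temporal_envelope_description(frame, n_subframes=4):
    """Compute a gain frame value (total energy of the frame as a sum
    of squared samples) and gain shape values (per-subframe energies
    normalized by the frame energy). Assumes the frame length is
    divisible by n_subframes."""
    gain_frame = float(np.sum(frame ** 2))
    subframes = frame.reshape(n_subframes, -1)
    gain_shape = np.sum(subframes ** 2, axis=1) / max(gain_frame, 1e-12)
    return gain_frame, gain_shape
```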
Calculator 150 may be configured to calculate a description of temporal information for a frame, including information related to the pitch or periodicity of the frame. For example, calculator 150 may be configured to output a description including pitch information (e.g., pitch lag and/or pitch gain) for the frame in response to the indication of the CELP encoding scheme. Alternatively or additionally, the calculator 150 may be configured to output a description including a periodic waveform (also referred to as a "prototype") in response to an indication of the PPP encoding scheme. Computing pitch and/or prototype information typically includes extracting this information from the LPC residual and may also include combining pitch and/or prototype information from the current frame with this information from one or more past frames. Calculator 150 may also be configured to quantize this description of time information (e.g., into one or more table indices).
Calculator 150 may be configured to calculate a description of temporal information for a frame that includes an excitation signal. For example, calculator 150 may be configured to output a description that includes an excitation signal in response to an indication of a CELP coding scheme. Calculating the excitation signal typically includes deriving the signal from the LPC residual and may also include combining excitation information from the current frame with such information from one or more past frames. Calculator 150 may also be configured to quantize this description of temporal information (e.g., as one or more table indices). For a case in which speech encoder 132 supports a relaxed CELP (RCELP) coding scheme, calculator 150 may be configured to regularize the excitation signal.
FIG. 22A shows a block diagram of an implementation 134 of a speech encoder 132, the implementation 134 including an implementation 152 of a temporal information description calculator 150. The calculator 152 is configured to calculate a description of temporal information (e.g., excitation signal, pitch, and/or prototype information) for a frame based on a description of the spectral envelope of the frame as calculated by the spectral envelope description calculator 140.
FIG. 22B shows a block diagram of an implementation 154 of the temporal information description calculator 152, the implementation 154 configured to calculate a description of temporal information based on LPC residuals for frames. In this example, the calculator 154 is arranged to receive a description of the spectral envelope of the frame as calculated by the spectral envelope description calculator 142. Dequantizer a10 is configured to dequantize the description, and inverse transform block a20 is configured to apply an inverse transform to the dequantized description in order to obtain a set of LPC coefficients. The whitening filter a30 is configured according to the set of LPC coefficients and is arranged to filter the speech signal to generate an LPC residual. Quantizer a40 is configured to quantize a description of temporal information for a frame (e.g., to one or more table indices) that is based on an LPC residual and possibly also based on pitch information for the frame and/or temporal information from one or more past frames.
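As a sketch of the residual computation performed by whitening filter A30 (assuming LPC coefficients in prediction-error-filter form [1, a1, ..., am]):

```python
import numpy as np
from scipy.signal import lfilter

def lpc_residual(speech_frame, lpc_coeffs):
    """Whitening filter A(z): apply the all-zero (FIR) filter defined by
    the dequantized LPC coefficients to the speech frame to obtain the
    LPC residual on which the temporal-information description is based."""
    return lfilter(lpc_coeffs, [1.0], speech_frame)
```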
It may be desirable to use an implementation of speech encoder 132 to encode frames of a wideband speech signal according to a split-band coding scheme. In such a case, spectral envelope description calculator 140 may be configured to calculate the various descriptions of the spectral envelope of a frame over the respective frequency bands serially and/or in parallel, and possibly according to different coding modes and/or rates. Temporal information description calculator 150 may also be configured to calculate the descriptions of temporal information of the frame over the various frequency bands serially and/or in parallel, and possibly according to different coding modes and/or rates.
FIG. 23A shows a block diagram of an implementation 102 of apparatus 100 that is configured to encode a wideband speech signal according to a split-band coding scheme. Apparatus 102 includes a filter bank A50 configured to filter a speech signal to produce a subband signal containing content of the speech signal over a first frequency band (e.g., a narrowband signal) and a subband signal containing content of the speech signal over a second frequency band (e.g., a highband signal). Particular examples of such filter banks are described in, for example, U.S. Patent Application Publication No. 2007/0088558 (Vos et al.), entitled "SYSTEMS, METHODS, AND APPARATUS FOR SPEECH SIGNAL FILTERING," published April 19, 2007. For example, filter bank A50 may include a lowpass filter configured to filter the speech signal to produce the narrowband signal and a highpass filter configured to filter the speech signal to produce the highband signal. Filter bank A50 may also include a downsampler configured to reduce the sampling rate of the narrowband signal and/or of the highband signal according to a desired respective decimation factor, as described in, for example, U.S. Patent Application Publication No. 2007/0088558 (Vos et al.). Apparatus 102 may also be configured to perform a noise suppression operation on at least the highband signal, such as a highband burst suppression operation as described in U.S. Patent Application Publication No. 2007/0088541 (Vos et al.), entitled "SYSTEMS, METHODS, AND APPARATUS FOR HIGHBAND BURST SUPPRESSION," published April 19, 2007.
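A minimal sketch of such a filter bank follows; the filter order, cutoff frequency, and decimation factor are illustrative, and the naive decimation of the highband branch (which aliases the 4-8 kHz content down to baseband) merely stands in for the spectral translation that a practical filter bank, such as those in the cited publication, would perform.

```python
from scipy.signal import butter, lfilter

def split_subbands(speech, fs=16000, split_hz=4000.0, decimation=2):
    """Split a wideband speech signal into narrowband and highband
    subband signals (cf. filter bank A50) and reduce the sampling rate
    of each branch by the given decimation factor."""
    nyquist = fs / 2.0
    b_lo, a_lo = butter(6, split_hz / nyquist, btype="low")
    b_hi, a_hi = butter(6, split_hz / nyquist, btype="high")
    narrowband = lfilter(b_lo, a_lo, speech)[::decimation]
    highband = lfilter(b_hi, a_hi, speech)[::decimation]
    return narrowband, highband
```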
Apparatus 102 also includes an implementation 136 of speech encoder 130 that is configured to encode the separate subband signals according to the coding scheme selected by coding scheme selector 120. FIG. 23B shows a block diagram of an implementation 138 of speech encoder 136. Encoder 138 includes a spectral envelope calculator 140a (e.g., an instance of calculator 142) and a temporal information calculator 150a (e.g., an instance of calculator 152 or 154) that are configured to calculate descriptions of the spectral envelope and of the temporal information, respectively, based on the narrowband signal produced by filter bank A50 and according to the selected coding scheme. Encoder 138 also includes a spectral envelope calculator 140b (e.g., an instance of calculator 142) and a temporal information calculator 150b (e.g., an instance of calculator 152 or 154) that are configured to calculate descriptions of the spectral envelope and of the temporal information, respectively, based on the highband signal produced by filter bank A50 and according to the selected coding scheme. Encoder 138 also includes an implementation 162 of formatter 160 that is configured to generate an encoded frame that includes the calculated descriptions of the spectral envelope and of the temporal information.
As mentioned above, the description of the temporal information for the highband portion of a wideband speech signal may be based on the description of the temporal information for the narrowband portion of the signal. FIG. 24A shows a block diagram of a corresponding implementation 139 of wideband speech encoder 136. As with the speech encoder 138 described above, the encoder 139 comprises spectral envelope description calculators 140a and 140b arranged to calculate respective descriptions of spectral envelopes. The speech encoder 139 also comprises an instance 152a of the temporal information description calculator 152 (e.g. calculator 154) arranged to calculate a description of the temporal information based on the calculated description of the spectral envelope of the narrowband signal. The speech encoder 139 also includes an implementation 156 of the temporal information description calculator 150. The calculator 156 is configured to calculate a description of the time information of the high band signal, which is based on the description of the time information of the narrow band signal.
FIG. 24B shows a block diagram of an implementation 158 of temporal description calculator 156. Calculator 158 includes a highband excitation signal generator A60 that is configured to generate a highband excitation signal based on the narrowband excitation signal as produced by calculator 152a. For example, generator A60 may be configured to perform an operation such as spectral extension, harmonic extension, nonlinear extension, spectral folding, and/or spectral translation on the narrowband excitation signal (or one or more components thereof) to generate the highband excitation signal. Additionally or alternatively, generator A60 may be configured to perform spectral and/or amplitude shaping of random noise (e.g., a pseudorandom Gaussian noise signal) to generate the highband excitation signal. For a case in which generator A60 uses a pseudorandom noise signal, it may be desirable to synchronize generation of this signal by the encoder and the decoder. Such methods of and apparatus for highband excitation signal generation are described in more detail in, for example, U.S. Patent Application Publication No. 2007/0088542 (Vos et al.), entitled "SYSTEMS, METHODS, AND APPARATUS FOR WIDEBAND SPEECH CODING," published April 19, 2007. In the example of FIG. 24B, generator A60 is arranged to receive a quantized narrowband excitation signal. In another example, generator A60 is arranged to receive the narrowband excitation signal in another form (e.g., in a pre-quantization or dequantized form).
The calculator 158 also includes a synthesis filter a70 configured to generate a synthesized highband signal based on the highband excitation signal and a description of the spectral envelope of the highband signal (e.g., as generated by the calculator 140 b). Filter a70 is typically configured according to a set of values within a description of the spectral envelope of the highband signal (e.g., one or more LSP or LPC coefficient vectors) to produce a synthesized highband signal in response to a highband excitation signal. In the example of fig. 24B, synthesis filter a70 is arranged to receive a quantized description of the spectral envelope of the highband signal and may correspondingly be configured to comprise a dequantizer and (possibly) an inverse transform block. In another example, the filter a70 is arranged to receive a description of the spectral envelope of the highband signal in another form (e.g. in a pre-quantized or de-quantized form).
Calculator 158 also includes a highband gain factor calculator A80 configured to calculate a description of a temporal envelope of the highband signal, based on a temporal envelope of the synthesized highband signal. Calculator A80 may be configured to calculate this description to include one or more distances between a temporal envelope of the highband signal and the temporal envelope of the synthesized highband signal. For example, calculator A80 may be configured to calculate such a distance as a gain frame value (e.g., as a ratio between measures of the energy of corresponding frames of the two signals, or as a square root of such a ratio). Additionally or alternatively, calculator A80 may be configured to calculate a number of such distances as gain shape values (e.g., as ratios between measures of the energy of corresponding subframes of the two signals, or as square roots of such ratios). In the example of FIG. 24B, calculator 158 also includes a quantizer A90 configured to quantize the calculated description of the temporal envelope (e.g., as one or more codebook indices). Various features and implementations of the elements of calculator 158 are described in, for example, U.S. Patent Application Publication No. 2007/0088542 (Vos et al.) as cited above.
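For example, calculator A80 might compute such distances as follows; this is a sketch, and the subframe count and the guards against division by zero are illustrative.

```python
import numpy as np

def highband_gain_factors(highband, synthesized, n_subframes=4):
    """Gain frame value and gain shape values computed as square roots
    of energy ratios between corresponding frames and subframes of the
    original highband signal and the synthesized highband signal."""
    def energy(x):
        return float(np.sum(x ** 2)) + 1e-12  # avoid division by zero
    gain_frame = np.sqrt(energy(highband) / energy(synthesized))
    hb = highband.reshape(n_subframes, -1)
    syn = synthesized.reshape(n_subframes, -1)
    gain_shape = np.sqrt((np.sum(hb ** 2, axis=1) + 1e-12) /
                         (np.sum(syn ** 2, axis=1) + 1e-12))
    return gain_frame, gain_shape
```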
The various elements of an implementation of apparatus 100 may be embodied in any combination of hardware, software, and/or firmware that is deemed suitable for the intended application. For example, such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).
One or more elements of various implementations of the apparatus 100 as described herein may also be implemented, in whole or in part, as one or more sets of instructions arranged to be executed on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field programmable gate arrays), ASSPs (application specific standard products), and ASICs (application specific integrated circuits). Any of the various elements of an implementation of apparatus 100 may also be embodied as one or more computers (e.g., a machine comprising one or more arrays programmed to execute one or more sets or sequences of instructions, also referred to as a "processor"), and any two or more, or even all, of these elements may be implemented within the same such computer(s).
Various elements of an implementation of apparatus 100 may be included within a device for wireless communication, such as a cellular telephone or other device having such communication capabilities. Such a device may be configured to communicate with a circuit-switched and/or packet-switched network (e.g., using one or more protocols such as VoIP). Such a device may be configured to perform operations on a signal carrying encoded frames, such as interleaving, puncturing, convolutional encoding, error correction encoding, encoding of one or more network protocol (e.g., ethernet, TCP/IP, cdma2000) layers, Radio Frequency (RF) modulation, and/or RF transmission.
It may be possible for one or more elements of an implementation of apparatus 100 to be used to perform tasks or other sets of instructions that are not directly related to the operation of the apparatus, such as tasks related to another operation of a device or system in which the apparatus is embedded. It is also possible for one or more elements of an implementation of apparatus 100 to have a common structure (e.g., a processor to execute portions of code corresponding to different elements at different times, a set of instructions executed to perform tasks corresponding to different elements at different times, or an arrangement of electronic and/or optical devices performing operations for different elements at different times). In one such example, voice activity detector 110, coding scheme selector 120, and speech encoder 130 are implemented as a set of instructions arranged to execute on the same processor. In another such example, the spectral envelope description calculators 140a and 140b are implemented as the same set of instructions executed at different times.
FIG. 25A shows a flowchart of a method M200 of processing an encoded speech signal according to a general configuration. Method M200 is configured to receive information from two encoded frames and to generate a description of the spectral envelopes of two corresponding frames of the speech signal. Based on information from the first encoded frame (also referred to as a "reference" encoded frame), task T210 obtains a description of the spectral envelope of the first frame of the speech signal over the first and second frequency bands. Based on the information from the second encoded frame, task T220 obtains a description of a spectral envelope of a second frame (also referred to as a "target" frame) of the speech signal over the first frequency band. Based on the information from the reference encoded frame, task T230 obtains a description of a spectral envelope of the target frame over the second frequency band.
FIG. 26 shows an application of method M200, which method M200 receives information from two encoded frames and generates a description of the spectral envelopes of two corresponding inactive frames of a speech signal. Based on information from the reference encoded frame, task T210 obtains a description of the spectral envelope of the first inactive frame over the first and second frequency bands. This description may be a single description extending over the two frequency bands, or it may comprise separate descriptions each extending over a respective one of the frequency bands. Based on the information from the second encoded frame, task T220 obtains a description of a spectral envelope of the target inactive frame over the first frequency band (e.g., over a narrow-band range). Based on information from the reference encoded frame, task T230 obtains a description of a spectral envelope of the target inactive frame over a second frequency band (e.g., over a high-band range).
FIG. 26 shows an example in which each description of a spectral envelope has an LPC order, and in which the LPC order of the description of the spectral envelope of the target frame over the second frequency band is less than the LPC order of the description of the spectral envelope of the target frame over the first frequency band. Other examples include cases in which the LPC order of the description of the spectral envelope of the target frame over the second frequency band is at least fifty percent of, at least sixty percent of, not more than seventy-five percent of, not more than eighty percent of, equal to, or greater than the LPC order of the description of the spectral envelope of the target frame over the first frequency band. In one particular example, the LPC orders of the descriptions of the spectral envelope of the target frame over the first and second frequency bands are 10 and 6, respectively. FIG. 26 also shows an example in which the LPC order of the description of the spectral envelope of the first inactive frame over the first and second frequency bands is equal to the sum of the LPC orders of the descriptions of the spectral envelope of the target frame over the first and second frequency bands. In another example, the LPC order of the description of the spectral envelope of the first inactive frame over the first and second frequency bands may be greater than or less than that sum.
Each of tasks T210 and T220 may be configured to include one or both of the following two operations: parsing the encoded frame to extract a quantization description of a spectral envelope; and dequantizing the quantized description of the spectral envelope to obtain a set of parameters of a coding model for the frame. Typical implementations of tasks T210 and T220 include these two operations such that each task processes a respective encoded frame to produce a description of the spectral envelope in the form of a set of model parameters (e.g., one or more LSF, LSP, ISF, ISP, and/or LPC coefficient vectors). In one particular example, the reference encoded frame has a length of 80 bits and the second encoded frame has a length of 16 bits. In other examples, the length of the second encoded frame does not exceed twenty, twenty-five, thirty, forty, fifty, or sixty percent of the length of the reference encoded frame.
The reference encoded frame may comprise a quantized description of the spectral envelope over the first and second frequency bands, and the second encoded frame may comprise a quantized description of the spectral envelope over the first frequency band. In one particular example, the quantized description of the spectral envelope over the first and second frequency bands included in the reference encoded frame has a length of 40 bits, and the quantized description of the spectral envelope over the first frequency band included in the second encoded frame has a length of 10 bits. In other examples, the length of the quantized description of the spectral envelope over the first frequency band included in the second encoded frame is not greater than twenty-five, thirty, forty, fifty, or sixty percent of the length of the quantized description of the spectral envelope over the first and second frequency bands included in the reference encoded frame.
Tasks T210 and T220 may also be implemented to produce descriptions of temporal information based on information from the respective encoded frames. For example, one or both of these tasks may be configured to obtain a description of a temporal envelope, a description of an excitation signal, and/or a description of pitch information based on information from the respective encoded frame. As with obtaining a description of a spectral envelope, such a task may include parsing a quantized description of temporal information from the encoded frame and/or dequantizing the quantized description of the temporal information. Implementations of method M200 may also be configured such that task T210 and/or task T220 obtains the description of the spectral envelope and/or the description of the temporal information based on information from one or more other encoded frames as well, such as information from one or more preceding encoded frames. For example, descriptions of the excitation signal and of pitch information for a frame are typically based on information from preceding frames.
The reference encoded frame may include a quantized description of temporal information for the first and second frequency bands, and the second encoded frame may include a quantized description of temporal information for the first frequency band. In one particular example, the quantization description for the temporal information for the first and second frequency bands included in the reference encoded frame has a length of 34 bits, and the quantization description for the temporal information for the first frequency band included in the second encoded frame has a length of 5 bits. In other examples, the length of the quantized description of temporal information for the first frequency band included in the second encoded frame is not greater than fifteen, twenty-five, thirty, forty, fifty, or sixty percent of the length of the quantized description of temporal information for the first and second frequency bands included in the reference encoded frame.
Method M200 is typically performed as part of a larger speech decoding method, and speech decoders and speech decoding methods configured to perform method M200 are expressly contemplated and hereby disclosed. The speech encoding device may be configured to perform an implementation of method M100 at the encoder and to perform an implementation of method M200 at the decoder. In this case, the "second frame" as encoded by task T120 corresponds to the reference encoded frame supplying information processed by tasks T210 and T230, and the "third frame" as encoded by task T130 corresponds to the encoded frame supplying information processed by task T220. Fig. 27A illustrates this relationship between methods M100 and M200 using an example of a series of consecutive frames encoded using method M100 and decoded using method M200. Alternatively, a speech encoding device may be configured to perform an implementation of method M300 at an encoder and to perform an implementation of method M200 at a decoder. Fig. 27B illustrates this relationship between methods M300 and M200 using an example of a pair of consecutive frames encoded using method M300 and decoded using method M200.
Note, however, that method M200 may also be applied to process information from encoded frames that are not contiguous. For example, method M200 may be applied such that tasks T220 and T230 process information from respective encoded frames that are not contiguous. Method M200 is typically implemented such that task T230 reuses information from the reference encoded frame while task T220 is iterated over a series of consecutive encoded inactive frames that follows the reference encoded frame, so as to generate a series of corresponding consecutive target frames. This iteration may continue, for example, until a new reference encoded frame is received, until an encoded active frame is received, and/or until a maximum number of target frames has been generated.
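Such a decoding loop might be driven as in the following sketch. The frame-kind attribute and the three decoding callbacks are assumptions introduced for illustration, not elements of the disclosed apparatus:

```python
def decode_stream(encoded_frames, decode_active, decode_reference,
                  decode_target, max_targets=32):
    """Iterate the decoding tasks over a frame sequence: a reference encoded
    frame refreshes the stored reference information; each following encoded
    inactive frame is decoded as a target frame until a new reference frame
    or an active frame arrives, or until max_targets targets are generated."""
    reference_info = None
    num_targets = 0
    for frame in encoded_frames:
        if frame.kind == "reference":          # wideband inactive frame
            reference_info = decode_reference(frame)
            num_targets = 0
        elif (frame.kind == "inactive" and reference_info is not None
              and num_targets < max_targets):
            decode_target(frame, reference_info)   # tasks T220 and T230
            num_targets += 1
        else:
            decode_active(frame)
```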
Task T220 is configured to obtain a description of a spectral envelope of the target frame over the first frequency band based at least primarily on information from the second encoded frame. For example, task T220 may be configured to obtain a description of a spectral envelope of the target frame over a first frequency band based entirely on information from the second encoded frame. Alternatively, task T220 may be configured to obtain a description of a spectral envelope of the target frame over the first frequency band based on other information as well, such as information from one or more previously encoded frames. In this case, task T220 is configured to weight information from the second encoded frame more than other information. For example, such an implementation of task T220 may be configured to calculate a description of a spectral envelope of the target frame over the first frequency band as an average of information from the second encoded frame and information from previously encoded frames, wherein information from the second encoded frame is weighted more than information from previously encoded frames. Likewise, task T220 may be configured to obtain a description of the temporal information for the first frequency band for the target frame based at least primarily on information from the second encoded frame.
Based on information from the reference encoded frame (also referred to herein as "reference spectral information"), task T230 obtains a description of a spectral envelope of the target frame over the second frequency band. FIG. 25B shows a flowchart of an implementation M210 of method M200 that includes an implementation T232 of task T230. As an implementation of task T230, task T232 obtains a description of a spectral envelope of the target frame over the second frequency band based on the reference spectral information. In this case, the reference spectral information is included within a description of a spectral envelope of a first frame of the speech signal. FIG. 28 shows an application of method M210 that receives information from two encoded frames and produces descriptions of the spectral envelopes of two corresponding inactive frames of the speech signal.
Task T230 is configured to obtain a description of a spectral envelope of the target frame over the second frequency band based at least primarily on the reference spectral information. For example, task T230 may be configured to obtain a description of a spectral envelope of the target frame over the second frequency band based entirely on the reference spectral information. Alternatively, task T230 may be configured to obtain a description of the spectral envelope of the target frame over the second frequency band based on (a) the description of the spectral envelope over the second frequency band based on the reference spectral information and (B) the description of the spectral envelope over the second frequency band based on information from the second encoded frame.
In this case, task T230 may be configured to weight the description that is based on the reference spectral information more heavily than the description that is based on information from the second encoded frame. For example, such an implementation of task T230 may be configured to calculate the description of the spectral envelope of the target frame over the second frequency band as an average of the two descriptions, wherein the description based on the reference spectral information is weighted more heavily than the description based on information from the second encoded frame. In another case, the LPC order of the description that is based on the reference spectral information may be greater than the LPC order of the description that is based on information from the second encoded frame. For example, the LPC order of the description based on information from the second encoded frame may be 1 (e.g., a spectral tilt value). Likewise, task T230 may be configured to obtain a description of the temporal information for the second frequency band for the target frame based at least primarily on the reference temporal information (e.g., based entirely on the reference temporal information, or also based in smaller part on information from the second encoded frame).
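The weighting just described might be sketched as follows; the weight value is an assumed example, and equal LPC orders are assumed (a lower-order description such as a spectral tilt would first be expanded to the reference's order):

```python
import numpy as np

def target_band2_description(ref_desc, frame_desc, w_ref=0.8):
    """Weighted average of the description based on the reference spectral
    information (ref_desc) and the description based on the second encoded
    frame (frame_desc), with the reference weighted more heavily; w_ref is
    an assumed value."""
    ref_desc = np.asarray(ref_desc, dtype=float)
    frame_desc = np.asarray(frame_desc, dtype=float)
    return w_ref * ref_desc + (1.0 - w_ref) * frame_desc
```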
Task T210 may be implemented to obtain, from the reference encoded frame, a description of the spectral envelope that is a single full-band representation over both the first and second frequency bands. However, task T210 is more typically implemented to obtain this description as a separate description of the spectral envelope over the first frequency band and over the second frequency band. For example, task T210 may be configured to obtain a separate description from a reference encoded frame that has been encoded using a banded encoding scheme (e.g., encoding scheme 2) as described herein.
FIG. 25C shows a flowchart of an implementation M220 of method M210 in which task T210 is implemented as two tasks T212a and T212b. Based on information from the reference encoded frame, task T212a obtains a description of a spectral envelope of the first frame over the first frequency band. Based on information from the reference encoded frame, task T212b obtains a description of a spectral envelope of the first frame over the second frequency band. Each of tasks T212a and T212b may include parsing a quantized description of a spectral envelope from the respective encoded frame and/or dequantizing such a description. FIG. 29 shows an application of method M220 that receives information from two encoded frames and produces descriptions of the spectral envelopes of two corresponding inactive frames of the speech signal.
Method M220 also includes an implementation T234 of task T232. As an implementation of task T230, task T234 obtains a description of a spectral envelope of the target frame over the second frequency band that is based on the reference spectral information. As in task T232, the reference spectral information is included within a description of a spectral envelope of a first frame of the speech signal. In the particular case of task T234, the reference spectral information is included within (and may be identical to) the description of the spectral envelope of the first frame over the second frequency band.
FIG. 29 shows an example in which the descriptions of the spectral envelopes have an LPC order, and the LPC order of the description of the spectral envelope of the first inactive frame over each of the first and second frequency bands is equal to the LPC order of the description of the spectral envelope of the target inactive frame over the respective frequency band. Other examples include cases where the LPC order of one or both of the descriptions of the spectral envelopes of the first inactive frame over the first and second frequency bands is greater than the LPC order of the corresponding description of the spectral envelope of the target inactive frame over the respective frequency band.
The reference encoded frame may include a quantized description of a spectral envelope over the first frequency band and a quantized description of a spectral envelope over the second frequency band. In one particular example, the quantized description of the spectral envelope over the first frequency band included in the reference encoded frame has a length of 28 bits, and the quantized description of the spectral envelope over the second frequency band included in the reference encoded frame has a length of 12 bits. In other examples, the length of the quantized description of the spectral envelope over the second frequency band included in the reference encoded frame is not greater than forty-five, fifty, sixty, or seventy percent of the length of the quantized description of the spectral envelope over the first frequency band included in the reference encoded frame.
The reference encoded frame may include a quantized description of temporal information for the first frequency band and a quantized description of temporal information for the second frequency band. In one particular example, the quantized description of temporal information for the second frequency band included in the reference encoded frame has a length of 15 bits, and the quantized description of temporal information for the first frequency band included in the reference encoded frame has a length of 19 bits. In other examples, the length of the quantized description of temporal information for the second frequency band included in the reference encoded frame is not greater than eighty or ninety percent of the length of the quantized description of temporal information for the first frequency band included in the reference encoded frame.
The second encoded frame may comprise a quantized description of a spectral envelope over the first frequency band and/or a quantized description of temporal information for the first frequency band. In one particular example, the quantized description of the spectral envelope over the first frequency band included in the second encoded frame has a length of 10 bits. In other examples, the length of the quantized description of the spectral envelope over the first frequency band included in the second encoded frame is not greater than forty, fifty, sixty, seventy, or seventy-five percent of the length of the quantized description of the spectral envelope over the first frequency band included in the reference encoded frame. In one particular example, the quantized description of the temporal information for the first frequency band included in the second encoded frame has a length of 5 bits. In other examples, the length of the quantized description of the temporal information for the first frequency band included in the second encoded frame is not greater than thirty, forty, fifty, sixty, or seventy percent of the length of the quantized description of the temporal information for the first frequency band included in the reference encoded frame.
In an exemplary implementation of method M200, the reference spectral information is a description of a spectral envelope over the second frequency band. This description may include a set of model parameters, such as one or more LSP, LSF, ISP, ISF, or LPC coefficient vectors. In general, this description is the description of the spectral envelope of the first inactive frame over the second frequency band, as obtained from the reference encoded frame by task T210. It is also possible for the reference spectral information to include a description of a spectral envelope (e.g., of the first inactive frame) over the first frequency band and/or over another frequency band.
Task T230 generally includes an operation to retrieve reference spectral information from an array of storage elements, such as semiconductor memory (also referred to herein as a "buffer"). For the case where the reference spectral information includes a description of a spectral envelope over the second frequency band, the act of retrieving the reference spectral information may be sufficient to complete task T230. However, even for this case, task T230 may still need to be configured to compute a description of the spectral envelope of the target frame over the second frequency band (also referred to herein as the "target spectral description") rather than simply retrieve it. For example, task T230 may be configured to calculate the target spectral description by adding random noise to the reference spectral information. Alternatively or additionally, task T230 may be configured to calculate the description based on spectral information from one or more additional encoded frames (e.g., based on information from more than one reference encoded frame). For example, task T230 may be configured to calculate the target spectral description as an average of descriptions of spectral envelopes over the second frequency band from two or more reference encoded frames, and this calculation may include adding random noise to the calculated average.
Task T230 may be configured to calculate the target spectral description by extrapolating in time from the reference spectral information or by interpolating in time between descriptions of spectral envelopes over the second frequency band from two or more reference encoded frames. Alternatively or additionally, task T230 may be configured to calculate the target spectral description by extrapolating in frequency from a description of a spectral envelope of the target frame over another frequency band (e.g., over the first frequency band) and/or by interpolating in frequency between descriptions of spectral envelopes over other frequency bands.
In general, the reference spectral information and the target spectral description are each a vector of spectral parameter values (or "spectral vector"). In one such example, both the target and reference spectral vectors are LSP vectors. In another example, both the target and reference spectral vectors are LPC coefficient vectors. In yet another example, both the target and reference spectral vectors are reflection coefficient vectors. Task T230 may be configured to copy the target spectral description from the reference spectral information according to an expression such as s_ti = s_ri, 1 ≤ i ≤ n, where s_t is the target spectral vector, s_r is the reference spectral vector (whose values are typically in the range of −1 to +1), i is the vector element index, and n is the length of vector s_t. In a variation of this operation, task T230 is configured to apply a weighting factor (or a vector of weighting factors) to the reference spectral vector. In another variation of this operation, task T230 is configured to calculate the target spectral vector by adding random noise to the reference spectral vector according to an expression such as s_ti = s_ri + z_i, 1 ≤ i ≤ n, where z is a vector of random values. In this case, each element of z may be a random variable whose values are distributed (e.g., uniformly) over a desired range.
It may be desirable to ensure that the values of the target spectral description are constrained (e.g., to the range of −1 to +1). In this case, task T230 may be configured to calculate the target spectral vector according to an expression such as s_ti = w s_ri + z_i, 1 ≤ i ≤ n, where w has a value between 0 and 1 (e.g., in the range of 0.3 to 0.9) and the value of each element of z is distributed (e.g., uniformly) over the range from −(1−w) to +(1−w).
In another example, task T230 is configured to calculate the target spectral description based on descriptions of spectral envelopes over the second frequency band from each of more than one reference encoded frame (e.g., from each of the two most recent reference encoded frames). In one such example, task T230 is configured to calculate the target spectral vector as an average of the information from the reference encoded frames according to an expression such as s_ti = (s_r1i + s_r2i)/2, 1 ≤ i ≤ n, where s_r1 denotes the spectral vector from the most recent reference encoded frame and s_r2 denotes the spectral vector from the next most recent reference encoded frame. In a related example, the reference vectors are weighted differently from one another (e.g., the vector from the more recent reference encoded frame may be weighted more heavily).
In yet another example, task T230 is configured to generate the target spectral description as a set of random values over a range based on information from two or more reference encoded frames. For example, task T230 may be configured to calculate target spectral vector s_t as a random average of the spectral vectors from each of the two most recent reference encoded frames, according to an expression such as s_ti = (s_r1i + s_r2i)/2 + z_i (s_r1i − s_r2i)/2, where the value of each element of z is distributed (e.g., uniformly) over the range of −1 to +1. FIG. 30A illustrates the result of iterating this implementation of task T230 over each of a series of consecutive target frames (for one of the n values of i), with random vector z being reevaluated for each iteration, where each open circle indicates the value s_ti.
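A minimal sketch of the noisy copy and the random average, assuming uniformly distributed noise and the weight range suggested above:

```python
import numpy as np

rng = np.random.default_rng(0)

def copy_with_noise(s_r, w=0.7):
    """s_t = w*s_r + z, with each z_i uniform on [-(1-w), +(1-w)]; keeps s_t
    within [-1, +1] whenever s_r is (the constrained variant above)."""
    s_r = np.asarray(s_r, dtype=float)
    z = rng.uniform(-(1.0 - w), 1.0 - w, size=s_r.shape)
    return w * s_r + z

def random_average(s_r1, s_r2):
    """Random average of the spectral vectors from the two most recent
    reference encoded frames, with each z_i uniform on [-1, +1]."""
    s_r1 = np.asarray(s_r1, dtype=float)
    s_r2 = np.asarray(s_r2, dtype=float)
    z = rng.uniform(-1.0, 1.0, size=s_r1.shape)
    return 0.5 * (s_r1 + s_r2) + 0.5 * z * (s_r1 - s_r2)
```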
Task T230 may be configured to calculate the target spectral description by interpolating in time between descriptions of spectral envelopes over the second frequency band from the two most recent reference encoded frames. For example, task T230 may be configured to perform linear interpolation over a series of p target frames, where p is an adjustable parameter. In this case, task T230 may be configured to calculate the target spectral vector for the j-th target frame in the series according to an expression such as s_ti = α_j s_r1i + (1 − α_j) s_r2i, where α_j = j/p and 1 ≤ j ≤ p.
FIG. 30B illustrates the result of iterating this implementation of task T230 over a series of consecutive target frames (for one of the n values of i), where p is equal to 8 and each open circle indicates the value s_ti of the corresponding target frame. Other examples of values for p include 4, 16, and 32. It may be desirable to configure this implementation of task T230 to add random noise to the interpolated description.
FIG. 30B also shows an implementation in which task T230 is configured to copy reference vector s_r1 to target vector s_t for each subsequent target frame in a series that is longer than p (e.g., until a new reference encoded frame or the next encoded active frame is received). In a related example, the series of target frames has length mp, where m is an integer greater than 1 (e.g., 2 or 3), and each of the p calculated vectors is used as the target spectral description for each of m corresponding consecutive target frames in the series.
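The interpolation of FIG. 30B, including the copying of s_r1 for frames beyond the p-th, might be sketched as a generator (the weight α_j = j/p follows the expression above):

```python
import numpy as np

def interpolated_targets(s_r1, s_r2, p=8):
    """Yield target spectral vectors: linear interpolation toward the most
    recent reference vector s_r1 over p frames, then copies of s_r1 for any
    remaining target frames in a longer series."""
    s_r1 = np.asarray(s_r1, dtype=float)
    s_r2 = np.asarray(s_r2, dtype=float)
    for j in range(1, p + 1):
        alpha = j / p
        yield alpha * s_r1 + (1.0 - alpha) * s_r2
    while True:
        yield s_r1.copy()
```

For example, itertools.islice(interpolated_targets(a, b), 12) would produce descriptions for twelve consecutive target frames, the last four being copies of a.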
Task T230 may be implemented in a number of different ways to perform the interpolation between descriptions of spectral envelopes over the second frequency band from the two most recent reference frames. In another example, task T230 is configured to perform a linear interpolation over a series of p target frames by calculating the target vector for the j-th target frame in the series according to a pair of expressions such as s_ti = α_1 s_r1i + (1 − α_1) s_r2i for all integers j such that 0 < j ≤ q, and s_ti = (1 − α_2) s_r1i + α_2 s_r2i for all integers j such that q < j ≤ p, where the interpolation weights α_1 and α_2 are functions of j. FIG. 30C illustrates the result of iterating this implementation of task T230 over each of a series of consecutive target frames (for one of the n values of i), where q has a value of 4 and p has a value of 8. This configuration may provide a smoother transition to the first target frame than the result shown in FIG. 30B.
Task T230 may be implemented in a similar manner for any positive integer values of q and p such that q < p; specific examples of usable values of (q, p) include (4, 8), (4, 12), (4, 16), (8, 24), (8, 32), and (16, 32). In a related example as described above, each of the p calculated vectors is used as the target spectral description for each of m corresponding consecutive target frames in a series of mp target frames. It may be desirable to configure this implementation of task T230 to add random noise to the interpolated description. FIG. 30C also shows an implementation in which task T230 is configured to copy reference vector s_r1 to target vector s_t for each subsequent target frame in a series that is longer than p (e.g., until a new reference encoded frame or the next encoded active frame is received).
Task T230 may also be implemented to calculate a target spectral description based on a spectral envelope of one or more frames over another frequency band in addition to the reference spectral information. For example, such an implementation of task T230 may be configured to calculate the target spectral description by extrapolating in frequency from a spectral envelope of the current frame and/or one or more previous frames over another frequency band (e.g., a first frequency band).
Task T230 may also be configured to obtain a description of temporal information of the target inactive frame over the second frequency band based on information from the reference encoded frame (also referred to herein as "reference temporal information"). The reference temporal information is typically a description of temporal information over the second frequency band. This description may include one or more gain frame values, gain shape values, pitch parameter values, and/or codebook indices. In general, this description is the description of temporal information over the second frequency band for the first inactive frame, as obtained by task T210 from the reference encoded frame. It is also possible for the reference temporal information to include a description of temporal information (e.g., of the first inactive frame) over the first frequency band and/or over another frequency band.
Task T230 may be configured to obtain a description of the temporal information of the target frame on the second frequency band (also referred to herein as a "target temporal description") by copying the reference temporal information. Alternatively, it may be necessary to configure task T230 to obtain the target time description by calculating the target time description based on the reference time information. For example, task T230 may be configured to calculate the target time description by adding random noise to the reference time information. Task T230 may also be configured to calculate a target time description based on information from more than one reference encoded frame. For example, task T230 may be configured to calculate the target time description as an average of descriptions of the time information over the second frequency band from two or more reference encoded frames, and this calculation may include adding random noise to the calculated average.
The target time description and the reference time information may each comprise a description of a time envelope. As mentioned above, the description of the temporal envelope may comprise a gain frame value and/or a set of gain shape values. Alternatively or additionally, the target time description and the reference time information may each comprise a description of the excitation signal. The description of the excitation signal may include a description of a pitch component (e.g., a pitch lag, a pitch gain, and/or a description of a prototype).
Task T230 is generally configured to set the gain shape of the target time description to be flat. For example, task T230 may be configured to set the gain shape values of the target time description to be equal to each other. One such implementation of task T230 is configured to set all of the gain shape values to a factor of 1 (e.g., 0 dB). Another such implementation is configured to set all of the gain shape values to a factor of 1/n, where n is the number of gain shape values in the target time description.
Task T230 may be iterated to calculate a target time description for each of a series of target frames. For example, task T230 may be configured to calculate a gain frame value for each of a series of consecutive target frames based on the gain frame value from the most recent reference encoded frame. In such cases, it may be desirable to configure task T230 to add random noise to the gain frame value of each target frame (or to the gain frame value of each target frame in the series after the first), because the temporal envelope of the series may otherwise be perceived as unnaturally smooth. Such an implementation of task T230 may be configured to calculate a gain frame value g_t for each target frame in the series according to an expression such as g_t = z g_r or g_t = w g_r + (1 − w) z, where g_r is the gain frame value from the reference encoded frame, z is a random value that is reevaluated for each of the series of target frames, and w is a weighting factor. Typical ranges for the value of z include 0 to 1 and −1 to +1. Typical ranges for the value of w include 0.5 (or 0.6) to 0.9 (or 1.0).
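The two noise-injection variants might look like this, with the z range and the weight chosen from the example ranges above:

```python
import numpy as np

rng = np.random.default_rng(0)

def gain_multiplicative(g_r):
    """g_t = z * g_r, with z redrawn (here uniform on [0, 1]) per frame."""
    return rng.uniform(0.0, 1.0) * g_r

def gain_additive(g_r, w=0.8):
    """g_t = w*g_r + (1 - w)*z, with z redrawn per target frame."""
    return w * g_r + (1.0 - w) * rng.uniform(0.0, 1.0)
```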
Task T230 may be configured to calculate a gain frame value for the target frame based on gain frame values from the two or three most recent reference encoded frames. In one such example, task T230 is configured to calculate the gain frame value of the target frame as an average according to an expression such as g_t = (g_r1 + g_r2)/2, where g_r1 is the gain frame value from the most recent reference encoded frame and g_r2 is the gain frame value from the next most recent reference encoded frame. In a related example, the reference gain frame values are weighted differently from one another (e.g., the more recent value may be weighted more heavily). It may be desirable to implement task T230 to calculate a gain frame value for each of a series of target frames based on this average. For example, such an implementation of task T230 may be configured to calculate the gain frame value for each target frame in the series (or for each target frame in the series after the first) by adding a different random noise value to the calculated average gain frame value.
In another example, task T230 is configured to calculate the gain frame value for the target frame as a moving average of gain frame values from consecutive reference encoded frames. Such an implementation of task T230 may be configured to calculate the target gain frame value according to an autoregressive (AR) expression such as g_cur = α g_prev + (1 − α) g_r, where g_cur and g_prev are the current and previous values, respectively, of the moving-average gain frame value. For the smoothing factor α, it may be desirable to use a value between 0.5 (or 0.75) and 1, such as 0.8 or 0.9. It may be desirable to implement task T230 to calculate a value g_t for each of a series of target frames based on this moving average. For example, such an implementation of task T230 may be configured to calculate the value g_t for each target frame in the series (or for each target frame in the series after the first) by adding a different random noise value to the moving-average gain frame value g_cur.
In yet another example, task T230 is configured to apply an attenuation factor to the contribution from the reference temporal information. For example, task T230 may be configured to calculate the moving-average gain frame value according to an expression such as g_cur = α g_prev + (1 − α) β g_r, where the attenuation factor β is an adjustable parameter having a value less than 1, such as a value in the range of 0.5 to 0.9 (e.g., 0.6). It may be desirable to implement task T230 to calculate a value g_t for each of a series of target frames based on this moving average. For example, such an implementation of task T230 may be configured to calculate the value g_t for each target frame in the series (or for each target frame in the series after the first) by adding a different random noise value to the moving-average gain frame value g_cur.
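A sketch of the attenuated moving-average recursion, where α, β, the noise amplitude, and the initialization of g_prev to g_r are assumed example choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_gain_track(g_r, num_frames, alpha=0.9, beta=0.6, noise=0.05):
    """Yield one gain value per target frame from the AR recursion
    g_cur = alpha*g_prev + (1 - alpha)*beta*g_r, adding a small random
    perturbation to each output value."""
    g_prev = g_r
    for _ in range(num_frames):
        g_cur = alpha * g_prev + (1.0 - alpha) * beta * g_r
        yield g_cur + rng.uniform(-noise, noise)
        g_prev = g_cur
```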
It may be desirable to iterate task T230 to calculate a target spectral description and a target time description for each of a series of target frames. In this case, task T230 may be configured to update the target spectral description and the target time description at different rates. For example, such an implementation of task T230 may be configured to calculate a different target spectral description for each target frame but to use the same target time description for more than one consecutive target frame.
Embodiments of method M200 (including methods M210 and M220) are generally configured to include operations to store reference spectral information to a buffer. This implementation of method M200 may also include the operation of storing the reference time information to a buffer. Alternatively, this implementation of method M200 may include the operation of storing both the reference spectrum information and the reference time information to a buffer.
Different implementations of method M200 may use different criteria in deciding whether to store encoded frame-based information as reference spectral information. The decision to store the reference spectral information is typically based on the coding scheme of the encoded frame and may also be based on the coding scheme of one or more previous and/or subsequent encoded frames. This implementation of method M200 may be configured to use the same or different criteria in deciding whether to store the reference time information.
It may be desirable to implement method M200 such that stored reference spectral information is available from more than one reference encoded frame at the same time. For example, task T230 may be configured to calculate a target spectral description that is based on information from more than one reference encoded frame. In such cases, method M200 may be configured to maintain in storage, at any one time, reference spectral information from the most recent reference encoded frame, from the second most recent reference encoded frame, and possibly from one or more less recent reference encoded frames. Such a method may also be configured to maintain the same history, or a different one, for the reference temporal information. For example, method M200 may be configured to maintain a description of the spectral envelope from each of the two most recent reference encoded frames and a description of temporal information from only the most recent reference encoded frame.
As mentioned above, each of the encoded frames may include a coding index that identifies the coding scheme or coding rate or mode according to which the frame is encoded. Alternatively, the speech decoder may be configured to determine at least a portion of the coding index from the encoded frame. For example, a speech decoder may be configured to determine a bit rate for an encoded frame from one or more parameters, such as frame energy. Similarly, for coding devices that support more than one coding mode for a particular coding rate, a speech decoder may be configured to determine the appropriate coding mode from the format of the encoded frame.
Not all encoded frames in the encoded speech signal will qualify as reference encoded frames. For example, an encoded frame that does not include a description of the spectral envelope over the second frequency band will generally not be suitable for use as a reference encoded frame. In some applications, it may be desirable to treat any encoded frame that contains a description of the spectral envelope over the second frequency band as a reference encoded frame.
A corresponding implementation of method M200 may be configured to store information based on the current encoded frame as reference spectral information if the frame contains a description of a spectral envelope over the second frequency band. For example, in the case of a set of coding schemes as shown in fig. 18, this implementation of method M200 may be configured to store the reference spectral information if the coding index of the frame indicates either of coding schemes 1 and 2 (i.e., not coding scheme 3). More generally, this implementation of method M200 may be configured to store the reference spectral information if the coding index of the frame indicates a wideband coding scheme rather than a narrowband coding scheme.
It may be desirable to implement method M200 to obtain the target spectral description (i.e., to perform task T230) only for inactive target frames. In such cases, it may be desirable for the reference spectral information to be based only on encoded inactive frames and not on encoded active frames. Although an active frame also includes background noise, reference spectral information that is based on an encoded active frame is likely to include information relating to the speech components as well, which may corrupt the target spectral description.
This implementation of method M200 may be configured to store information based on a current encoded frame as reference spectral information if the coding index of the frame indicates a particular coding mode (e.g., NELP). Other implementations of method M200 are configured to store information based on a current encoded frame as reference spectral information if the coding index of the frame indicates a particular coding rate (e.g., half-rate). Other implementations of method M200 are configured to store information based on the current encoded frame as reference spectral information according to a combination of the following criteria: for example, if the coding index of a frame indicates that the frame contains a description of the spectral envelope over the second frequency band and also indicates a particular coding mode and/or rate. Other implementations of method M200 are configured to store information based on a current encoded frame as reference spectral information if the coding index of the frame indicates a particular coding scheme (e.g., coding scheme 2 in the example according to fig. 18, or a wideband coding scheme reserved for inactive frames in another example).
It may not be possible to determine whether a frame is active or inactive from its coding index alone. For example, in the set of encoding schemes shown in FIG. 18, encoding scheme 2 is used for both active and inactive frames. In such a case, the coding index of one or more subsequent frames may help to indicate whether the encoded frame is inactive. For example, the description above discloses several speech coding methods in which a frame encoded using coding scheme 2 is inactive if the subsequent frame is encoded using coding scheme 3. A corresponding implementation of method M200 may be configured to store information based on the current encoded frame as reference spectral information if the coding index of the current encoded frame indicates coding scheme 2 and the coding index of the next encoded frame indicates coding scheme 3. In a related example, an implementation of method M200 is configured to store information based on an encoded frame as reference spectral information if that frame is encoded at half rate and the next frame is encoded at eighth rate.
For the case where the decision to store information based on an encoded frame as reference spectral information depends on information from a subsequent encoded frame, method M200 may be configured to perform the operation of storing the reference spectral information in two parts. The first part of the storage operation stores information based on the encoded frame temporarily. This implementation of method M200 may be configured to store such information temporarily for all frames, or for all frames that meet some predetermined criterion (e.g., all frames having a particular coding rate, mode, or scheme). Three different examples of such criteria are (1) frames whose coding index indicates the NELP coding mode, (2) frames whose coding index indicates half rate, and (3) frames whose coding index indicates coding scheme 2 (e.g., in an application of the set of coding schemes according to FIG. 18).
The second part of the storage operation stores the temporarily stored information as reference spectral information if a predetermined condition is satisfied. This implementation of method M200 may be configured to defer this part of the operation until one or more subsequent frames have been received (e.g., until the coding mode, rate, or scheme of the next encoded frame is known). Three different examples of such conditions are (1) the coding index of the next encoded frame indicates eighth rate, (2) the coding index of the next encoded frame indicates a coding mode that is used only for inactive frames, and (3) the coding index of the next encoded frame indicates coding scheme 3 (e.g., in an application of the set of coding schemes according to FIG. 18). If the condition of the second part of the storage operation is not satisfied, the temporarily stored information may be discarded or overwritten.
The second portion of the two-part operation to store reference spectral information may be implemented according to any of a number of different configurations. In one example, the second portion of the store operation is configured to change the state of a flag associated with the storage location holding the temporarily stored information (e.g., from a state indicating "temporary" to a state indicating "reference"). In another example, the second portion of the storage operation is configured to transfer the temporarily stored information to a buffer reserved for storing reference spectrum information. In yet another example, the second portion of the storage operation is configured to update one or more pointers to a buffer (e.g., a ring buffer) that holds the temporarily stored reference spectrum information. In this case, the pointers may comprise a read pointer indicating a location of reference spectral information from the most recent reference encoded frame and/or a write pointer indicating a location where temporarily stored information is to be stored.
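The two-part storage operation might be sketched with a single-slot buffer whose commit step is driven by the next frame's coding index. The scheme numbers follow the FIG. 18 example; the class itself and the driver loop are illustrative:

```python
class ReferenceInfoBuffer:
    """Single-slot two-part storage: stage() holds candidate information
    temporarily; commit() promotes it to reference information; a later
    stage() call overwrites an uncommitted candidate."""

    def __init__(self):
        self._pending = None
        self.reference = None

    def stage(self, info):          # first part of the storage operation
        self._pending = info

    def commit(self):               # second part, once the condition holds
        if self._pending is not None:
            self.reference = self._pending
            self._pending = None

def process_indices(frames):
    """frames: iterable of (scheme, info) pairs, with schemes numbered as in
    the FIG. 18 example (2 = 'hybrid', 3 = inactive only)."""
    buf = ReferenceInfoBuffer()
    for scheme, info in frames:
        if scheme == 3:      # next frame inactive: complete prior storage
            buf.commit()
        if scheme == 2:      # 'hybrid' scheme: store temporarily
            buf.stage(info)
    return buf.reference
```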
FIG. 31 shows a corresponding portion of a state diagram for a speech decoder configured to perform an implementation of method M200 in which the coding scheme of the subsequent encoded frame is used to decide whether to store information based on the current encoded frame as reference spectral information. In this figure, the path labels indicate the frame type associated with the coding scheme of the current frame, where A indicates a coding scheme used only for active frames, I indicates a coding scheme used only for inactive frames, and M (for "hybrid") indicates a coding scheme used for both active and inactive frames. For example, such a decoder may be included in a coding system that uses the set of encoding schemes shown in FIG. 18, where schemes 1, 2, and 3 correspond to path labels A, M, and I, respectively. As shown in FIG. 31, information is temporarily stored for every encoded frame whose coding index indicates a "hybrid" coding scheme. The storage of the temporarily stored information as reference spectral information is completed if the coding index of the next frame indicates that that frame is inactive. Otherwise, the temporarily stored information may be discarded or overwritten.
It is expressly noted that the previous discussion relating to selective and temporary storage of reference spectrum information, and the accompanying state diagram of fig. 31, may also apply to reference time information storage in an implementation of method M200 configured to store reference time information.
In a typical application of an implementation of method M200, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions) embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other non-volatile memory cards, semiconductor memory chips, etc.) that is readable and/or executable by a machine (e.g., a computer) that includes an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of method M200 may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communication, such as a cellular telephone or other device having such communication capabilities. Such a device may be configured to communicate with a circuit-switched and/or packet-switched network (e.g., using one or more protocols such as VoIP). For example, such a device may comprise RF circuitry configured to receive encoded frames.
FIG. 32A shows a block diagram of an apparatus 200 for processing an encoded speech signal according to a general configuration. For example, apparatus 200 may be configured to perform a speech decoding method comprising an implementation of method M200 as described herein. The apparatus 200 includes control logic 210 configured to generate a control signal having a sequence of values. Apparatus 200 also includes a speech decoder 220 configured to calculate a decoded frame of the speech signal based on the value of the control signal and based on a corresponding encoded frame of the encoded speech signal.
A communication device, such as a cellular telephone, comprising apparatus 200 may be configured to receive encoded voice signals from a wired, wireless, or optical transmission channel. Such a device may be configured to perform pre-processing operations on the encoded speech signal, such as decoding of error correction and/or redundant codes. This apparatus may also include an implementation of both apparatus 100 and apparatus 200 (e.g., in a transceiver).
Control logic 210 is configured to generate a control signal comprising a sequence of values that is based on a coding index of an encoded frame of an encoded speech signal. Each value in the sequence corresponds to an encoded frame of the encoded speech signal (except in the case of erased frames as discussed below) and has one of a plurality of states. In some embodiments of the apparatus 200 as described below, the sequence is binary-valued (i.e., a sequence of high and low values). In other implementations of the apparatus 200 as described below, the sequence of values may have more than two states.
Control logic 210 may be configured to determine a coding index for each encoded frame. For example, control logic 210 may be configured to read at least a portion of an encoding index from an encoded frame, determine a bit rate for the encoded frame from one or more parameters (e.g., frame energy), and/or determine an appropriate encoding mode from a format of the encoded frame. Alternatively, apparatus 200 may be implemented to include another element configured to determine and provide the code index for each encoded frame to control logic 210, or apparatus 200 may be configured to receive the code index from another module of a device that includes apparatus 200.
An encoded frame that is not received as expected or that is received with too many errors to recover is referred to as a frame erasure. The apparatus 200 may be configured such that one or more states of the encoding index are used to indicate a frame erasure or a partial frame erasure, such as an absence of a portion of the encoded frame that carries spectral and temporal information for the second frequency band. For example, apparatus 200 may be configured such that the encoding index of an encoded frame that has been encoded using encoding scheme 2 indicates erasure of the high-band portion of the frame.
The speech decoder 220 is configured to calculate a decoded frame based on a value of the control signal and a corresponding encoded frame of the encoded speech signal. When the value of the control signal has a first state, decoder 220 computes a decoded frame based on a description of the spectral envelope over the first and second frequency bands, where the description is based on information from the corresponding encoded frame. When the value of the control signal has the second state, decoder 220 retrieves a description of the spectral envelope over the second frequency band and computes a decoded frame based on the retrieved description and based on the description of the spectral envelope over the first frequency band, wherein the description over the first frequency band is based on information from the corresponding encoded frame.
FIG. 32B shows a block diagram of an implementation 202 of apparatus 200. The apparatus 202 includes an implementation 222 of a speech decoder 220 that includes a first module 230 and a second module 240. Modules 230 and 240 are configured to calculate respective subband portions of a decoded frame. Specifically, the first module 230 is configured to calculate a decoded portion of the frame over a first frequency band (e.g., a narrowband signal), and the second module 240 is configured to calculate a decoded portion of the frame over a second frequency band (e.g., a highband signal) based on a value of the control signal.
FIG. 32C shows a block diagram of an implementation 204 of apparatus 200. Parser 250 is configured to parse the bits of the encoded frame so as to provide the coding index to control logic 210 and at least one description of a spectral envelope to speech decoder 220. In this example, apparatus 204 is also an implementation of apparatus 202, such that parser 250 is configured to provide modules 230 and 240 with descriptions of the spectral envelope over the respective frequency bands (when available). Parser 250 may also be configured to provide at least one description of temporal information to speech decoder 220. For example, parser 250 may be implemented to provide modules 230 and 240 with descriptions of the temporal information for the respective frequency bands (when available).
Apparatus 204 also includes a filter bank 260 that is configured to combine the decoded portions of the frame over the first and second frequency bands to produce a wideband speech signal. Specific examples of such filter banks are described in, for example, U.S. Patent Application Publication No. 2007/088558 (Vos et al.), entitled "systems, methods, and apparatus for speech signal filtering," published April 19, 2007. For example, filter bank 260 may include a low-pass filter configured to filter the narrowband signal to produce a first passband signal and a high-pass filter configured to filter the highband signal to produce a second passband signal. Filter bank 260 may also include an upsampler configured to increase the sampling rate of the narrowband signal and/or of the highband signal according to a desired corresponding interpolation factor, as described in, for example, U.S. Patent Application Publication No. 2007/088558 (Vos et al.).
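A rough stand-in for such a filter bank, assuming 8-kHz subband signals combined into a 16-kHz output, might be sketched as follows. The cutoff frequency and filter order are assumed values, and any spectral translation of the high-band signal that a real subband codec performs is omitted:

```python
import numpy as np
from scipy.signal import firwin, lfilter, resample_poly

def combine_subbands(narrowband, highband, fs_out=16000):
    """Upsample the two decoded subband signals to the output rate, confine
    each to its passband, and sum them into a wideband signal."""
    nb = resample_poly(np.asarray(narrowband, dtype=float), 2, 1)  # 8->16 kHz
    hb = resample_poly(np.asarray(highband, dtype=float), 2, 1)
    low_pass = firwin(101, 3500.0, fs=fs_out)                    # first band
    high_pass = firwin(101, 3500.0, fs=fs_out, pass_zero=False)  # second band
    n = min(len(nb), len(hb))
    return lfilter(low_pass, 1.0, nb[:n]) + lfilter(high_pass, 1.0, hb[:n])
```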
Fig. 33A shows a block diagram of an implementation 232 of the first module 230, the implementation 232 comprising an instance 270a of a spectral envelope description decoder 270 and an instance 280a of a temporal information description decoder 280. The spectral envelope description decoder 270a is configured to decode a description of a spectral envelope over a first frequency band (e.g., as received from the parser 250). The temporal information description decoder 280a is configured to decode a description of temporal information for the first frequency band (e.g., as received from the parser 250). For example, the time information description decoder 280a may be configured to decode the excitation signal for the first frequency band. An example 290a of synthesis filter 290 is configured to generate a decoded portion of a frame (e.g., a narrowband signal) over a first frequency band that is based on a decoded description of a spectral envelope and temporal information. For example, synthesis filter 290a may be configured to generate a decoded portion in response to an excitation signal for a first frequency band according to a set of values (e.g., one or more LSP or LPC coefficient vectors) within a description of a spectral envelope over the first frequency band.
Fig. 33B shows a block diagram of an implementation 272 of a spectral envelope description decoder 270. Dequantizer 310 is configured to dequantize the description, and inverse transform block 320 is configured to apply an inverse transform to the dequantized description in order to obtain a set of LPC coefficients. The temporal information description decoder 280 is also typically configured to include a dequantizer.
FIG. 34A shows a block diagram of an implementation 242 of the second module 240. The second module 242 comprises an instance 270b of the spectral envelope description decoder 270, the buffer 300 and the selector 340. The spectral envelope description decoder 270b is configured to decode a description of a spectral envelope over a second frequency band (e.g., as received from the parser 250). Buffer 300 is configured to store one or more descriptions of spectral envelopes over the second frequency band as reference spectral information, and selector 340 is configured to select a decoded description of a spectral envelope from either (a) buffer 300 or (B) decoder 270B, depending on the state of a corresponding value of a control signal generated by control logic 210.
The second module 242 also includes a high-band excitation signal generator 330 and an instance 290b of synthesis filter 290, which is configured to produce a decoded portion of the frame over the second frequency band (e.g., the highband signal) based on the decoded description of the spectral envelope received via selector 340. The high-band excitation signal generator 330 is configured to generate an excitation signal for the second frequency band that is based on the excitation signal for the first frequency band (e.g., as generated by temporal information description decoder 280a). Additionally or alternatively, generator 330 may be configured to perform spectral and/or amplitude shaping of random noise to generate the high-band excitation signal. Generator 330 may be implemented as an instance of high-band excitation signal generator A60 as described above. Synthesis filter 290b is configured, according to a set of values within the description of the spectral envelope over the second frequency band (e.g., one or more LSP or LPC coefficient vectors), to produce the decoded portion of the frame over the second frequency band in response to the highband excitation signal.
In one example of an implementation of the apparatus 202 that includes the implementation 242 of the second module 240, the control logic 210 is configured to output a binary signal to the selector 340, such that each value in the sequence has either state A or state B. In this case, if the coding index of the current frame indicates that the frame is inactive, control logic 210 generates a value having state A, which causes selector 340 to select the output of buffer 300 (i.e., selection A). Otherwise, control logic 210 generates a value having state B, which causes selector 340 to select the output of decoder 270b (i.e., selection B).
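This control-signal-driven selection might be expressed as follows; the set of coding indices that mark a frame inactive is illustrative:

```python
STATE_A, STATE_B = "A", "B"

def control_value(coding_index, inactive_indices=frozenset({3})):
    """Control logic 210: state A for a coding index that marks the current
    frame inactive, state B otherwise (the index set is illustrative)."""
    return STATE_A if coding_index in inactive_indices else STATE_B

def select_description(state, buffer_output, decoder_output):
    """Selector 340: state A selects the buffered reference description,
    state B selects the output of spectral envelope decoder 270b."""
    return buffer_output if state == STATE_A else decoder_output
```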
The apparatus 202 may be arranged such that the control logic 210 controls the operation of the buffer 300. For example, the buffer 300 may be arranged such that a value of the control signal having state B causes the buffer 300 to store the corresponding output of the decoder 270B. This control may be implemented by applying a control signal to a write enable input of buffer 300, where the input is configured such that state B corresponds to its active state. Alternatively, control logic 210 may be implemented to generate a second control signal to control the operation of buffer 300 that also includes a sequence of values that is based on the coding index of the encoded frame of the encoded speech signal.
FIG. 34B shows a block diagram of an implementation 244 of the second module 240. The second module 244 includes a spectral envelope description decoder 270b and an instance 280b of a temporal information description decoder 280, the instance 280b configured to decode a description of temporal information for the second frequency band (e.g., as received from the parser 250). The second module 244 also includes an implementation 302 of the buffer 300 that is also configured to store one or more descriptions of time information on the second frequency band as reference time information.
The second module 244 includes an implementation 342 of the selector 340 that is configured to select a decoded description of the spectral envelope and a decoded description of the temporal information from either (a) the buffer 302 or (B) the decoders 270B, 280B according to a state of a corresponding value of a control signal generated by the control logic 210. An example 290b of the synthesis filter 290 is configured to produce a decoded portion of the frame on a second frequency band (e.g., a highband signal) based on the decoded description of the spectral envelope and temporal information received via the selector 342. In a typical implementation of the apparatus 202 that includes the second module 244, the temporal information description decoder 280b is configured to generate a decoded description of the temporal information that includes the excitation signal for the second frequency band, and the synthesis filter 290b is configured to generate a decoded portion of the frame over the second frequency band in response to the excitation signal according to a set of values (e.g., one or more LSP or LPC coefficient vectors) within the description of the spectral envelope over the second frequency band.
FIG. 34C shows a block diagram of an implementation 246 of the second module 242 that includes buffer 302 and selector 342. The second module 246 also includes an instance 280c of temporal information description decoder 280, configured to decode a description of a temporal envelope for the second frequency band, and a gain control element 350 (e.g., a multiplier or amplifier) configured to apply the description of the temporal envelope received via selector 342 to the decoded portion of the frame over the second frequency band. For the case where the decoded description of the temporal envelope includes gain shape values, gain control element 350 may include logic configured to apply the gain shape values to the respective subframes of the decoded portion.
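Gain control element 350's application of a temporal envelope might be sketched as follows, assuming equal subframe lengths (uneven splits fall back to numpy's array_split behavior):

```python
import numpy as np

def apply_temporal_envelope(band2_signal, gain_shape, gain_frame=1.0):
    """Scale each subframe of the decoded second-band signal by the
    corresponding gain shape value, then by the overall gain frame value."""
    subframes = np.array_split(np.asarray(band2_signal, dtype=float),
                               len(gain_shape))
    shaped = [g * sf for g, sf in zip(gain_shape, subframes)]
    return gain_frame * np.concatenate(shaped)
```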
FIGS. 34A-34C show implementations of the second module 240 in which buffer 300 receives fully decoded descriptions of spectral envelopes (and, in some cases, of temporal information). Similar implementations may be arranged such that buffer 300 receives descriptions that are not fully decoded. For example, it may be desirable to reduce storage requirements by storing the descriptions in quantized form (e.g., as received from parser 250). In such cases, the signal path from buffer 300 to selector 340 may be configured to include decoding logic, such as a dequantizer and/or an inverse transform block.
FIG. 35A shows a state diagram according to which an implementation of control logic 210 may be configured to operate. In this figure, the path labels indicate the frame type associated with the coding scheme of the current frame, where A indicates a coding scheme used only for active frames, I indicates a coding scheme used only for inactive frames, and M (for "hybrid") indicates a coding scheme used for both active and inactive frames. For example, such a decoder may be included in a coding system that uses the set of encoding schemes shown in FIG. 18, where schemes 1, 2, and 3 correspond to path labels A, M, and I, respectively. The state labels in FIG. 35A indicate the states of the corresponding values of the control signal.
As mentioned above, the apparatus 202 may be arranged such that the control logic 210 controls the operation of the buffer 300. For the case where apparatus 202 is configured to perform the operation of storing reference spectrum information in two parts, control logic 210 may be configured to control buffer 300 to perform a selected one of three different tasks: (1) temporarily storing information based on the encoded frame; (2) completing the storing of the temporarily stored information as reference spectrum and/or time information; and (3) outputting the stored reference spectrum and/or time information.
In one such example, control logic 210 is implemented to generate a control signal that controls the operation of selector 340 and buffer 300 and whose value has at least four possible states, each corresponding to a respective state of the diagram shown in FIG. 35A. In another such example, control logic 210 is implemented to generate: (1) a control signal, to control the operation of the selector 340, whose value has at least two possible states; and (2) a second control signal, to control the operation of buffer 300, that includes a sequence of values based on the coding indices of the encoded frames of the encoded speech signal and whose value has at least three possible states.
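The C sketch below is a speculative illustration of such control logic: it maps the coding scheme of each encoded frame to one selector command and one buffer command, exercising the three buffer tasks listed above. It does not reproduce the exact state diagram of FIG. 35A, and every name in it is invented for the example.

```c
typedef enum { SCHEME_A, SCHEME_M, SCHEME_I } scheme_t;   /* path labels A, M, I */
typedef enum { SEL_FROM_DECODERS, SEL_FROM_BUFFER } sel_cmd_t;
typedef enum { BUF_IDLE, BUF_STORE_TEMP, BUF_COMMIT, BUF_OUTPUT } buf_cmd_t;

typedef struct { int pending; } ctl_state_t;   /* scheme-M info awaiting commit */

void control_step(ctl_state_t *st, scheme_t scheme,
                  sel_cmd_t *sel, buf_cmd_t *buf)
{
    switch (scheme) {
    case SCHEME_A:                     /* active frame: decode both bands */
        *sel = SEL_FROM_DECODERS;
        *buf = BUF_IDLE;
        st->pending = 0;
        break;
    case SCHEME_M:                     /* candidate reference frame */
        *sel = SEL_FROM_DECODERS;
        *buf = BUF_STORE_TEMP;         /* task (1): store tentatively */
        st->pending = 1;
        break;
    case SCHEME_I:                     /* inactive, first band only */
        *sel = SEL_FROM_BUFFER;        /* highband taken from stored reference */
        if (st->pending) {
            *buf = BUF_COMMIT;         /* task (2): finalize as reference info;
                                          the committed info is also read out
                                          this frame (see the timing note below) */
            st->pending = 0;
        } else {
            *buf = BUF_OUTPUT;         /* task (3): output stored reference */
        }
        break;
    }
}
```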
It may be desirable to configure buffer 300 so that, during processing of a frame for which the operation of completing the storage of the temporarily stored information is selected, that information is also available for selection by selector 340. In such a case, control logic 210 may be configured to output the current values of the signals that control selector 340 and buffer 300 at slightly different times. For example, control logic 210 may be configured to control buffer 300 to move its read pointer early enough in the frame period that buffer 300 outputs the temporarily stored information in time for selection by selector 340.
As mentioned above with reference to FIG. 13B, it may sometimes be desirable for a speech encoder performing an implementation of method M100 to use a higher bit rate to encode an inactive frame that is surrounded by other inactive frames. In such a case, the corresponding speech decoder may be required to store information based on that encoded frame as reference spectral and/or temporal information, so that the information can be used to decode subsequent inactive frames in the series.
The various elements of an implementation of apparatus 200 may be embodied in any combination of hardware, software, and/or firmware that is deemed suitable for the intended application. Such elements may be fabricated, for example, as electronic and/or optical devices residing on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such array(s) may be implemented within one or more chips (e.g., within a chipset comprising two or more chips).
One or more elements of various implementations of the apparatus 200 as described herein may also be implemented, in whole or in part, as one or more sets of instructions arranged to be executed on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field programmable gate arrays), ASSPs (application specific standard products), and ASICs (application specific integrated circuits). Any of the various elements of an implementation of apparatus 200 may also be embodied as one or more computers (e.g., a machine comprising one or more arrays programmed to execute one or more sets or sequences of instructions, also referred to as a "processor"), and any two or more, or even all, of these elements may be implemented within the same such computer(s).
Various elements of an implementation of apparatus 200 may be included within a device for wireless communication, such as a cellular telephone or other device having such communication capabilities. Such a device may be configured to communicate with a circuit-switched and/or packet-switched network (e.g., using one or more protocols such as VoIP). Such a device may be configured to perform operations on a signal carrying an encoded frame, such as deinterleaving, depuncturing, decoding of one or more convolutional codes, decoding of one or more error correction codes, decoding of one or more network protocol (e.g., ethernet, TCP/IP, cdma2000) layers, Radio Frequency (RF) demodulation, and/or RF reception.
It may be possible for one or more elements of an implementation of apparatus 200 to be used to perform tasks or other sets of instructions that are not directly related to the operation of the apparatus, such as tasks related to another operation of a device or system in which the apparatus is embedded. It is also possible for one or more elements of an implementation of apparatus 200 to have a common structure (e.g., a processor to execute portions of code corresponding to different elements at different times, a set of instructions executed to perform tasks corresponding to different elements at different times, or an arrangement of electronic and/or optical devices performing operations of different elements at different times). In one such example, control logic 210, first module 230, and second module 240 are implemented as sets of instructions arranged to execute on the same processor. In another such example, spectral envelope description decoders 270a and 270b are implemented as the same set of instructions executed at different times.
A device for wireless communication, such as a cellular telephone or other device having such communication capabilities, may be configured to include implementations of both apparatus 100 and apparatus 200. In this case, it is possible to make the apparatus 100 and the apparatus 200 have a common structure. In one such example, apparatus 100 and apparatus 200 are implemented to comprise sets of instructions arranged to execute on the same processor.
At any time during a full-duplex telephone communication, it can be expected that the input to at least one of the vocoders will be an inactive frame. It may be desirable to configure a speech encoder to transmit encoded frames for fewer than all of a series of inactive frames. Such operation is also referred to as discontinuous transmission (DTX). In one example, the speech encoder performs DTX by transmitting one encoded frame (also referred to as a "silence descriptor" or SID) for each string of n consecutive inactive frames, where n is 32. The corresponding decoder applies the information in the SID to update a noise generation model used by a comfort noise generation algorithm to synthesize the inactive frames. Other typical values of n include 8 and 16. Other names used in the art for a SID include "silence description update", "silence insertion descriptor", "comfort noise descriptor frame", and "comfort noise parameters".
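A minimal sketch of that transmit decision, assuming only that frames have already been classified as active or inactive (the names and structure are illustrative, not taken from any cited codec):

```c
#include <stdbool.h>

#define DTX_INTERVAL 32   /* n: one SID per n consecutive inactive frames */

/* Returns true if the current frame should be transmitted: every active
 * frame is sent, and one SID is sent per run of DTX_INTERVAL inactive
 * frames; all other inactive frames are suppressed. */
bool dtx_should_transmit(bool frame_is_active, unsigned *inactive_run)
{
    if (frame_is_active) {
        *inactive_run = 0;
        return true;
    }
    return ((*inactive_run)++ % DTX_INTERVAL) == 0;
}
```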
It may be appreciated that, in an implementation of method M200, the reference encoded frame is similar to a SID in that both provide an occasional update to a description of silence, here for the highband portion of the speech signal. Although the potential advantages of DTX are generally greater in packet-switched networks than in circuit-switched networks, it is expressly noted that methods M100 and M200 are applicable to both circuit-switched and packet-switched networks.
Implementations of method M100 may be combined with DTX (e.g., in a packet-switched network) such that encoded frames are transmitted for fewer than all of the inactive frames. A speech encoder performing such a method may be configured to transmit a SID at some regular interval (e.g., every eighth, sixteenth, or thirty-second frame in a series of inactive frames) or upon some event. FIG. 35B shows an example in which a SID is transmitted every sixth frame. In this case, the SID includes a description of the spectral envelope over the first frequency band.
A corresponding implementation of method M200 may be configured to generate a frame based on the reference spectral information in response to a failure to receive an encoded frame during a frame period following an inactive frame. As shown in FIG. 35B, such an implementation of method M200 may be configured to obtain, for each intervening inactive frame, a description of the spectral envelope over the first frequency band based on information from one or more received SIDs. For example, such an operation may include interpolation between the descriptions of spectral envelopes from the two most recent SIDs, as in the examples shown in FIGS. 30A-30C. For the second frequency band, the method may be configured to obtain a description of the spectral envelope (and possibly a description of the temporal envelope) for each intervening inactive frame based on information from one or more recent reference encoded frames (e.g., according to any of the examples described herein). Such a method may also be configured to generate an excitation signal for the second frequency band that is based on an excitation signal for the first frequency band obtained from the one or more recent SIDs.
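For the first-band operation just mentioned, one plausible realization is simple linear interpolation between the spectral parameter vectors of the two most recent SIDs, sketched below. The parameter order and all names are assumptions, and the text does not mandate this exact form.

```c
#define SPEC_ORDER 10   /* spectral parameter vector length (assumed) */

/* Linear interpolation between the envelope descriptions of the two most
 * recent SIDs; alpha in [0,1] gives the position of the intervening
 * inactive frame between them. */
void interp_envelope(const float sid_a[SPEC_ORDER],
                     const float sid_b[SPEC_ORDER],
                     float alpha, float out[SPEC_ORDER])
{
    for (int i = 0; i < SPEC_ORDER; ++i)
        out[i] = (1.0f - alpha) * sid_a[i] + alpha * sid_b[i];
}
```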
The foregoing presentation of the described configurations is provided to enable any person skilled in the art to make or use the methods and other structures disclosed herein. The flow diagrams, block diagrams, state diagrams, and other structures shown and described herein are examples only, and other variants of these structures are also within the scope of the invention. Various modifications to these configurations are possible, and the generic principles presented herein may be applied to other configurations as well. For example, the various elements and tasks described herein for processing a highband portion of a speech signal, which includes frequencies above the range of a narrowband portion of the speech signal, may alternatively or additionally, and in a similar manner, be applied to processing a lowband portion of the speech signal, which includes frequencies below the range of the narrowband portion. In such a case, the disclosed techniques and structures for deriving a highband excitation signal from a narrowband excitation signal may be used to derive a lowband excitation signal from the narrowband excitation signal. Thus, the present invention is not intended to be limited to the configurations shown above but is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein, including in the appended claims as filed, which form a part of the original disclosure.
Examples of codecs that may be used with, or adapted for use with, speech encoders, speech encoding methods, speech decoders, and/or speech decoding methods as described herein include: the Enhanced Variable Rate Codec (EVRC), as described in document 3GPP2 C.S0014-C, version 1.0, "Enhanced Variable Rate Codec, Speech Service Options 3, 68, and 70 for Wideband Spread Spectrum Digital Systems" (Third Generation Partnership Project 2, Arlington, VA, January 2007); the Adaptive Multi-Rate (AMR) speech codec, as described in document ETSI TS 126 092 V6.0.0 (European Telecommunications Standards Institute (ETSI), Sophia Antipolis Cedex, France, December 2004); and the AMR Wideband speech codec, as described in document ETSI TS 126 192 V6.0.0 (ETSI, December 2004).
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, and symbols that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof. While the signal from which the encoded frame is derived is referred to as a "speech signal," it is also contemplated and thus disclosed that such a signal may carry music or other non-speech information content during active frames.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and operations described in connection with the configurations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Such logic blocks, modules, circuits, and operations may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The tasks of the methods and algorithms described herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An illustrative storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
Each of the configurations described herein may be implemented, at least in part, as a hardwired circuit, a circuit configuration manufactured into an application specific integrated circuit, or a firmware program loaded into non-volatile storage or a software program loaded from or into a data storage medium as machine-readable code (such code being instructions executable by an array of logic elements, such as a microprocessor or other digital signal processing unit). The data storage medium may be an array of storage elements such as semiconductor memory (which may include, but is not limited to, dynamic or static RAM (random access memory), ROM (read only memory), and/or flash RAM) or ferroelectric, magnetoresistive, ovonic, polymeric, or phase change memory; or a disk medium such as a magnetic disk or an optical disk. The term "software" should be understood to include source code, assembly language code, machine code, binary code, firmware, macrocode, microcode, any one or more sets or sequences of instructions executable by an array of logic elements, and any combination of such examples.
Claims (34)
1. An apparatus for encoding frames of a speech signal, the apparatus comprising:
a voice activity detector configured to indicate, for each of a plurality of frames of the speech signal, whether the frame is active or inactive;
a coding scheme selector configured to
(A) selecting a first coding scheme in response to an indication by the voice activity detector for a first frame of the speech signal,
(B) selecting a second coding scheme for a second frame that is one of a consecutive series of inactive frames occurring after the first frame, in response to an indication by the voice activity detector that the second frame is inactive, and
(C) selecting a third coding scheme for a third frame that follows the second frame in the speech signal and that is another of the consecutive series of inactive frames occurring after the first frame, in response to an indication by the voice activity detector that the third frame is inactive; and a speech encoder configured to
(D) generating, according to the first coding scheme, a first encoded frame that is based on the first frame and has a length of p bits, wherein p is a non-zero positive integer,
(E) generating, according to the second coding scheme, a second encoded frame that is based on the second frame and has a length of q bits, wherein q is a non-zero positive integer other than p, and
(F) generating, according to the third coding scheme, a third encoded frame that is based on the third frame and has a length of r bits, wherein r is a non-zero positive integer less than q,
wherein the speech encoder is configured to generate the second encoded frame to include (a) a description of a spectral envelope of a portion of the speech signal that includes the second frame over a first frequency band and (b) a description of a spectral envelope of a portion of the speech signal that includes the second frame over a second frequency band different from the first frequency band.
2. The apparatus according to claim 1, wherein in the speech signal at least one frame occurs between the first frame and the second frame.
3. The apparatus according to claim 1, wherein said speech encoder is configured to produce the third encoded frame (a) to include a description of a spectral envelope over the first frequency band and (b) to not include a description of a spectral envelope over the second frequency band.
4. The apparatus according to claim 1, wherein said speech encoder is configured to produce the third encoded frame to include a description of a spectral envelope of a portion of the speech signal that includes the third frame.
5. A method of processing an encoded speech signal, the method comprising:
obtaining, based on information from a first encoded frame of the encoded speech signal, a description of a spectral envelope of the first frame of the speech signal over (A) a first frequency band and (B) a second frequency band different from the first frequency band;
obtaining, based on information from a second encoded frame of the encoded speech signal, a description of a spectral envelope of the second frame of the speech signal over the first frequency band, wherein the first frame and the second frame are inactive frames; and
obtaining, based on information from the first encoded frame, a description of a spectral envelope of the second frame over the second frequency band;
wherein the second frame occurs after the first frame, wherein the first frame and the second frame are not consecutive frames of the speech signal, and wherein all frames of the speech signal between the first frame and the second frame are inactive frames.
6. The method of processing an encoded speech signal according to claim 5, wherein said obtaining a description of a spectral envelope, over the first frequency band, of a second frame of the speech signal is based, at least predominantly, on information from the second encoded frame.
7. The method of processing an encoded speech signal according to claim 5, wherein said obtaining a description of a spectral envelope of the second frame over the second frequency band is based, at least predominantly, on information from the first encoded frame.
8. The method of processing an encoded speech signal according to claim 5, wherein said description of a spectral envelope of a first frame comprises a description of a spectral envelope of the first frame over the first frequency band and a description of a spectral envelope of the first frame over the second frequency band.
9. The method of processing an encoded speech signal according to claim 5, wherein said information from which a description of a spectral envelope of the second frame over the second frequency band is obtained comprises said description of a spectral envelope of the first frame over the second frequency band.
10. The method of processing an encoded speech signal according to claim 5, wherein the first encoded frame is encoded according to a wideband encoding scheme, and wherein the second encoded frame is encoded according to a narrowband encoding scheme.
11. The method of processing an encoded speech signal according to claim 5, wherein the length in bits of the first encoded frame is at least twice the length in bits of the second encoded frame.
12. The method of processing an encoded speech signal according to claim 5, said method comprising calculating the second frame based on the description of the spectral envelope of the second frame over the first frequency band, the description of the spectral envelope of the second frame over the second frequency band, and an excitation signal that is based, at least predominantly, on a random noise signal.
13. The method of processing an encoded speech signal according to claim 5, wherein said obtaining a description of a spectral envelope of the second frame over the second frequency band is based on information from a third encoded frame of the encoded speech signal, wherein both the first and third encoded frames occur in the encoded speech signal before the second encoded frame.
14. The method of processing an encoded speech signal according to claim 13, wherein said information from a third encoded frame comprises a description of a spectral envelope, over the second frequency band, of a third frame of the speech signal.
15. The method of processing an encoded speech signal according to claim 13, wherein said description of the spectral envelope of the first frame over the second frequency band comprises a vector of spectral parameter values, and
wherein the description of the spectral envelope of the third frame over the second frequency band comprises a vector of spectral parameter values, and
wherein the obtaining a description of a spectral envelope of the second frame over the second frequency band comprises calculating a vector of spectral parameter values of the second frame as a function of the vector of spectral parameter values of the first frame and the vector of spectral parameter values of the third frame.
16. The method of processing an encoded speech signal according to claim 13, said method comprising:
in response to detecting that an encoding index of the first encoded frame meets at least one predetermined criterion, storing the information from the first encoded frame from which the description of the spectral envelope of the second frame over the second frequency band is obtained;
in response to detecting that an encoding index of the third encoded frame meets at least one predetermined criterion, storing the information from the third encoded frame from which the description of the spectral envelope of the second frame over the second frequency band was obtained; and
in response to detecting that an encoding index of the second encoded frame satisfies at least one predetermined criterion, retrieving the stored information from the first encoded frame and the stored information from the third encoded frame.
17. The method of processing an encoded speech signal according to claim 5, said method comprising, for each of a plurality of frames of the speech signal that follow the second frame, obtaining a description of a spectral envelope of the frame over the second frequency band, wherein said description is based on information from the first encoded frame.
18. The method of processing an encoded speech signal according to claim 5, said method comprising, for each of a plurality of frames of the speech signal that follow the second frame: (C) obtaining a description of a spectral envelope of the frame over the second frequency band, wherein the description is based on information from the first encoded frame; and (D) obtaining a description of a spectral envelope of the frame over the first frequency band, wherein the description is based on information from the second encoded frame.
19. The method of processing an encoded speech signal according to claim 5, said method comprising obtaining an excitation signal, over the second frequency band, for the second frame based on an excitation signal, over the first frequency band, for the second frame.
20. The method of processing an encoded speech signal according to claim 5, said method comprising obtaining, based on information from the first encoded frame, a description of temporal information for the second frequency band for the second frame.
21. The method of processing an encoded speech signal according to claim 5, wherein said description of temporal information of the second frame comprises a description of a temporal envelope of the second frame for the second frequency band.
22. An apparatus for processing an encoded speech signal, the apparatus comprising:
means for obtaining, based on information from a first encoded frame of the encoded speech signal, a description of a spectral envelope of the first frame of the speech signal over (A) a first frequency band and (B) a second frequency band different from the first frequency band;
means for obtaining, based on information from a second encoded frame of the encoded speech signal, a description of a spectral envelope of the second frame of the speech signal over the first frequency band, wherein the first frame and the second frame are inactive frames; and
means for obtaining a description of a spectral envelope of the second frame over the second frequency band based on information from the first encoded frame;
wherein the second frame occurs after the first frame, wherein the first frame and the second frame are not consecutive frames of the speech signal, and wherein all frames of the speech signal between the first frame and the second frame are inactive frames.
23. The apparatus for processing an encoded speech signal according to claim 22, wherein said description of a spectral envelope of a first frame includes a description of a spectral envelope of the first frame over the first frequency band and a description of a spectral envelope of the first frame over the second frequency band, and
wherein the means for obtaining a description of a spectral envelope of the second frame over the second frequency band is configured such that the information on which that description is based comprises the description of a spectral envelope of the first frame over the second frequency band.
24. The apparatus for processing an encoded speech signal according to claim 22, wherein said means for obtaining a description of a spectral envelope of the second frame over the second frequency band is configured to obtain the description based on information from a third encoded frame of the encoded speech signal, wherein both the first and third encoded frames occur in the encoded speech signal before the second encoded frame, and
wherein the information from the third encoded frame comprises a description of a spectral envelope of a third frame of the speech signal over the second frequency band.
25. The apparatus for processing an encoded speech signal according to claim 22, said apparatus comprising means for obtaining, for each of a plurality of frames of the speech signal that follow the second frame, a description of a spectral envelope of the frame over the second frequency band, said description being based on information from the first encoded frame.
26. The apparatus for processing an encoded speech signal according to claim 22, said apparatus comprising:
means for obtaining, for each of a plurality of frames of the speech signal that follow the second frame, a description of a spectral envelope of the frame over the second frequency band, the description being based on information from the first encoded frame; and
means for obtaining, for each of the plurality of frames, a description of a spectral envelope of the frame over the first frequency band, the description being based on information from the second encoded frame.
27. The apparatus for processing an encoded speech signal according to claim 22, said apparatus comprising means for obtaining an excitation signal, over the second frequency band, for the second frame based on an excitation signal, over the first frequency band, for the second frame.
28. The apparatus for processing an encoded speech signal according to claim 22, said apparatus comprising means for obtaining, based on information from the first encoded frame, a description of temporal information for the second frequency band for the second frame,
wherein the description of temporal information of the second frame comprises a description of a temporal envelope of the second frame for the second frequency band.
29. An apparatus for processing an encoded speech signal, the apparatus comprising:
control logic configured to generate a control signal comprising a sequence of values that is based on an encoding index of an encoded frame of the encoded speech signal, each value in the sequence corresponding to an encoded frame of the encoded speech signal; and
a speech decoder configured to (A) calculate, in response to a value of the control signal having a first state, a decoded frame based on the following description: a description of a spectral envelope over the first and second frequency bands, the description being based on information from a corresponding encoded frame, and (B) calculate, in response to a value of the control signal having a second state different from the first state, a decoded frame based on the following descriptions: (1) a description of a spectral envelope over the first frequency band, the description being based on information from a corresponding encoded frame, and (2) a description of a spectral envelope over the second frequency band, the description being based on information from at least one encoded frame that occurs in the encoded speech signal before the corresponding encoded frame;
wherein the corresponding encoded frame and the at least one encoded frame are inactive frames, and wherein all encoded frames of the encoded speech signal between the corresponding encoded frame and the at least one encoded frame are inactive frames.
30. The apparatus for processing an encoded speech signal according to claim 29, wherein said speech decoder is configured to, in response to a value of the control signal having the second state, calculate the decoded frame based on a description of the spectral envelope over the second frequency band that is based on information from each of at least two encoded frames that occur in the encoded speech signal before the corresponding encoded frame.
31. The apparatus for processing an encoded speech signal according to claim 29, wherein said control logic is configured to, in response to failing to receive an encoded frame within a corresponding frame period, generate a value of the control signal that has a third state different from the first and second states, and
wherein the speech decoder is configured to (C) calculate, in response to a value of the control signal having the third state, a decoded frame based on the following description: (1) a description of a spectral envelope of the frame over the first frequency band, the description based on information from a most recently received encoded frame; and (2) a description of a spectral envelope of the frame over the second frequency band, the description being based on information from an encoded frame that occurs in the encoded speech signal prior to the most recently received encoded frame.
32. The apparatus for processing an encoded speech signal according to claim 29, wherein said speech decoder is configured to calculate, in response to a value of the control signal having the second state and based on an excitation signal of the decoded frame that is on the first frequency band, an excitation signal of the decoded frame that is on the second frequency band.
33. The apparatus for processing an encoded speech signal according to claim 29, wherein said speech decoder is configured to, in response to a value of the control signal having the second state, calculate the decoded frame based on a description of a temporal envelope for the second frequency band, the description being based on information from at least one encoded frame that occurs in the encoded speech signal before a corresponding encoded frame.
34. The apparatus for processing an encoded speech signal according to claim 29, wherein said speech decoder is configured to, in response to a value of the control signal having the second state, calculate the decoded frame based on an excitation signal that is based, at least predominantly, on a random noise signal.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US83468806P | 2006-07-31 | 2006-07-31 | |
| US60/834,688 | 2006-07-31 | | |
| US11/830,812 US8260609B2 (en) | 2006-07-31 | 2007-07-30 | Systems, methods, and apparatus for wideband encoding and decoding of inactive frames |
| US11/830,812 | 2007-07-30 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1184589A1 (en) | 2014-01-24 |
| HK1184589B (en) | 2016-10-14 |