
CN106024017A - Voice detection method and device - Google Patents


Info

Publication number
CN106024017A
Authority
CN
China
Prior art keywords
cepstrum
speech detection
frame
voiced frame
voiced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510119374.XA
Other languages
Chinese (zh)
Inventor
孙廷玮
林福辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd
Priority to CN201510119374.XA
Publication of CN106024017A
Legal status: Pending


Abstract

The invention discloses a speech detection method and device. The method comprises: dividing a collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames; windowing the obtained sound frames; converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame; converting the obtained spectra to the cepstral domain to obtain the corresponding cepstra; calculating the cepstrum distance between the cepstra of two adjacent sound frames; and performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold. This scheme can save speech detection time.

Description

Speech detection method and device
Technical field
The present invention relates to the technical field of speech detection, and in particular to a speech detection method and device.
Background
A mobile terminal is a computing device that can be used while on the move. Broadly, the term covers mobile phones, notebooks, tablet computers, POS terminals, vehicle-mounted computers and the like. With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capabilities and have changed from simple calling tools into integrated information processing platforms, which gives them broader room for development.
Using a mobile terminal usually requires a certain amount of the user's attention. Today's mobile terminals are equipped with touch screens, which the user must touch to perform the corresponding operations. However, when the user cannot touch the mobile terminal, operating it becomes very inconvenient, for example while the user is driving a vehicle or carrying articles in hand.
Speech detection methods and always-listening systems (Always Listening System) make it possible to activate and operate a mobile terminal hands-free. When the always-listening system detects a sound signal, the speech detection system is activated and recognizes the detected sound signal, after which the mobile terminal performs the corresponding operation according to the recognized sound signal. For example, when the user says "dial XX's mobile phone", the mobile terminal recognizes the voice information "dial XX's mobile phone" and, after correct recognition, obtains XX's mobile phone number from the mobile terminal and dials it.
However, when a prior-art speech detection method is applied in an always-listening system, it must remain switched on at all times to detect the user's voice activity, and is therefore time-consuming.
Summary of the invention
The problem solved by the embodiments of the present invention is how to save time when performing speech detection.
To solve the above problem, an embodiment of the present invention provides a speech detection method, which includes:
dividing the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames;
windowing the obtained sound frames;
converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;
converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;
calculating the cepstrum distance between the cepstra of two adjacent sound frames;
performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
Optionally, converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame includes: performing a fast Fourier transform on the windowed sound frames to obtain the spectrum corresponding to each sound frame.
Optionally, converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum includes:
$$c = \int_{-\pi}^{\pi} \left( \log S(w) - \alpha \right) \frac{dw}{2\pi}$$
where c denotes the cepstrum coefficient, S(w) denotes the sound frame, and α is a preset correction term.
Optionally, calculating the cepstrum distance between the cepstra of two adjacent sound frames includes:
$$D = \sum_{j=1}^{k} \left| a_j - b_j \right|$$
where D denotes the cepstrum distance, j denotes the index of a sampled frequency point in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames respectively, and k denotes the number of sampled frequency points.
Optionally, the number of sampled frequency points of the sound frame is 32.
Optionally, the duration of the collected sound signal is 200 ms to 1 s.
Optionally, the distance threshold is obtained by pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
An embodiment of the present invention further provides a speech detection device, which includes:
a framing unit, adapted to divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames;
a windowing unit, adapted to window the obtained sound frames;
a frequency-domain conversion unit, adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;
a cepstral-domain conversion unit, adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;
a computing unit, adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames;
a speech detection unit, adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
Optionally, the frequency-domain conversion unit is adapted to perform a fast Fourier transform on the windowed sound frames to obtain the spectrum corresponding to each sound frame.
Optionally, the number of sampled frequency points of the sound frame is 32.
Optionally, the duration of the collected sound signal is 200 ms to 1 s.
Optionally, the distance threshold is obtained by pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
Compared with the prior art, the technical solution of the present invention has the following advantages:
Whether to perform detection on the input sound signal is determined by calculating the cepstrum distance between the cepstra of adjacent sound frames. Because only the cepstrum distance between different sound frames has to be calculated, which is a relatively simple operation, the computing resources and time spent on speech detection can be saved.
Further, because the number of sampled frequency points in each sound frame is 32, better speech detection performance can be obtained while saving computation cost.
Brief description of the drawings
Fig. 1 is a flow chart of a speech detection method in an embodiment of the present invention;
Fig. 2 is a flow chart of another speech detection method in an embodiment of the present invention;
Fig. 3 is a schematic diagram of simulation results for the speech recognition accuracy of the speech detection method in an embodiment of the present invention under different clean speech conditions;
Fig. 4 is a schematic diagram of simulation results for the speech recognition accuracy of a speech detection method using the ITU-T G.729B standard under different clean speech conditions;
Fig. 5 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on a statistical model under different clean speech conditions;
Fig. 6 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on long-term speech information under different clean speech conditions;
Fig. 7 is a schematic diagram of simulation results for the speech recognition accuracy of the speech detection method in an embodiment of the present invention under white noise conditions;
Fig. 8 is a schematic diagram of simulation results for the speech recognition accuracy of a speech detection method using the ITU-T G.729B standard under white noise conditions;
Fig. 9 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on a statistical model under white noise conditions;
Fig. 10 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on long-term speech information under white noise conditions;
Fig. 11 is a schematic structural diagram of a speech detection device in an embodiment of the present invention.
Detailed description
Prior-art always-listening systems use voice activity detection (Voice Activity Detection, VAD) technology to detect sound.
The voice activity detection method most commonly used in the GSM standard updates the background noise during noise intervals. This frequency-domain-based voice activity detection method usually uses a feature vector comprising the linear prediction spectrum, the full-band energy, the low-band (0-1 kHz) energy and the zero-crossing rate. Specifically, after the input sound signal is filtered by a filter bank, the sound level of each frequency band is calculated, and a submodule with a pre-measured result model determines the probability, or determines whether the energy level of the current frame is greater than the stored noise. Such a voice activity detection method usually requires a reliable submodule to update and store the noise model.
To address this problem, there are methods that estimate the noise spectrum by dynamically tracking the power envelope, further improving the above voice activity detection method. One of them is compared with the original voice activity detection method by using receiver operating characteristic curves to check whether the non-speech false alarm rate decreases and the speech hit rate increases under some representative noises and conditions. Another prior-art speech detection method constructs a cumbersome voice activity detection method out of six cumbersome rules.
The above voice activity detection methods can show excellent performance under specific conditions and on specific platforms. However, when they are applied in an always-listening system, the always-listening system must remain switched on at all times to detect the input sound signal, so they consume computing resources and computation time.
To solve the above problems in the prior art, the technical solution adopted by the embodiments of the present invention determines whether to perform speech detection on the input sound signal by calculating the cepstrum distance between the cepstra of adjacent sound frames, which saves computing resources and speech detection time.
To make the above objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of a speech detection method in an embodiment of the present invention. The speech detection method shown in Fig. 1 may include:
Step S101: divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames.
In a specific implementation, to process the collected sound signal, the signal may first be divided into overlapping frames to obtain a plurality of sound frames. Framing the collected sound signal is in essence short-time analysis: the sound signal is divided into short time segments with a fixed period, each of which is a relatively stationary, continuous sound clip.
In a specific implementation, two adjacent sound frames partly overlap, and the extent of the overlap can be selected according to the actual situation.
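The overlapping framing of step S101 can be illustrated with a minimal Python sketch. The function name `split_into_frames`, its parameters, and the choice to drop a trailing partial frame are assumptions made for illustration; the source only states that adjacent frames partly overlap and that the overlap is chosen according to the actual situation.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Split a sequence of samples into overlapping frames.

    Consecutive frames start hop_len samples apart, so adjacent frames
    share frame_len - hop_len samples. A trailing partial frame is
    dropped (an assumption; the source does not say how to handle it).
    """
    if frame_len <= 0 or not 0 < hop_len <= frame_len:
        raise ValueError("need frame_len > 0 and 0 < hop_len <= frame_len")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


# Toy signal: frames of 4 samples with a hop of 2, i.e. a 50% overlap.
signal = list(range(10))
frames = split_into_frames(signal, frame_len=4, hop_len=2)
```

With these toy values, the second half of each frame reappears as the first half of the next frame, which is what a 50% overlap means.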
Step S102: window the obtained sound frames.
In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, a Hanning window or a rectangular window, may be selected. The frame length may be 10 to 40 ms, with a typical value of 20 ms.
In a specific implementation, framing the sound signal destroys the naturalness of the signal. This problem can be addressed by applying windowing and similar processing to the sound frames.
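As a sketch of the windowing in step S102, the following Python code builds a Hamming window and multiplies a frame by it. The coefficients 0.54 and 0.46 are the conventional Hamming definition rather than values given in the source, and the function names are hypothetical.

```python
import math


def hamming_window(n_points):
    """Return n_points coefficients of the conventional Hamming window."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n_points - 1))
            for i in range(n_points)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]


window = hamming_window(256)
windowed = apply_window([1.0] * 256, window)
```

The window tapers both ends of the frame toward 0.08 instead of cutting the signal off abruptly, which reduces the edge discontinuities that framing introduces.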
Step S103: convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame.
In a specific implementation, the collected sound signal in theory varies with time and is a non-stationary process, so it cannot be converted to the frequency domain directly. However, because the collected sound signal has been framed (short-time analysis), the sound signal within each frame can be regarded as relatively stationary, so frequency-domain conversion can be applied. The spectrum corresponding to each obtained sound frame contains the relationship between frequency and energy.
Step S104: convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.
In a specific implementation, each sound frame signal obtained after framing and windowing can be converted to the frequency domain.
In a specific implementation, the cepstrum is the inverse Fourier transform (Inverse Fourier Transform, IFT) of the logarithm of the power spectrum. It turns a complicated convolution relationship into a simple linear superposition, so that the frequency components of each sound frame signal can be identified relatively easily in the cepstrum.
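The definition above (inverse Fourier transform of the log power spectrum) can be sketched in pure Python as follows. This is an illustrative sketch, not the patent's implementation: a naive O(N²) DFT stands in for the FFT, and the `floor` parameter that guards against log(0) is an assumption that does not appear in the source.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def inverse_dft(values):
    """Naive inverse discrete Fourier transform."""
    n = len(values)
    return [sum(values[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def power_cepstrum(frame, floor=1e-12):
    """Inverse transform of the log power spectrum of one real-valued frame."""
    spectrum = dft(frame)
    # floor is an assumed guard so that silent bins do not produce log(0).
    log_power = [math.log(max(abs(x) ** 2, floor)) for x in spectrum]
    # For a real frame the log power spectrum is symmetric, so the inverse
    # transform is real up to rounding error; keep only the real part.
    return [c.real for c in inverse_dft(log_power)]


# A scaled unit impulse has a flat spectrum, so its cepstrum is a single
# value at index 0 equal to log(gain ** 2).
cepstrum = power_cepstrum([2.0] + [0.0] * 7)
```

The impulse example shows the "convolution becomes addition" property in its simplest form: an overall gain only shifts the zeroth cepstrum coefficient and leaves the others at zero.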
Step S105: calculate the cepstrum distance between the cepstra of adjacent sound frames.
In a specific implementation, the cepstrum distance between the cepstra of adjacent sound frames is calculated to determine whether to perform speech detection on the collected sound signal. Compared with the prior-art approach of calculating the spectral energy of adjacent sound frames (the current sound frame and a delayed sound frame with a preset delay), calculating the cepstrum distance between the cepstra of adjacent sound frames reduces the computational complexity and therefore saves computing resources and computation time.
Step S106: when the calculated cepstrum distance is greater than a preset distance threshold, perform speech detection on the collected sound signal.
In a specific implementation, a calculated cepstrum distance greater than the preset distance threshold indicates that the input sound signal contains a speech signal. Speech detection can then be performed on the collected sound signal to recognize the speech signal in it.
Fig. 2 shows a flow chart of another speech detection method in an embodiment of the present invention. The speech detection method shown in Fig. 2 may include:
Step S201: perform framing and windowing on a sound signal of preset duration.
In a specific implementation, the input sound signal may first be divided into overlapping frames to obtain a frame-by-frame signal, where the frame length is chosen as 20 ms and two adjacent sound frames partly overlap. A 256-point Hamming window may then be applied to the framed sound frames. With a sampling rate of 8 kHz, a frame length of 20 ms and an inter-frame overlap of 50%, one frame of the sound signal has 160 sampling points, and 256 sampling points are obtained by zero-padding the end of the signal.
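The numbers in this step can be checked with a short Python sketch: 8000 samples/s × 0.020 s = 160 samples per frame, a 50% overlap gives a hop of 80 samples, and 96 zeros are appended to reach 256 points. The helper names are hypothetical, and the sketch covers only the arithmetic and the zero-padding, not the ordering of padding relative to windowing.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in a frame of frame_ms milliseconds."""
    return sample_rate_hz * frame_ms // 1000


def zero_pad(frame, target_len):
    """Append zeros to the end of a frame until it has target_len samples."""
    if len(frame) > target_len:
        raise ValueError("frame is longer than target_len")
    return list(frame) + [0.0] * (target_len - len(frame))


frame_len = samples_per_frame(8000, 20)   # 160 samples
hop_len = frame_len // 2                  # 50% overlap -> 80 samples
padded = zero_pad([1.0] * frame_len, 256) # 160 samples + 96 zeros
```

Padding to 256 points makes the frame length a power of two, which is the usual reason for zero-padding before a fast Fourier transform.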
In a specific implementation, the delay between adjacent sound frames plays an important role in the calculation of the cepstrum distance. If the delay is set too short, a long vowel signal with a continuous spectrum may be misclassified. If the delay is set too long, speech detection needs a longer start-up time and the spectrum vectors of more sound frames have to be stored. In the embodiments of the present invention, the duration of the collected sound signal may be set to 200 ms to 1 s to improve speech detection performance.
In a specific implementation, when determining the delay between different spectrum vectors, the following formula may be used to apply a simple z-transform to the delay between the spectrum vectors, converting it to the frequency domain:
$$f(x) = x(n-m) \;\Rightarrow\; F(z) = z^{-m} X(z) \qquad (2)$$
where f(x) denotes the time-domain difference between two sampling points, n denotes the index of the current sampling point, m denotes the index of any sampling point before the current sampling point, F(z) denotes the function expression of f after the z-transform, and X(z) denotes the function expression of x after the z-transform.
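Formula (2) is the time-shift property of the z-transform. A short derivation is sketched below, assuming the standard bilateral definition X(z) = Σ x(n) z^{-n}, which the source does not state explicitly; the substitution variable l is introduced only for this derivation.

```latex
\begin{aligned}
\mathcal{Z}\{x(n-m)\}
  &= \sum_{n=-\infty}^{\infty} x(n-m)\, z^{-n} \\
  &= \sum_{l=-\infty}^{\infty} x(l)\, z^{-(l+m)} \qquad (l = n - m) \\
  &= z^{-m} \sum_{l=-\infty}^{\infty} x(l)\, z^{-l}
   = z^{-m} X(z)
\end{aligned}
```

A delay of m samples in the time domain therefore corresponds to multiplication by z^{-m} in the z-domain.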
Step S202: perform FFT processing on the framed and windowed sound frames to obtain the spectrum corresponding to each sound frame.
In an embodiment of the present invention, fast Fourier transform (Fast Fourier Transform, FFT) processing is performed on the framed and windowed sound frames to obtain the spectrum corresponding to each sound frame. The spectrogram corresponding to each sound frame contains the correspondence between frequency and amplitude.
Step S203: convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.
In a specific implementation, the spectrum corresponding to each obtained sound frame is converted to the cepstral domain, and the resulting cepstrum plot contains the correspondence between quefrency (q) and cepstrum coefficient (c). In an embodiment of the present invention, the following formula is used to calculate the cepstrum coefficient corresponding to the input sound frame signal:
$$c = \int_{-\pi}^{\pi} \left( \log S(w) - \alpha \right) \frac{dw}{2\pi} \qquad (3)$$
where c denotes the cepstrum coefficient, S(w) denotes the input sound frame, and α is a preset correction term.
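For a sampled spectrum, formula (3) can be approximated by a sum. The worked step below is a sketch under assumptions that the source does not state: S(w) is positive, α does not depend on w, and the spectrum is evaluated at N uniformly spaced frequencies w_i (N and w_i are symbols introduced here).

```latex
\begin{aligned}
c &= \int_{-\pi}^{\pi} \log S(w)\, \frac{dw}{2\pi}
     - \alpha \int_{-\pi}^{\pi} \frac{dw}{2\pi}
   = \int_{-\pi}^{\pi} \log S(w)\, \frac{dw}{2\pi} - \alpha \\
  &\approx \frac{1}{N} \sum_{i=0}^{N-1} \log S(w_i) - \alpha,
   \qquad w_i = -\pi + \frac{2\pi i}{N}
\end{aligned}
```

Under these assumptions the coefficient is the average log spectrum minus the correction term, since the spacing 2π/N divided by 2π leaves the factor 1/N.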
Step S204: calculate the cepstrum distance between the cepstra of adjacent sound frames.
In a specific implementation, any of the distance formulas known in the prior art may be used to calculate the cepstrum distance between the cepstra of adjacent sound frames. In an embodiment of the present invention, the Manhattan distance (city block distance) may be used to calculate the cepstrum distance between the cepstra of adjacent sound frames:
$$D = \sum_{j=1}^{k} \left| a_j - b_j \right| \qquad (4)$$
where D denotes the cepstrum distance, j denotes the index of a sampled frequency point in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames respectively, and k denotes the number of sampled frequency points.
In an embodiment of the present invention, k is 32, so only 32 subtractions and 31 additions are needed to calculate the cepstrum distance between cepstra with different delay bands. This greatly reduces the computational complexity and saves computing resources.
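Formula (4) translates directly into a few lines of Python. The function name and the choice to raise an error on short inputs are assumptions made for illustration; the default k = 32 follows the embodiment described above.

```python
def cepstrum_distance(a, b, k=32):
    """Manhattan distance over the first k cepstrum values of two frames."""
    if len(a) < k or len(b) < k:
        raise ValueError("both cepstra need at least k values")
    # k absolute differences, accumulated with k - 1 additions.
    return sum(abs(a[j] - b[j]) for j in range(k))


previous_frame = [1.0] * 32
current_frame = [0.5] * 32
distance = cepstrum_distance(previous_frame, current_frame)
```

No multiplications or square roots are involved, which is the source of the low per-frame cost compared with, for example, a Euclidean distance.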
It should be pointed out here that speech detection performance improves as the number of cepstrum distances between the cepstra of adjacent sound frames used in the speech detection of a sound signal increases. However, practical analysis shows that once more than 4 such cepstrum distances are used, the improvement in speech detection performance becomes very small.
Step S205: judge whether the calculated cepstrum distance is greater than the preset distance threshold.
In a specific implementation, the distance threshold in the embodiments of the present invention may be calculated adaptively, independently of the other parts of the embodiments. However, practice shows that when the distance threshold is fixed, the speech detection performance of the method in the embodiments of the present invention is relatively close across various speakers and background noise conditions.
To save resources and storage space, in an embodiment of the present invention the distance threshold may be calculated by pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
It should be pointed out here that the distance threshold can be set according to the actual needs of the end user.
In a specific implementation, step S206 is performed when the judgment result is yes, and no action is performed when the judgment result is no.
Step S206: perform speech detection on the collected sound signal.
In a specific implementation, when the calculated cepstrum distance between the cepstra of adjacent sound frames is greater than the preset distance threshold, the collected input sound signal contains a speech signal, so speech detection can be performed on the collected input sound signal.
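The decision in steps S205 and S206 can be sketched as a small gating function over precomputed cepstrum distances. The function name, the example distances and the threshold value 1.0 are hypothetical; the source gives no numeric threshold.

```python
def first_trigger(distances, threshold):
    """Return the index of the first distance above threshold, else None.

    The comparison is strict (greater than), as in step S205, so a
    distance exactly equal to the threshold does not trigger detection.
    """
    for index, distance in enumerate(distances):
        if distance > threshold:
            return index
    return None


# Hypothetical distances between successive pairs of adjacent frames.
trigger = first_trigger([0.5, 1.0, 3.2], threshold=1.0)
if trigger is not None:
    # Step S206 would run the full speech detection here.
    pass
```

When the function returns None, nothing further is done, which matches the "no action" branch of step S205 and is where the always-listening path saves computation.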
In a specific implementation, once the voice information in the input sound signal has been recognized, the mobile terminal can perform the corresponding operation according to the recognized sound signal. For example, when the user says "dial XX's mobile phone", the mobile terminal recognizes the voice information "dial XX's mobile phone" and, after correct recognition, obtains XX's mobile phone number from the mobile terminal and dials it.
Below, the speech detection method in the embodiments of the present invention is compared with prior-art VAD technology and with the ITU-T G.729 standard respectively.
Table 1:
Although speech detection time may be affected by the coding technique, the comparison in Table 1 above shows that the computation time of the speech detection method in the embodiments of the present invention is short. Compared with typical prior-art speech detection methods that use frequency-domain processing, the speech detection method in the embodiments of the present invention saves more than 60% of the time, and compared with the ITU-T standard it saves more than 40% of the time.
Figs. 3-6 show schematic diagrams of simulation results for the speech recognition accuracy, under different clean speech conditions, of the speech detection method in the embodiments of the present invention, the speech detection method using the ITU-T G.729B standard, the VAD based on a statistical model, and the VAD based on long-term speech information.
The comparison of Figs. 3-6 shows that the speech recognition accuracy of the speech detection method in the embodiments of the present invention can reach 90% and is not affected by whether the speaker is a native speaker.
Figs. 7-10 show schematic diagrams of simulation results for the speech recognition accuracy, under white noise conditions, of the speech detection method in the embodiments of the present invention, the speech detection method using the ITU-T G.729B standard, the VAD based on a statistical model, and the VAD based on long-term speech information.
The comparison of Figs. 7-10 shows that, in noise environments with different signal-to-noise ratios, the speech recognition performance of the speech detection method in the embodiments of the present invention is higher than that of the prior-art speech detection method using the ITU-T G.729B standard, especially in noise environments with a low signal-to-noise ratio. Compared with the other VADs, however, the performance of the speech detection method in the embodiments of the present invention is lower, because the method is comparatively simple. At the same time, the speech detection method in the embodiments of the present invention saves 90% of the computation time while its performance falls only to the range of 85% to 90%. This demonstrates the effectiveness and usability of the speech detection method in the embodiments of the present invention, which is suitable for speech detection in always-listening systems.
Fig. 11 shows a schematic structural diagram of a speech detection device in an embodiment of the present invention. The speech detection device 1100 shown in Fig. 11 may include a framing unit 1101, a windowing unit 1102, a frequency-domain conversion unit 1103, a cepstral-domain conversion unit 1104, a computing unit 1105 and a speech detection unit 1106, wherein:
The framing unit 1101 is adapted to divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames.
The windowing unit 1102 is adapted to window the obtained sound frames.
The frequency-domain conversion unit 1103 is adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame.
In a specific implementation, the frequency-domain conversion unit 1103 is adapted to perform a fast Fourier transform on the windowed sound frames to obtain the spectrum corresponding to each sound frame.
The cepstral-domain conversion unit 1104 is adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.
The computing unit 1105 is adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames.
The speech detection unit 1106 is adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
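One way to picture how units 1101 to 1106 fit together is a small Python class whose fields are pluggable functions. This is only a structural sketch: the class name, field names and the trivial stand-in units in the usage example are hypothetical, and the stand-ins are identity placeholders rather than real transforms.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]


@dataclass
class SpeechDetectionDevice:
    framing_unit: Callable[[Sequence[float]], List[Frame]]    # unit 1101
    windowing_unit: Callable[[Frame], Frame]                  # unit 1102
    frequency_unit: Callable[[Frame], Frame]                  # unit 1103
    cepstrum_unit: Callable[[Frame], Frame]                   # unit 1104
    distance_unit: Callable[[Frame, Frame], float]            # unit 1105
    threshold: float                                          # used by unit 1106

    def should_detect(self, signal: Sequence[float]) -> bool:
        """Return True if any adjacent-frame distance exceeds the threshold."""
        frames = self.framing_unit(signal)
        cepstra = [
            self.cepstrum_unit(self.frequency_unit(self.windowing_unit(f)))
            for f in frames
        ]
        return any(self.distance_unit(a, b) > self.threshold
                   for a, b in zip(cepstra, cepstra[1:]))


# Stand-in units: frames of 2 samples with a hop of 1, identity transforms,
# and a Manhattan distance. Real units would window, transform and so on.
device = SpeechDetectionDevice(
    framing_unit=lambda s: [list(s[i:i + 2]) for i in range(len(s) - 1)],
    windowing_unit=lambda f: f,
    frequency_unit=lambda f: f,
    cepstrum_unit=lambda f: f,
    distance_unit=lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    threshold=1.0,
)
quiet = device.should_detect([0.0, 0.0, 0.0, 0.0])
loud = device.should_detect([0.0, 0.0, 0.0, 5.0])
```

Keeping each unit as a separate callable mirrors the device claims, where every unit has a single responsibility and the speech detection unit acts only on the computing unit's output.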
Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc or the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, so the protection scope of the present invention shall be defined by the claims.

Claims (12)

1. a speech detection method, it is characterised in that including:
The acoustical signal gathered is carried out overlapping framing, obtains multiple voiced frames of correspondence;
Obtained multiple voiced frames are carried out windowing process;
Voiced frame after windowing process is carried out frequency domain conversion, obtains the frequency spectrum that each voiced frame is corresponding;
Frequency spectrum corresponding for each obtained voiced frame is carried out cepstral domains conversion, obtains the cepstrum of correspondence;
Calculate the cepstrum distance between the cepstrum of two adjacent voiced frames;
When the cepstrum distance calculated is more than the distance threshold preset, the acoustical signal gathered is carried out Speech detection.
Speech detection method the most according to claim 1, it is characterised in that described will be through windowing process After voiced frame carry out frequency domain conversion, obtain the frequency spectrum that each voiced frame is corresponding, including: will be through adding Voiced frame after window processes carries out fast Fourier transform, obtains the frequency spectrum that each voiced frame is corresponding.
Speech detection method the most according to claim 2, it is characterised in that described by obtained each The frequency spectrum that voiced frame is corresponding carries out cepstral domains conversion, obtains the cepstrum of correspondence, including:
c = ∫ - π π ( log S ( w ) - α ) dw 2 π
Wherein, c represents cepstrum coefficient, and S (w) represents voiced frame, and α is default correction term.
Speech detection method the most according to claim 1, it is characterised in that adjacent two of described calculating Cepstrum distance between the cepstrum of voiced frame, including:
D = Σ j = 1 k | a j - b j |
Wherein, D represents that cepstrum distance, j represent the sequence number of the sampling frequency in voiced frame, aj、bjTable respectively Showing the cepstrum of adjacent two voiced frame, k represents sampling frequency number.
Speech detection method the most according to claim 1, it is characterised in that the sampling frequency of described voiced frame Count is 32.
Speech detection method the most according to claim 1, it is characterised in that described gathered sound letter Number time a length of 200ms to 1s.
Speech detection method the most according to claim 1, it is characterised in that described distance threshold is by right Sample frequency is that the sampled signal of 8KHz carries out preemphasis process, and is the sound of 20ms to frame length Frame adds the Hamming window of 256 and obtains.
8. A speech detection device, characterized by comprising:
a framing unit, adapted to perform overlapping framing on a collected sound signal to obtain a plurality of corresponding sound frames;
a windowing unit, adapted to apply windowing to the obtained plurality of sound frames;
a frequency-domain conversion unit, adapted to perform frequency-domain conversion on the windowed sound frames to obtain the spectrum corresponding to each sound frame;
a cepstral-domain conversion unit, adapted to perform cepstral-domain conversion on the spectrum corresponding to each obtained sound frame to obtain the corresponding cepstrum;
a calculation unit, adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames;
a speech detection unit, adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
9. The speech detection device according to claim 8, characterized in that the frequency-domain conversion unit is adapted to perform a fast Fourier transform on the windowed sound frames to obtain the spectrum corresponding to each sound frame.
10. The speech detection device according to claim 8, characterized in that the number of sampling frequency points of the sound frame is 32.
11. The speech detection device according to claim 8, characterized in that the duration of the collected sound signal is 200 ms to 1 s.
12. The speech detection device according to claim 8, characterized in that the distance threshold is obtained by applying pre-emphasis to a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
CN201510119374.XA 2015-03-18 2015-03-18 Voice detection method and device Pending CN106024017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510119374.XA CN106024017A (en) 2015-03-18 2015-03-18 Voice detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510119374.XA CN106024017A (en) 2015-03-18 2015-03-18 Voice detection method and device

Publications (1)

Publication Number Publication Date
CN106024017A true CN106024017A (en) 2016-10-12

Family

ID=57082366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510119374.XA Pending CN106024017A (en) 2015-03-18 2015-03-18 Voice detection method and device

Country Status (1)

Country Link
CN (1) CN106024017A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107393559A (en) * 2017-07-14 2017-11-24 深圳永顺智信息科技有限公司 The method and device of calibration voice detection results
CN108470571A (en) * 2018-03-08 2018-08-31 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio-frequency detection, device and storage medium
CN109410971A (en) * 2018-11-13 2019-03-01 无锡冰河计算机科技发展有限公司 A kind of method and apparatus for beautifying sound
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN111433737A (en) * 2017-12-04 2020-07-17 三星电子株式会社 Electronic device and control method thereof
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Speech endpoint detection method and device
CN103021405A (en) * 2012-12-05 2013-04-03 渤海大学 Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter
CN103996399A (en) * 2014-04-21 2014-08-20 深圳市北科瑞声科技有限公司 Voice detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王帛 等: ""一种基于短时倒谱速变率的语音信号平滑端点检测方法"", 《现代电子技术》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107393559A (en) * 2017-07-14 2017-11-24 深圳永顺智信息科技有限公司 The method and device of calibration voice detection results
CN107393559B (en) * 2017-07-14 2021-05-18 深圳永顺智信息科技有限公司 Method and device for checking voice detection result
CN111433737A (en) * 2017-12-04 2020-07-17 三星电子株式会社 Electronic device and control method thereof
CN108470571A (en) * 2018-03-08 2018-08-31 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio-frequency detection, device and storage medium
CN108470571B (en) * 2018-03-08 2020-09-08 腾讯音乐娱乐科技(深圳)有限公司 Audio detection method and device and storage medium
CN109410971A (en) * 2018-11-13 2019-03-01 无锡冰河计算机科技发展有限公司 A kind of method and apparatus for beautifying sound
CN109410971B (en) * 2018-11-13 2021-08-31 无锡冰河计算机科技发展有限公司 Method and device for beautifying sound
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network

Similar Documents

Publication Publication Date Title
CN106486131B (en) Method and device for voice denoising
US8972255B2 (en) Method and device for classifying background noise contained in an audio signal
US11475907B2 (en) Method and device of denoising voice signal
Tsoukalas et al. Speech enhancement based on audible noise suppression
CN106024017A (en) Voice detection method and device
CN101149928B (en) Sound signal processing method, sound signal processing device and computer program
CN107833581B (en) Method, device and readable storage medium for extracting fundamental tone frequency of sound
WO2011091068A1 (en) Distortion measurement for noise suppression system
CN109616098B (en) Voice endpoint detection method and device based on frequency domain energy
CN101271686A (en) Method and apparatus for estimating noise using harmonics of a speech signal
Kumar Real-time performance evaluation of modified cascaded median-based noise estimation for speech enhancement system
WO2021000498A1 (en) Composite speech recognition method, device, equipment, and computer-readable storage medium
Schwerin et al. An improved speech transmission index for intelligibility prediction
CN103903633A (en) Method and apparatus for detecting voice signal
CN103440872A (en) Denoising Method of Transient Noise
CN106033669A (en) Voice identification method and apparatus thereof
CN106920543B (en) Audio recognition method and device
CN107564512B (en) Voice activity detection method and device
CN106816157A (en) Audio recognition method and device
CN120148484B (en) Speech recognition method and device based on microcomputer
Kumar Mean-median based noise estimation method using spectral subtraction for speech enhancement technique
CN118314919B (en) Voice repair method, device, audio equipment and storage medium
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
CN106340310B (en) Speech detection method and device
KR20090098891A (en) Robust language activity detection method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012
