CN108172242B

CN108172242B - Improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method

Info

Publication number: CN108172242B
Application number: CN201810014999.3A
Authority: CN
Inventors: 鲁霖
Original assignee: Shenzhen Xinzhongxin Technology Co Ltd
Current assignee: Shenzhen Xinzhongxin Technology Co Ltd
Priority date: 2018-01-08
Filing date: 2018-01-08
Publication date: 2021-06-01
Anticipated expiration: 2038-01-08
Also published as: CN108172242A

Abstract

The invention relates to an improved voice interaction endpoint detection method for a Bluetooth intelligent cloud sound box, involving an intelligent cloud sound box, an intelligent device, data analysis processing software APP and a Bluetooth module. The intelligent device is a mobile phone, a tablet computer or the like, and comprises the Bluetooth module and the data analysis processing software APP; the intelligent cloud sound box comprises a cloud server; the data analysis processing software APP is installed on the intelligent device; the Bluetooth module and the Bluetooth intelligent cloud sound box are connected through an audio channel; the data analysis processing software APP of the intelligent device establishes a control-instruction connection with the Bluetooth intelligent cloud sound box through the Bluetooth module, so that control data can be exchanged between the data analysis processing software APP and the Bluetooth intelligent cloud sound box. The beneficial effects of the invention are that the problems of poor recognition rate, endpoint misjudgment and the like caused by environmental differences in the prior art are overcome, and the efficiency and experience of human-machine voice interaction are improved.

Description

Improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method

Technical Field

The invention relates to the field of Bluetooth Low Energy technology applications, and in particular to an improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method.

Background

In the field of human-machine interaction, voice endpoint detection, also known as Voice Activity Detection (VAD), is a critical task: the quality of the detection algorithm to a certain degree directly determines the success or failure of the whole voice interaction system. For a complete voice interaction system, the final usability depends not only on the recognition algorithm; many related factors directly influence the success or failure of the application. The purpose of endpoint detection is to distinguish speech from non-speech in a signal stream in a complex application environment and to determine where the speech begins and ends. A good endpoint detection method can address problems such as unsatisfactory detection and low recognition rates in speech recognition software, and highly accurate endpoint detection ensures that the input is a valid and complete speech signal, so that recognition is more accurate and faster.

The traditional endpoint detection method uses double-threshold detection based on short-time energy and zero-crossing rate. A first, coarse decision is made on the short-time energy of the audio using a high threshold; a second decision is then made using the average zero-crossing rate. Although double-threshold endpoint detection requires little computation and achieves a good recognition rate in a quiet environment, it has several defects: for example, the threshold must be set empirically and is a fixed parameter, and in real-time voice interaction, scenes involving contextual pauses are easily misjudged, so the human-machine interaction effect is not ideal.
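The traditional two-stage check described above can be illustrated with a simplified per-frame sketch. The class name, method names and threshold parameters are hypothetical and not taken from the patent; a full double-threshold detector would normally also track segment boundaries across frames, which is omitted here.

```java
// Simplified per-frame sketch of a double-threshold check with FIXED thresholds.
// All names and threshold values are illustrative assumptions.
final class DoubleThresholdVad {

    /** Short-time energy: sum of squared samples in one frame. */
    static long shortTimeEnergy(short[] frame) {
        long sum = 0;
        for (short s : frame) {
            sum += (long) s * s;
        }
        return sum;
    }

    /** Zero-crossing rate: fraction of adjacent sample pairs whose signs differ. */
    static double zeroCrossingRate(short[] frame) {
        if (frame.length < 2) {
            return 0.0;
        }
        int crossings = 0;
        for (int i = 1; i < frame.length; i++) {
            boolean prevNegative = frame[i - 1] < 0;
            boolean curNegative = frame[i] < 0;
            if (prevNegative != curNegative) {
                crossings++;
            }
        }
        return (double) crossings / (frame.length - 1);
    }

    /**
     * First pass: a high energy threshold gives a coarse decision.
     * Second pass: a frame below that threshold is still kept as speech
     * when its zero-crossing rate is high.
     */
    static boolean isSpeech(short[] frame, long energyThreshold, double zcrThreshold) {
        if (shortTimeEnergy(frame) >= energyThreshold) {
            return true;
        }
        return zeroCrossingRate(frame) >= zcrThreshold;
    }
}
```

Because both thresholds are passed in as constants, the sketch also shows the weakness noted above: values tuned for a quiet room will not necessarily suit a noisy one.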

Therefore, for the human-machine interaction encountered in daily life, how to accurately detect the endpoint positions of an audio signal is a problem that technical staff urgently need to solve.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method that overcomes the problems of poor recognition rate, endpoint misjudgment and the like caused by environmental differences in the prior art, and improves the efficiency and experience of human-machine voice interaction.

In order to solve the technical problem, the invention provides an improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method. The intelligent device is a mobile phone, a tablet computer or the like; the intelligent device comprises a Bluetooth module and data analysis processing software APP; the intelligent cloud sound box comprises a cloud server;

the data analysis processing software APP is installed on the intelligent device;

the Bluetooth module is connected with the Bluetooth intelligent cloud sound box through an audio channel.

Further, the data analysis processing software APP of the intelligent device establishes a control-instruction connection with the Bluetooth intelligent cloud sound box through the Bluetooth module, realizing control data interaction between the data analysis processing software APP and the Bluetooth intelligent cloud sound box.

Further, the data analysis processing software APP is normally in a standby state; when the intelligent device end wakes up voice interaction, the data analysis processing software APP starts the Bluetooth module connection, begins recording and collecting the audio signal, and at the same time establishes a data transmission channel with the cloud server of the Bluetooth intelligent cloud sound box.

Further, the data analysis processing software APP sets a mute protection time, whose duration is agreed between the data analysis processing software APP and the cloud server. After voice interaction is woken up, there is a 3-second silent acquisition period even if the user does not speak, which prevents the whole system from deciding to stop because the user has not had time to speak. In addition, operating the connection-oriented SCO link of the Bluetooth module too frequently within a very short time may cause system-level abnormalities, and the mute protection time prevents this.

Further, the data analysis processing software APP of the intelligent device extracts each frame of the audio signal in real time; the data analysis processing software APP sets the duration of each audio frame to 10 ms.

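Further, the data analysis processing software APP of the smart phone calculates the short-time energy of each frame of the audio signal. The short-time energy is calculated as follows:

$$E_i = \sum_{m=1}^{N} x_i^2(m)$$

where $x_i(m)$ is the value of the m-th sampling point in the i-th frame and $N$ is the number of sampling points per frame.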

Further, the data analysis processing software APP of the intelligent device dynamically judges whether each frame of the audio signal is a speech frame. Short-time energy directly reflects the energy and amplitude of the speech signal, so voiced and unvoiced segments can be distinguished by it. The data analysis processing software APP dynamically searches for the maximum energy value among the current frame and the previous audio frames, and the threshold is dynamically reduced whenever a subsequent audio frame falls below the maximum-energy frame threshold (M). When the current short-time energy is small, that is, when the volume has attenuated too much, the frame is defined as a non-speech frame and non-speech counting starts. When the count of consecutive non-speech frames reaches 200, equivalent to a pause of 2 seconds, the speaker is considered to have finished. If a speech frame occurs in between, the counter is reset and counting starts again.

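The adaptive threshold, consistent with the comparison made in the example code of the Detailed Description, is as follows:

$$T_i = M \cdot \max_{j \le i} E_j$$

where $E_j$ is the short-time energy of the j-th frame and $M$ is the threshold coefficient; a frame whose short-time energy is below $T_i$ is treated as a non-speech frame.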
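As a hedged worked example of how such a threshold behaves, assume the comparison used by the example code in the Detailed Description (current frame energy against M times the highest energy seen so far) and a purely hypothetical coefficient M = 0.1, a value the text does not specify:

```latex
\begin{aligned}
E_{\max} &= 2.0 \times 10^{6}, & T &= M \cdot E_{\max} = 2.0 \times 10^{5},\\
E_i &= 1.5 \times 10^{5} < T & &\Rightarrow \text{frame } i \text{ is counted as non-speech},\\
200 \text{ frames} \times 10\,\text{ms} &= 2\,\text{s} & &\Rightarrow \text{end of speech after a 2-second pause}.
\end{aligned}
```

Because the threshold scales with the loudest frame observed, the same coefficient adapts to both quiet and loud speakers, which a fixed threshold cannot do.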

Further, the data analysis processing software APP of the intelligent device performs effective endpoint judgment.

Further, the data analysis processing software APP of the intelligent device sends an acquisition-complete instruction to the cloud server, and speech recognition starts: the data analysis processing software APP stops recording once voice acquisition is judged to be finished, sends the acquisition-complete instruction to the cloud server, and speech recognition begins. In a large number of voice interaction tests on the Bluetooth intelligent cloud sound box, the speech endpoints were judged accurately.

Further, the working steps of the improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method are as follows:

a. the data analysis processing software APP of the intelligent device connects to the Bluetooth intelligent cloud sound box;

b. the intelligent device end wakes up voice interaction;

c. the data analysis processing software APP of the intelligent device starts a mute protection time counter;

d. the data analysis processing software APP of the intelligent device extracts each frame of the audio signal in real time;

e. the data analysis processing software APP of the intelligent device calculates the short-time energy of each frame of the audio signal;

f. the data analysis processing software APP of the intelligent device dynamically judges whether each frame of the audio signal is a speech frame;

h. the data analysis processing software APP of the intelligent device performs effective endpoint judgment;

i. the data analysis processing software APP of the intelligent device sends an acquisition-complete instruction to the cloud server, and speech recognition starts.
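The per-frame part of these steps (short-time energy, speech/non-speech decision, and the non-speech counter) can be sketched as one small class. This is a minimal illustration under stated assumptions, not the patented implementation: the class name, the `ratio` parameter (standing in for M, whose value the text does not give) and the synthetic frames in `main` are all hypothetical.

```java
// Minimal sketch of an adaptive-threshold endpoint detector fed one frame at a time.
// Names and parameter values are assumptions made for illustration.
final class AdaptiveEndpointDetector {
    private final double ratio;         // stands in for M; the text does not give its value
    private final int maxSilentFrames;  // 200 frames of 10 ms correspond to 2 s in the text
    private long highestEnergy = 0;
    private int silentFrames = 0;

    AdaptiveEndpointDetector(double ratio, int maxSilentFrames) {
        this.ratio = ratio;
        this.maxSilentFrames = maxSilentFrames;
    }

    /** Short-time energy of one frame: sum of squared samples. */
    static long frameEnergy(short[] frame) {
        long sum = 0;
        for (short s : frame) {
            sum += (long) s * s;
        }
        return sum;
    }

    /** Feeds one frame; returns true once enough consecutive non-speech frames were seen. */
    boolean accept(short[] frame) {
        long energy = frameEnergy(frame);
        if (energy > highestEnergy) {
            highestEnergy = energy;   // new maximum: raise the reference level
            silentFrames = 0;
            return false;
        }
        if (energy < ratio * highestEnergy) {
            silentFrames++;           // below the adaptive threshold: non-speech frame
            if (silentFrames >= maxSilentFrames) {
                silentFrames = 0;
                return true;
            }
            return false;
        }
        silentFrames = 0;             // speech frame in between: restart the count
        return false;
    }

    public static void main(String[] args) {
        AdaptiveEndpointDetector detector = new AdaptiveEndpointDetector(0.1, 200);
        short[] speech = new short[160];
        java.util.Arrays.fill(speech, (short) 2000);
        short[] silence = new short[160];
        for (int i = 0; i < 1000; i++) {
            short[] frame = i < 50 ? speech : silence;
            if (detector.accept(frame)) {
                System.out.println("Endpoint detected at frame " + i);
                break;
            }
        }
    }
}
```

Silence before any speech never triggers the detector in this sketch, because the reference energy is still zero; the mute protection time described above covers that initial period separately.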

With the above technical scheme, the beneficial effects of the invention are as follows:

compared with the prior art, the improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method overcomes the problems of poor recognition rate, endpoint misjudgment and the like caused by environmental differences in the prior art, and improves the efficiency and experience of human-machine voice interaction.

Drawings

FIG. 1 is a working block diagram of the improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method.

FIG. 2 is a flow chart of the improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method.

Detailed Description

The invention will be described in detail below with reference to fig. 1 to 2 and specific examples, but the invention is not limited thereto.

As shown in FIGS. 1 and 2, the improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method involves an intelligent cloud sound box, an intelligent device, data analysis processing software APP, and a Bluetooth module. The intelligent device is a mobile phone, a tablet computer or the like; the intelligent device comprises the Bluetooth module and the data analysis processing software APP; the intelligent cloud sound box comprises a cloud server; the data analysis processing software APP is installed on the intelligent device; the Bluetooth module and the Bluetooth intelligent cloud sound box are connected through an audio channel; the data analysis processing software APP of the intelligent device establishes a control-instruction connection with the Bluetooth intelligent cloud sound box through the Bluetooth module, realizing control data interaction between the data analysis processing software APP and the Bluetooth intelligent cloud sound box. The data analysis processing software APP is normally in a standby state; when the intelligent device end wakes up voice interaction, the data analysis processing software APP starts the Bluetooth module connection, begins recording and collecting the audio signal, and at the same time establishes a data transmission channel with the cloud server of the Bluetooth intelligent cloud sound box.

The data analysis processing software APP sets a mute protection time, whose duration is agreed between the data analysis processing software APP and the cloud server. After voice interaction is woken up, there is a 3-second silent acquisition period even if the user does not speak, which prevents the whole system from deciding to stop because the user has not had time to speak. In addition, operating the connection-oriented SCO link of the Bluetooth module too frequently within a very short time may cause system-level abnormalities, and the mute protection time prevents this.

The data analysis processing software APP of the intelligent device extracts each frame of the audio signal in real time and sets the duration of each audio frame to 10 ms. The data analysis processing software APP of the smart phone calculates the short-time energy of each frame of the audio signal as follows:

$$E_i = \sum_{m=1}^{N} x_i^2(m)$$

where $x_i(m)$ is the value of the m-th sampling point in the i-th frame and $N$ is the number of sampling points per frame.

The data analysis processing software APP of the intelligent device dynamically judges whether each frame of the audio signal is a speech frame. Short-time energy directly reflects the energy and amplitude of the speech signal, so voiced and unvoiced segments can be distinguished by it. The data analysis processing software APP dynamically searches for the maximum energy value among the current frame and the previous audio frames, and the threshold is dynamically reduced whenever a subsequent audio frame falls below the maximum-energy frame threshold (M). When the current short-time energy is small, that is, when the volume has attenuated too much, the frame is defined as a non-speech frame and non-speech counting starts. When the count of consecutive non-speech frames reaches 200, equivalent to a pause of 2 seconds, the speaker is considered to have finished. If a speech frame occurs in between, the counter is reset and counting starts again.

The adaptive threshold, consistent with the comparison made in the example code of step S106, is as follows:

$$T_i = M \cdot \max_{j \le i} E_j$$

where $E_j$ is the short-time energy of the j-th frame and $M$ is the threshold coefficient.

The data analysis processing software APP of the intelligent device performs effective endpoint judgment and sends an acquisition-complete instruction to the cloud server, and speech recognition starts: the data analysis processing software APP stops recording once voice acquisition is judged to be finished, sends the acquisition-complete instruction to the cloud server, and speech recognition begins. In a large number of voice interaction tests on the Bluetooth intelligent cloud sound box, the speech endpoints were judged accurately.

The working steps of the improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method are as follows:

b. the intelligent device end awakens voice interaction;

In the embodiment of the invention:

S101, the data analysis processing software APP of the intelligent device connects to the Bluetooth intelligent cloud sound box device;

First, an audio channel connection to the Bluetooth intelligent cloud sound box is established through the Bluetooth module in the mobile phone system. Then the data analysis processing software APP of the intelligent device establishes a control-instruction connection with the Bluetooth intelligent cloud sound box. To ensure good compatibility, the Android version establishes an SPP channel connection with the device and the iOS version establishes a BLE channel connection, so that control data can be exchanged between the APP and the Bluetooth intelligent cloud sound box device.

S102, the intelligent device end wakes up voice interaction;

The data analysis processing software APP is normally in a standby state. Only when the device end wakes up voice interaction does it start the Bluetooth SCO connection, begin recording and collecting the audio signal, and at the same time establish a data transmission channel with the cloud server.

S103, the data analysis processing software APP of the intelligent device starts a mute protection time counter;

For a better user experience and for system stability, a mute protection time is set, whose specific duration is agreed with the cloud server. After voice interaction is woken up, there is a 3-second silent acquisition period even if the user does not speak, which prevents the whole system from deciding to stop because the user did not speak in time. In addition, operating the Bluetooth SCO link too frequently within a very short time can cause system-level abnormalities, which the mute protection time also helps to avoid.
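A mute protection time of this kind can be sketched as a simple guard that suppresses the stop decision until the agreed period has elapsed. The class and method names below are hypothetical; timestamps are passed in explicitly so the behaviour is easy to check, and the 3000 ms value in the usage note simply mirrors the 3 seconds mentioned in the text.

```java
// Minimal sketch of a mute protection guard; names are illustrative assumptions.
final class SilenceGuard {
    private final long guardMillis;  // agreed protection time, e.g. 3000 ms
    private final long startMillis;  // moment voice interaction was woken up

    SilenceGuard(long guardMillis, long startMillis) {
        this.guardMillis = guardMillis;
        this.startMillis = startMillis;
    }

    /** True once the protection time has fully elapsed. */
    boolean hasElapsed(long nowMillis) {
        return nowMillis - startMillis >= guardMillis;
    }

    /** Recording may stop only if a pause is detected AND the guard has elapsed. */
    boolean mayStop(boolean pausing, long nowMillis) {
        return pausing && hasElapsed(nowMillis);
    }
}
```

In use, a guard would be created at wake-up, for example `new SilenceGuard(3000, System.currentTimeMillis())`, and `mayStop` consulted on every polling cycle. The flag `mRecordDurationReached` in the example code of step S107 presumably plays a similar role, although the text does not say so explicitly.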

S104, the data analysis processing software APP of the intelligent device extracts each frame of the audio signal in real time;

The audio signal is a non-stationary, time-varying signal. To obtain more accurate calculation results, it is treated as stationary and time-invariant within a "short-time" range; here, the data analysis processing software APP sets the duration of each audio frame to 10 ms.
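To make the 10 ms frame concrete, the following sketch converts a frame duration into sample and byte counts and splits a sample buffer into whole frames. The sample rate and sample width are assumptions (the text does not state them); 16 kHz, 16-bit mono is used in the usage note only as an example. All names are hypothetical.

```java
// Frame-size arithmetic and framing; the audio format is an assumption.
final class FrameMath {

    /** Number of samples in one frame of the given duration. */
    static int samplesPerFrame(int sampleRateHz, int frameMillis) {
        return sampleRateHz * frameMillis / 1000;
    }

    /** Number of bytes in one mono frame for the given sample width. */
    static int bytesPerFrame(int sampleRateHz, int frameMillis, int bytesPerSample) {
        return samplesPerFrame(sampleRateHz, frameMillis) * bytesPerSample;
    }

    /** Splits samples into consecutive frames; a trailing partial frame is dropped. */
    static short[][] split(short[] samples, int frameLength) {
        if (frameLength <= 0) {
            throw new IllegalArgumentException("frameLength must be positive");
        }
        int frameCount = samples.length / frameLength;
        short[][] frames = new short[frameCount][frameLength];
        for (int f = 0; f < frameCount; f++) {
            System.arraycopy(samples, f * frameLength, frames[f], 0, frameLength);
        }
        return frames;
    }
}
```

Under the assumed 16 kHz, 16-bit mono format, `samplesPerFrame(16000, 10)` gives 160 samples and `bytesPerFrame(16000, 10, 2)` gives 320 bytes per 10 ms frame.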

S105, the data analysis processing software APP of the intelligent device calculates the short-time energy of each frame of the audio signal;

The short-time energy is calculated as follows:

$$E_i = \sum_{m=1}^{N} x_i^2(m)$$

where $x_i(m)$ is the value of the m-th sampling point in the i-th frame and $N$ is the number of sampling points per frame.

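For the short-time energy calculation formula, example APP code is as follows:

    // Returns the sum of squared 16-bit samples in the byte range [end - span, end).
    // Despite its name, no mean or square root is taken: the value is the short-time energy.
    private long getRms(int end, int span) {
        int begin = end - span;
        if (begin < 0) {
            begin = 0;
        }
        if (begin % 2 != 0) {
            begin++; // align to a 2-byte sample boundary
        }
        long sum = 0;
        // i + 1 < end keeps the second byte of each sample inside the window
        for (int i = begin; i + 1 < end; i += 2) {
            short curSample = getShort(this.mRecording[i], this.mRecording[i + 1]);
            sum += (long) curSample * curSample;
        }
        return sum;
    }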
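The example code calls a helper `getShort` that is not shown. A plausible sketch is given below, assuming the recording buffer holds 16-bit PCM with the low byte first (little-endian); the patent does not state the byte order, and the enclosing class name is hypothetical.

```java
// Hypothetical helper for decoding 16-bit PCM; byte order is an assumption.
final class Pcm16 {

    /** Combines two bytes (low byte first) into one signed 16-bit sample. */
    static short getShort(byte low, byte high) {
        // Mask the low byte to avoid sign extension; the high byte carries the sign.
        return (short) ((low & 0xFF) | (high << 8));
    }
}
```

If the buffer were big-endian instead, the two arguments would simply be swapped at the call site.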

S106, the data analysis processing software APP of the intelligent device dynamically judges whether each frame of the audio signal is a speech frame;

Short-time energy directly reflects the energy and amplitude of the speech signal, so voiced and unvoiced segments can be distinguished by it. The data analysis processing software APP dynamically searches for the maximum energy value among the current frame and the previous audio frames, and the threshold is dynamically reduced whenever a subsequent audio frame falls below the maximum-energy frame threshold (M). When the current short-time energy is small, that is, when the volume has attenuated too much, the frame is defined as a non-speech frame and non-speech counting starts. When the count of consecutive non-speech frames reaches 200, equivalent to a pause of 2 seconds, the speaker is considered to have finished. If a speech frame occurs in between, the counter is reset and counting starts again.

Adaptive threshold, consistent with the comparison made in the example code:

$$T_i = M \cdot \max_{j \le i} E_j$$

where $E_j$ is the short-time energy of the j-th frame and $M$ is the threshold coefficient.

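Example APP code is as follows:

    private static final int RMS_COUNT_MAX = 200; // 200 checks at 10 ms intervals = 2 s

    public boolean isPausing() {
        long rms = getRms(this.mRecordedLength, this.mOneSec);
        if (rms > this.highestRMS) {
            // New maximum energy: update the reference and restart the non-speech count.
            this.highestRMS = rms;
            this.rmsCount = 0;
            return false;
        } else if (((double) rms) < M * ((double) this.highestRMS)) {
            // Energy fell below the adaptive threshold: count a non-speech frame.
            if (this.rmsCount < RMS_COUNT_MAX) {
                this.rmsCount++;
                return false;
            } else {
                // Enough consecutive non-speech frames: report the end of speech.
                this.rmsCount = 0;
                return true;
            }
        } else {
            // Speech frame in between: reset the counter and count again.
            this.rmsCount = 0;
            return false;
        }
    }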

S107, the data analysis processing software APP of the intelligent device performs effective endpoint judgment;

Voice endpoint judgment in human-machine interaction is subject to several constraints: the 3-second mute protection time, the locally improved short-time-energy endpoint detection, and a stop-acquisition instruction issued by the cloud.

Example APP code is as follows:

    while (recorder != null && recorder.getState() == AudioRecorder.State.RECORDING) {
        boolean pausing = recorder.isPausing();
        // Stop only when a pause is detected and the required recording duration has been reached.
        if (pausing && mRecordDurationReached) {
            if (mBtDeviceSpeechType == BT_DEVICE_SPEECH_RECOGNITION) {
                mBtDeviceSpeechType = BT_DEVICE_SPEECH_RECOGNITION_NONE;
                stopBluetoothSCO();
            }
            stopListening(true);
            break;
        }
        try {
            Thread.sleep(10); // poll once per 10 ms frame
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
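The three constraints named in step S107 (mute protection time, local endpoint detection, and a stop instruction from the cloud) can be combined into a single decision, sketched below. The names are hypothetical, and giving the cloud instruction priority over the local conditions is an assumption; the text does not specify how the conditions are ordered.

```java
// Sketch of combining the three stop conditions; the priority order is an assumption.
final class EndpointDecision {

    enum Reason { NONE, LOCAL_PAUSE, CLOUD_STOP }

    static Reason decide(boolean guardElapsed, boolean locallyPausing, boolean cloudStopReceived) {
        if (cloudStopReceived) {
            return Reason.CLOUD_STOP;   // assumed to override the local conditions
        }
        if (guardElapsed && locallyPausing) {
            return Reason.LOCAL_PAUSE;  // local detection counts only after the guard time
        }
        return Reason.NONE;             // keep recording
    }
}
```

Returning a reason rather than a boolean lets the caller decide, for instance, whether an acquisition-complete instruction still needs to be sent to the cloud.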

S108, the data analysis processing software APP of the intelligent device sends an acquisition-complete instruction to the cloud, and speech recognition starts;

The data analysis processing software APP stops recording once voice acquisition is judged to be finished and sends an acquisition-complete instruction to the cloud, after which speech recognition can begin. In a large number of voice interaction tests on the Bluetooth intelligent cloud sound box, the speech endpoints could basically be judged accurately. The transmission and processing of non-speech frames are greatly reduced, which improves efficiency and the user experience.

It will be appreciated by those skilled in the art that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or equivalence to the invention are intended to be embraced therein.

Claims

1. An improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method comprises an intelligent cloud sound box, intelligent equipment, data analysis processing software APP and a Bluetooth module; the method is characterized in that: the intelligent equipment comprises a Bluetooth module and data analysis processing software APP; the intelligent cloud sound box comprises a cloud server; the data analysis processing software APP is installed on the intelligent equipment; the Bluetooth module is connected with a Bluetooth intelligent cloud sound box through an audio channel; the data analysis processing software APP of the intelligent device establishes connection of a control instruction with the Bluetooth intelligent cloud sound box through the Bluetooth module, and control data interaction between the data analysis processing software APP and the Bluetooth intelligent cloud sound box is achieved;

b. the intelligent device end awakens voice interaction;

i. data analysis processing software APP of the intelligent equipment sends the collected data to a cloud server, voice recognition is started, and voice endpoints are accurately judged in a large number of voice interaction tests in the Bluetooth intelligent cloud sound box;

the data analysis processing software APP sets a mute protection time, wherein the duration of the mute protection time is agreed between the data analysis processing software APP and the cloud server; after voice interaction is woken up, there is a 3-second silent acquisition period even if the user does not speak, so that the whole system is prevented from deciding to stop because the user has not had time to speak; in addition, operating the connection-oriented SCO link of the Bluetooth module too frequently within a very short time may cause system-level abnormalities, and the mute protection time prevents the connection-oriented SCO link of the Bluetooth module from being operated too frequently within a very short time.

2. The improved Bluetooth intelligent cloud speaker voice interaction endpoint detection method according to claim 1, wherein: the data analysis processing software APP is normally in a standby state, and when the intelligent device end wakes up voice interaction, the data analysis processing software APP starts the Bluetooth module connection, begins recording and collecting the audio signal, and at the same time establishes a data transmission channel with the cloud server of the Bluetooth intelligent cloud sound box.

3. The improved Bluetooth intelligent cloud speaker voice interaction endpoint detection method according to claim 1, wherein: the data analysis processing software APP of the intelligent device extracts each frame of the audio signal in real time; the data analysis processing software APP sets the duration of each audio frame to 10 ms.

4. The improved Bluetooth intelligent cloud speaker voice interaction endpoint detection method according to claim 1, wherein: the data analysis processing software APP of the smart phone calculates the short-time energy of each frame of the audio signal, and the short-time energy is calculated as follows:

$$E_i = \sum_{m=1}^{N} x_i^2(m)$$

where $x_i(m)$ is the value of the m-th sampling point in the i-th frame and $N$ is the number of sampling points per frame.

5. The improved Bluetooth intelligent cloud speaker voice interaction endpoint detection method according to claim 1, wherein: the data analysis processing software APP of the intelligent device performs effective endpoint judgment.

6. The improved Bluetooth intelligent cloud speaker voice interaction endpoint detection method according to claim 1, wherein: the data analysis processing software APP of the intelligent device performs effective endpoint judgment; the data analysis processing software APP of the intelligent device sends an acquisition-complete instruction to the cloud server, and speech recognition starts; the data analysis processing software APP stops recording once voice acquisition is judged to be finished, sends the acquisition-complete instruction to the cloud server, and speech recognition begins; in a large number of voice interaction tests on the Bluetooth intelligent cloud sound box, the speech endpoints are judged accurately.