CN115132197B - Data processing method, device, electronic equipment, program product and medium - Google Patents
Data processing method, device, electronic equipment, program product and medium
- Publication number
- CN115132197B (application number CN202210597334.6A)
- Authority
- CN
- China
- Prior art keywords
- time window
- command word
- voice
- voice data
- verification
- Prior art date
- Legal status
- Active
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice; G10L15/063—Training
- G10L2015/027—Syllables being the recognition units
- G10L2015/0638—Interactive procedures
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The embodiments of this application disclose a data processing method, apparatus, electronic device, program product and medium, which can be applied in the field of data processing technology. The method includes the following steps: determining whether the voice data of a target time window hits a command word according to the audio features respectively corresponding to the voice data of K voice frames in the target time window; when the voice data of the target time window hits a command word, determining a verification time window associated with the current voice frame; determining a first confidence level between the voice data in the verification time window and each command word, and determining an associated feature corresponding to the verification time window; and determining the hit result command word based on the first confidence level corresponding to each command word and the associated feature. Adopting the embodiments of this application improves the accuracy of command word detection in voice data. The embodiments of this application can also be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, assisted driving, and smart home appliances.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, electronic device, program product, and medium.
Background
Currently, voice detection technology is widely used, and a voice detection function is provided in many intelligent devices (such as vehicle-mounted systems, smart speakers, smart home appliances and the like). These devices can receive instructions issued in voice form, detect the instructions from the received voice data, and execute the corresponding operations. However, the inventors found in practice that, when detecting instructions in voice data, the accuracy of command word detection is low.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, electronic equipment, a program product and a medium, which are beneficial to improving the accuracy of command word detection of voice data.
In one aspect, an embodiment of the present application discloses a data processing method, where the method includes:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
determining a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the set of command words;
according to the audio characteristics respectively corresponding to the voice data of a plurality of voice frames in the verification time window, determining a first confidence level respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the associated characteristics corresponding to the verification time window based on the voice data of a plurality of voice frames in the verification time window;
and determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence coefficient corresponding to each command word and the association feature.
In one aspect, an embodiment of the present application discloses a data processing apparatus, the apparatus including:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for determining a target time window corresponding to a current voice frame and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
the processing unit is used for determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
the processing unit is further configured to determine a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the command word set;
the processing unit is further configured to determine, according to audio features respectively corresponding to the voice data of the plurality of voice frames in the verification time window, a first confidence level of each command word in the command word set, where the first confidence level corresponds to the voice data in the verification time window, and determine, based on the voice data in the verification time window, an associated feature corresponding to the verification time window;
the processing unit is further configured to determine a result command word that the voice data of the verification time window hits in the command word set based on the first confidence coefficient corresponding to each command word and the association feature.
In one aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is configured to store a computer program, the computer program including program instructions, and the processor is configured to perform the following steps:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
determining a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the set of command words;
according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the verification time window, determining a first confidence level respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the associated features corresponding to the verification time window based on the voice data in the verification time window;
and determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence coefficient corresponding to each command word and the association feature.
In one aspect, embodiments of the present application provide a computer readable storage medium having computer program instructions stored therein which, when executed by a processor, cause the processor to perform the following steps:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
determining a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the set of command words;
according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the verification time window, determining a first confidence level respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the associated features corresponding to the verification time window based on the voice data in the verification time window;
and determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence coefficient corresponding to each command word and the association feature.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions which, when executed by a processor, implement the method provided in one of the aspects above.
The embodiments of this application provide a data processing scheme that performs command word detection in two stages: primary detection and secondary verification. For example, according to the audio features corresponding to the voice data of the K voice frames in the target time window, it can be determined whether the voice data of the target time window hits a command word in the command word set. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined; the first confidence levels between the voice data in the verification time window and each command word in the command word set are determined, together with the associated features corresponding to the verification time window; and the result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence level corresponding to each command word and the associated features. Optionally, after the result command word is determined, the operation indicated by the result command word may also be performed. In this way, after the primary detection preliminarily determines, based on the target time window, that the voice data hits a command word, a secondary detection is performed: a new verification time window is determined to check whether the voice data contains the command word, and the associated features are added during this secondary verification, so that the command word hit by the verification time window is determined on the basis of more information, which improves the accuracy of command word detection in voice data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the effect of a target time window according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of another data processing method according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of yet another data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a first level detection network according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 8 is a flow chart of yet another data processing method according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another data processing method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiments of this application provide a data processing scheme that performs command word detection in two stages: primary detection and secondary verification. For example, according to the audio features corresponding to the voice data of the K voice frames in the target time window, it can be determined whether the voice data of the target time window hits a command word in the command word set. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined; the first confidence levels between the voice data in the verification time window and each command word in the command word set are determined, together with the associated features corresponding to the verification time window; and the result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence level corresponding to each command word and the associated features. Optionally, after the result command word is determined, the operation indicated by the result command word may also be performed. In this way, after the primary detection preliminarily determines, based on the target time window, that the voice data hits a command word, a secondary detection is performed: a new verification time window is determined to check whether the voice data contains the command word, and the associated features are added during this secondary verification, so that the command word hit by the verification time window is determined on the basis of more information, which improves the accuracy of command word detection in voice data.
In one possible implementation, the embodiments of this application may be applied to a data processing system. Referring to fig. 1, fig. 1 is a schematic structural diagram of a data processing system provided by an embodiment of this application. As shown in fig. 1, the data processing system may include a voice-initiating object and a data processing device. The voice-initiating object is used to send voice data to the data processing device; it may be a user, a device, or anything else that needs to request a response from the data processing device, which is not limited here. The data processing device executes the data processing scheme described above and performs the corresponding operations based on the received voice data; for example, the data processing device may be a vehicle-mounted system, a smart speaker, a smart home appliance, or the like. That is, after the voice-initiating object outputs voice data, the data processing device receives the voice data, detects the command word in the voice data based on the above data processing scheme, and then performs the operation corresponding to the detected command word. It can be understood that, before the data processing device detects voice data, a command word set may be preset, where the command word set includes at least one command word and each command word may be associated with a corresponding operation. For example, the operation of turning on the air conditioner is associated with the command word "turn on the air conditioner"; when the data processing device detects voice data containing the command word "turn on the air conditioner", it performs the operation of turning on the air conditioner. With the above data processing scheme, after the command word hit by the voice data is preliminarily determined based on the target time window, a new verification time window is determined to verify whether the voice data contains the command word, so the accuracy of command word detection by the data processing device in the data processing system is improved, and a user can conveniently and accurately instruct the data processing device by voice to perform the corresponding operation.
It should be noted that, before and during the collection of user-related data, this application may display a prompt interface or popup window, or output a voice prompt message, to inform the user that their data is currently being collected. The steps that obtain user-related data are only executed after the user's confirmation of the prompt interface or popup window is obtained; otherwise (i.e. when the user's confirmation of the prompt interface or popup window is not obtained), the steps that obtain user-related data are terminated and the data is not obtained. In other words, all user data collected in this application is collected with the user's consent and authorization, and the collection, use and processing of user data comply with the relevant laws, regulations and standards of the relevant countries and regions.
In one possible implementation, the embodiments of this application may be applied in the field of artificial intelligence. Artificial intelligence (AI) is the theory, method, technique and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In one possible implementation, the embodiments of this application may also be applied in the field of speech technology, such as the above-mentioned detection of the command word hit by voice data. The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most favored modes of human-computer interaction.
The technical solution of this application can be applied to an electronic device, such as the data processing device described above. The electronic device may be a terminal, a server, or another device that performs data processing, which is not limited in this application. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms. Terminals include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, smart speakers, and the like.
It can be understood that the above scenario is merely an example, and does not constitute a limitation on the application scenario of the technical solution provided in the embodiments of the present application, and the technical solution of the present application may also be applied to other scenarios. For example, as one of ordinary skill in the art can know, with the evolution of the system architecture and the appearance of new service scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
Based on the above description, the embodiments of the present application provide a data processing method. Referring to fig. 2, fig. 2 is a flow chart of a data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S201, determining a target time window corresponding to the current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window.
The current speech frame may be any speech frame in the acquired speech data. It is understood that the acquired voice data may be real-time voice; for voice data continuously input in real time, the current voice frame may be the latest voice frame in the continuously input voice data. The acquired voice data may also be non-real-time voice, for example a whole segment of voice data generated in advance; in that case, each voice frame may be determined as the current voice frame in turn, following the order of the voice frames in the voice data.
A speech frame may comprise several sampling points, i.e. the speech data of consecutive sampling points constitutes the speech data of one speech frame. It will be appreciated that the time difference between adjacent sampling points is the same. Two adjacent speech frames may share some sampling points or may contain completely different sampling points, which is not limited here. For example, in a 10 s segment of input voice data, one sampling point is taken every 10 ms, and every 20 consecutive sampling points are determined as one voice frame: the 1st to 20th sampling points form one voice frame, the 21st to 40th sampling points form the next voice frame, and so on, yielding a plurality of voice frames. As another example, in order to avoid too large a change between the audio data of two adjacent voice frames, adjacent voice frames may share a section of overlapping sampling points: the 1st to 20th sampling points form one voice frame, the 15th to 35th sampling points form the next voice frame, the 30th to 50th sampling points form the one after that, and so on, yielding a plurality of overlapping voice frames.
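As an illustration of the framing described above, the following sketch splits a stream of sampled amplitudes into frames that share some sampling points; the 20-sample frame length and 15-sample hop are illustrative values loosely following the example, not parameters fixed by this application.

```python
import numpy as np

def split_into_frames(samples, frame_len=20, hop=15):
    """Split sampled speech amplitudes into (possibly overlapping) frames.

    With hop < frame_len, adjacent frames share some sampling points,
    as in the overlapping-frame example above.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return np.array(frames)

# Example: 1 second of speech sampled every 10 ms -> 100 sampling points
samples = np.random.randn(100)
frames = split_into_frames(samples)
print(frames.shape)  # (number_of_frames, 20)
```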
The target time window corresponding to the current speech frame may be a time window using the current speech frame as a reference speech frame. Alternatively, the target time window corresponding to the current speech frame may include the current speech frame. The target time window may include a plurality of voice frames, for example, K voice frames may be included in the target time window, where K is a positive integer, that is, K may be the number of all voice frames in the target time window. Optionally, the K speech frames may also be selected speech frames from all speech frames in the target time window, i.e. K may be less than or equal to the number of all speech frames in the target time window, for example, after determining the target time window, energy of each speech frame in the target time window is calculated, and then speech frames with energy lower than a certain threshold are removed, so as to obtain the K speech frames, thereby filtering out some speech frames with very low sound and reducing the calculation amount in the subsequent processing procedure. The reference speech frame of a target time window indicates that the time window is divided based on the reference speech frame, for example, the reference speech frame may be the first speech frame, the last speech frame, or the speech frame of the center position of a time window, which is not limited herein. The first speech frame and the last speech frame are characterized according to time sequence, wherein the first speech frame represents the speech frame with the earliest input time in the time window, and the last speech frame represents the speech frame with the latest input time in the time window. The target time window corresponding to the current speech frame may be a time window with the current speech frame as the first speech frame, or may be a time window with the current speech frame as the last speech frame, or may be a time window with the current speech frame as the central position of the speech frame, which is not limited herein. K may be preset, or may be determined based on the length of the acquired voice, or may be determined based on the length of the command words in the command word set, such as the maximum length or the average length, and the like, which is not limited herein.
Alternatively, the target time window corresponding to the current speech frame may not include the current speech frame. For example, when the reference speech frame is the first speech frame of a time window, the next speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the first speech frame of the target time window is the next speech frame of the current speech frame; for another example, when the reference speech frame is the last speech frame of a time window, the previous speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the last speech frame of the target time window is the previous speech frame of the current speech frame, and so on, which will not be described herein.
In this application, the determination of the target time window and the verification time window will mainly be described by taking the case in which the current speech frame is the last speech frame (i.e. the reference speech frame) of its corresponding target time window as an example. For example, suppose the continuously input voice data includes the 1st, 2nd, 3rd, ..., Nth voice frames, the current voice frame is the 200th voice frame, the reference voice frame is the last voice frame of the time window, and the size of the target time window is 100 voice frames (i.e. the target time window corresponding to the current voice frame includes 100 voice frames, i.e. K is 100). Then the time window that takes the 200th speech frame as its last speech frame and has a size of 100 may be determined as the target time window corresponding to the 200th speech frame; that is, the 100 speech frames ending at the 200th speech frame (the 101st to 200th speech frames) are determined as the speech frames in the target time window corresponding to the 200th speech frame.
As another example, the target time window is illustrated here with a figure. Referring to fig. 3, fig. 3 is a schematic diagram illustrating the effect of the target time window according to an embodiment of this application. As shown in (1) in fig. 3, each voice frame in the received voice data may be represented as one block. If the gray block shown as 301 in fig. 3 is determined as the current voice frame and the size of the preset target time window is 8 voice frames, then the 8 voice frames up to and including 301 (i.e. including the voice frame indicated by 301) may be determined as the target time window corresponding to 301 (as shown as 302 in fig. 3). As voice data continues to be input, if no command word is hit in the time window indicated by 302, a new current voice frame may be determined based on a sliding step; for example, when the sliding step is 1, the voice frame following the voice frame indicated by 301 is determined as the new current voice frame (as indicated by 303 in (2) of fig. 3), so that the 8 voice frames up to and including 303 are determined as the target time window corresponding to 303 (as indicated by 304 in fig. 3), and so on, to achieve detection of command words in the continuously input voice data.
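The sliding of the target time window illustrated in fig. 3 can be sketched as follows; the window size of 8 frames and the sliding step of 1 are the illustrative values from the figure, and `hits_command_word` is a hypothetical detector callable standing in for the primary detection.

```python
from collections import deque

WINDOW_SIZE = 8   # number of speech frames in the target time window (illustrative)

def detect_stream(frame_iter, hits_command_word):
    """Slide the target time window over continuously input speech frames.

    `hits_command_word` is a hypothetical callable that takes the list of
    frames in the current target time window and returns True on a hit.
    """
    window = deque(maxlen=WINDOW_SIZE)
    for current_frame in frame_iter:       # each new frame becomes the current frame
        window.append(current_frame)       # the current frame is the window's last frame
        if len(window) < WINDOW_SIZE:
            continue                       # not enough frames for a full target window yet
        if hits_command_word(list(window)):
            yield list(window)             # hand the hit window to the secondary verification
```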
When the audio features corresponding to the voice data of the K voice frames in the target time window are acquired, a corresponding audio feature may be determined for the voice data of each voice frame. In one possible implementation, the audio feature may be the FBank feature (a type of audio feature of voice data). Specifically, since the voice data of one voice frame is a time-domain signal, to obtain the FBank feature corresponding to that voice frame, the time-domain signal of the voice data of the voice frame can be converted into a frequency-domain signal through the Fourier transform, and the FBank feature is then determined based on the computed frequency-domain signal, which is not described in detail here. It can be understood that the audio feature may also be a feature determined by other means, such as the MFCC feature (another type of audio feature of voice data), which is not limited here.
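A minimal sketch of FBank feature extraction for a single speech frame, assuming a 16 kHz sampling rate, a 512-point FFT and 40 mel filters; these parameters are illustrative assumptions, not values specified by this application.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(frame, sample_rate=16000, n_fft=512, n_mels=40):
    """Log mel filterbank (FBank) features for one windowed speech frame."""
    # Time domain -> frequency domain: power spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    # Triangular mel filterbank
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            filters[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            filters[m - 1, k] = (right - k) / max(right - center, 1)
    # Log energy in each mel band
    return np.log(filters @ spectrum + 1e-10)
```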
S202, determining whether the voice data of the target time window hit command words in the command word set according to the audio features respectively corresponding to the voice data of the K voice frames.
As described above, the command word set includes at least one command word. The voice data of the target time window is shorthand for the voice data of the K voice frames in the target time window; for example, the command word hit by the voice data of the target time window in the command word set refers to the command word hit in the command word set by the voice data of the K voice frames of the target time window. The command word hit by the voice data of the target time window in the command word set may also be described briefly as the command word hit by the target time window in the command word set.
In one possible implementation, step S202 may include the steps of: (1) and determining a second confidence coefficient corresponding to each command word in the command word set according to the audio characteristics respectively corresponding to the voice data of the K voice frames. (2) If the command words with the second confidence coefficient being greater than or equal to the first threshold value exist in the command word set, determining that the voice data of the target time window hits the command words in the command word set. (3) If the command words with the second confidence coefficient being greater than or equal to the first threshold value do not exist in the command word set, determining that the voice data of the target time window does not hit the command words in the command word set. Wherein the second confidence level may characterize a likelihood that the speech data of the target time window is each command word, each command word may have a corresponding second confidence level. The first threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable first threshold may be set to determine whether the voice data of the target time window hits the command word in the command word set. Optionally, in order to obtain better performance, different first thresholds may be set for command words of different lengths, so as to balance the detection rate and the false detection rate of the command words of different command lengths. It will be appreciated that there may be a plurality of second confidence levels greater than or equal to the first threshold, and that each command word corresponding to a second confidence level greater than or equal to the first threshold may be a command word for which the speech data of the target time window hits. For convenience of description, the command word hit by the target time window will be referred to as a first level command word in this application.
For example, the command word set includes a command word 1, a command word 2, a command word 3 and a command word 4, and a second confidence coefficient corresponding to each command word is obtained according to the audio features of K voice frames in the target time window, where the second confidence coefficient corresponding to the command word 1 is 0.3, the second confidence coefficient corresponding to the command word 2 is 0.75, the second confidence coefficient corresponding to the command word 3 is 0.45, the second confidence coefficient corresponding to the command word 4 is 0.66, and if the first threshold is 0.6, command words with the second confidence coefficient greater than or equal to the first threshold exist in the command word set, that is, the command words 2 and 4.
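Continuing the numerical example above, a short sketch of the first-level threshold comparison; the confidences and the threshold 0.6 come from the example, while the per-length threshold table is only one possible way, assumed here, of setting different first thresholds for command words of different lengths.

```python
second_confidence = {
    "command word 1": 0.30,
    "command word 2": 0.75,
    "command word 3": 0.45,
    "command word 4": 0.66,
}

FIRST_THRESHOLD = 0.6  # single global first threshold

# Optionally, a different first threshold per command-word length (illustrative values)
threshold_by_length = {2: 0.70, 3: 0.65, 4: 0.60}

def first_level_hits(confidences, command_lengths=None):
    """Return the command words whose second confidence >= the first threshold."""
    hits = []
    for word, conf in confidences.items():
        if command_lengths is not None:
            threshold = threshold_by_length.get(command_lengths[word], FIRST_THRESHOLD)
        else:
            threshold = FIRST_THRESHOLD
        if conf >= threshold:
            hits.append(word)
    return hits

print(first_level_hits(second_confidence))  # ['command word 2', 'command word 4']
```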
S203, when the voice data of the target time window hits the command words in the command word set, determining a verification time window associated with the current voice frame.
The verification time window may be a time window used to perform secondary verification of the command word, and the verification time window may include a plurality of voice frames. The verification time window and the target time window may include some of the same voice frames, but the voice frames they include need not be completely identical, which is not limited here. The range of the verification time window associated with the current speech frame should cover, as far as possible, the speech frames in the speech data that are involved in the command word hit by the target time window.
In one possible implementation, to determine the verification time window associated with the current speech frame, a first number of speech frames preceding the current speech frame may be determined, and the verification time window is then determined based on that first number of speech frames preceding the current speech frame. The first number may be determined in a number of ways. Specifically, the first number may be a preset number; the first number may also be determined based on the command word length (length for short) of the command word hit by the target time window; the first number may also be determined based on the earliest position at which the first-level command word occurs in the target time window, which is not limited here. It can be understood that this application takes the case in which the current speech frame is the last frame of the target time window as an example, and here the verification time window is determined according to the first number of speech frames before the current speech frame; if the current speech frame is the first speech frame of the target time window, the verification time window may be determined in another manner, e.g. according to the first number of speech frames after the current speech frame, which is not limited here.
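A sketch of determining the verification time window from a first number of speech frames before the current frame, mirroring the three options listed above; the concrete frame counts are assumptions for illustration.

```python
def verification_window(current_index, first_number):
    """Return (start, end) frame indices of the verification time window,
    taking the current frame as its last frame."""
    start = max(0, current_index - first_number + 1)
    return start, current_index

# Option 1: preset number of frames (illustrative)
PRESET_FIRST_NUMBER = 120

# Option 2: derive the first number from the hit command word's length,
# e.g. a per-syllable frame budget (illustrative constant)
FRAMES_PER_SYLLABLE = 25
def first_number_from_length(num_syllables):
    return num_syllables * FRAMES_PER_SYLLABLE

# Option 3: cover the earliest frame at which the first-level command word
# was observed inside the target time window
def first_number_from_earliest(current_index, earliest_hit_index):
    return current_index - earliest_hit_index + 1

print(verification_window(200, first_number_from_length(4)))  # (101, 200)
```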
In one possible implementation, when the voice data of the target time window does not hit any command word, the subsequent operations are not executed; instead, the target time window corresponding to a new current voice frame is determined and it is detected whether the voice data of the new target time window hits a command word, and so on, so that it is detected, for each voice frame, whether the voice data of its corresponding target time window hits a command word. In addition, when no command word is hit in the target time window, the subsequent secondary verification step is not executed, which improves the efficiency of data processing.
S204, according to the audio features respectively corresponding to the voice data of the voice frames in the verification time window, determining first confidence levels respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining associated features corresponding to the verification time window based on the voice data in the verification time window.
Wherein each command word herein refers to each command word in the command word set described above. The first confidence level may characterize a likelihood that the speech data of the verification time window is each command word, and each command word may have a corresponding first confidence level.
For the audio features corresponding to the voice data of the plurality of voice frames in the verification time window, the corresponding audio feature may be determined based on the voice data of each voice frame; the audio feature may be the FBank feature, and the detailed description above applies and is not repeated here.
In one possible implementation, when the electronic device receives continuously input voice data, the audio feature of each voice frame can be extracted and cached in a storage area; after the verification time window is determined, the audio features corresponding to the voice frames in the verification time window can be read directly from the storage area, which improves the efficiency of data processing because the audio features of the voice frames do not need to be recomputed. It can be understood that the number of audio features cached in the storage area can be determined according to the number of voice frames in the maximum verification time window, so that for a verification time window determined based on any first-level command word, the audio features of the voice frames in the verification time window can be quickly acquired from the storage area. The maximum verification time window may be the verification time window determined based on the command length of the longest command word in the command word set. It will be appreciated that, in order to avoid caching too much data, whenever a new speech frame is input, the cached audio feature of the earliest-input speech frame may be deleted, thereby avoiding waste of storage space.
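A minimal sketch of the feature cache described above, sized for the maximum verification time window so that the audio features of the earliest-input frame are dropped whenever a new frame arrives; the capacity value is an assumption.

```python
from collections import deque

class FeatureCache:
    """Caches the audio features of the most recent speech frames."""

    def __init__(self, max_verification_frames=200):  # sized for the maximum verification window
        self.buffer = deque(maxlen=max_verification_frames)  # oldest features drop automatically

    def push(self, frame_features):
        self.buffer.append(frame_features)

    def last(self, n):
        """Audio features of the last n frames, i.e. a window ending at the current frame."""
        return list(self.buffer)[-n:]
```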
The associated feature may be a related feature of the voice data in the verification time window, and the associated feature is different from the audio feature corresponding to each voice frame.
In one possible implementation, the associated feature includes at least one of the following: the first average energy of the speech data in the verification time window, the effective speech ratio of the speech data in the verification time window, the signal-to-noise ratio of the speech data in the verification time window, and the number of speech frames in the verification time window. It can be understood that the associated feature may also include other features, such as the command word length of the command word hit by the target time window, which is not limited here.
Specifically, determining the associated feature corresponding to the verification time window based on the voice data in the verification time window may include the following steps:
(1) The first average energy of the speech data of the verification time window is determined based on the energy of the speech data of each speech frame in the verification time window. Here, the energy of the voice data of each voice frame in the verification time window may be determined first, and the first average energy is then determined based on the energy of the voice data of each voice frame; for example, the first average energy of the speech data of the verification time window may be determined by the following formulas (formula 1 and formula 2):
Formula 1: p = Σ_{n=1}^{N} X(n)^2, where p represents the energy of the voice data of any voice frame in the verification time window, N represents the number of sampling points in one voice frame, and X(n) represents the amplitude value of the n-th sampling point in the voice frame; the energy of the voice data of each voice frame can thus be calculated according to formula 1.
Formula 2: P = (1/T) Σ_{t=1}^{T} p(t), where P represents the first average energy of the speech data of the verification time window, T represents the number of speech frames within the verification time window, and p(t) represents the energy of the t-th speech frame in the verification time window, which can be calculated by formula 1 above; Σ_{t=1}^{T} p(t) is the sum of the energies of the speech frames in the verification time window. The first average energy of the speech data of the verification time window can thus be calculated by formula 2.
(2) The effective speech ratio of the voice data of the verification time window is determined according to the number of effective speech frames in the verification time window, where an effective speech frame is a speech frame whose energy is greater than or equal to the first average energy. The effective speech ratio is the proportion of effective speech frames within the verification time window. For example, the effective speech ratio may be determined by the following formula (formula 3):
Formula 3: R = r / T, where R represents the effective speech ratio of the speech data of the verification time window, r represents the number of effective speech frames in the verification time window, and T represents the number of speech frames in the verification time window; the effective speech ratio can thus be calculated by formula 3.
(3) The signal-to-noise ratio of the voice data of the verification time window is determined according to the second average energy of the effective speech frames in the verification time window and the first average energy. The signal-to-noise ratio may be obtained by dividing the second average energy of the effective speech frames in the verification time window by the first average energy; specifically, it may be determined by the following formula (formula 4):
Formula 4: E-SNR = M / P, where E-SNR represents the signal-to-noise ratio of the speech data of the verification time window, P represents the first average energy of the speech data of the verification time window, and M represents the second average energy of the effective speech frames in the verification time window; the signal-to-noise ratio can thus be calculated by formula 4.
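A sketch that computes the associated features according to formulas 1-4 above (frame energy, first average energy P, effective speech ratio R and E-SNR), assuming each speech frame is given as an array of sample amplitudes.

```python
import numpy as np

def associated_features(frames):
    """Associated features of the speech data in a verification time window.

    `frames` is a list of 1-D arrays of sample amplitudes, one per speech frame.
    """
    energies = np.array([np.sum(np.asarray(f) ** 2) for f in frames])  # formula 1: p = sum_n X(n)^2
    T = len(frames)
    P = energies.mean()                                    # formula 2: first average energy
    effective = energies >= P                              # effective speech frames
    r = int(effective.sum())
    R = r / T                                              # formula 3: effective speech ratio
    M = energies[effective].mean() if r > 0 else 0.0       # second average energy
    e_snr = M / P if P > 0 else 0.0                        # formula 4: E-SNR = M / P
    return {"first_average_energy": P,
            "effective_speech_ratio": R,
            "e_snr": e_snr,
            "num_frames": T}
```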
In one possible implementation, when determining the first confidence level between the voice data of the verification time window and each command word, a first confidence level between the voice data of the verification time window and a garbage class may also be determined; that is, the first confidence level of the garbage class represents the likelihood that the voice data of the verification time window is not a command word. In other words, when determining the command word hit by the verification time window, the categories into which the voice data can be classified are the command words plus the garbage class.
S205, determining a command word of a result of the hit of the voice data of the verification time window in the command word set based on the first confidence coefficient and the associated feature corresponding to each command word.
The result command word refers to the command word hit by the voice data of the verification time window, and the result command word belongs to the command word set. It can be understood that the premise for determining the result command word hit by the voice data of the verification time window in the command word set is that the voice data of the verification time window actually hits a command word in the command word set; if the voice data of the verification time window does not hit any command word in the command word set, no result command word can be determined. Optionally, after the result command word is determined, the operation indicated by the result command word may be performed. The voice data of the verification time window is shorthand for the voice data of the voice frames in the verification time window; for example, the command word hit by the voice data of the verification time window in the command word set refers to the command word hit in the command word set by the voice data of the plurality of voice frames of the verification time window, and the result command word hit by the voice data of the verification time window in the command word set may also be described briefly as the result command word hit by the verification time window in the command word set. It can be understood that the result command word is determined not only according to the first confidence level corresponding to each command word, but also with the associated features introduced, so that more information is used when determining the result command word, serving as an effective supplement to the first confidence levels and improving the accuracy of command word detection. For example, by introducing the signal-to-noise ratio of the voice data of the verification time window, the command word hit by voice data under different signal-to-noise-ratio conditions can be determined more accurately; by introducing the first average energy of the voice data of the verification time window, the command word hit by voice data at different average energies can be determined more accurately; and by introducing the effective speech ratio of the voice data of the verification time window, the command word hit by voice data at different effective speech ratios can be determined more accurately. Moreover, since the result command word is determined from the first confidence levels and the associated features computed over the verification time window, which amounts to a secondary verification of whether a command word is hit in the continuously input voice data, the detection result for the voice data of the verification time window is taken as the final detection result, and if a result command word hit in the command word set is detected, the operation indicated by the result command word is performed. For example, if it is detected that the voice data of the verification time window hits the result command word "turn on heating", the operation of turning on heating may be performed.
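This application does not prescribe how the first confidence levels and the associated features are combined; one possible arrangement, shown purely as an assumption, is to concatenate them into a single input vector for a small second-level classifier.

```python
import numpy as np

def second_level_input(first_confidences, assoc):
    """Concatenate per-command-word first confidences (plus the garbage class)
    with the associated features of the verification time window."""
    conf_vec = np.array(first_confidences, dtype=np.float32)   # one score per command word + garbage
    assoc_vec = np.array([assoc["first_average_energy"],
                          assoc["effective_speech_ratio"],
                          assoc["e_snr"],
                          assoc["num_frames"]], dtype=np.float32)
    return np.concatenate([conf_vec, assoc_vec])
```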
In one possible implementation, if the voice data of the verification time window does not hit any result command word in the command word set, no operation may be performed. The target time window of a new current voice frame can then be determined and the above steps repeated, up to determining whether the voice data of the verification time window associated with the new current voice frame hits a command word in the command word set based on the audio features of the voice frames in that verification time window, thereby achieving detection for the time windows corresponding to the respective voice frames.
In one possible implementation, upon detecting the result command word hit by the verification time window, the result command word may also be used for other purposes, such as training other models with the extracted command word, storing the extracted command word, and so on, which is not limited here.
In a possible embodiment, the command word may further include time information, location information and the like, so that the corresponding operation may be performed at the time indicated by the time information and at the location indicated by the location information of the detected result command word. For example, when the result command word "turn on the air conditioner at 10 o'clock" is detected, where 10 o'clock is the time information of the command word, the operation of turning on the air conditioner may be performed at 10 o'clock. Alternatively, in a possible embodiment, time information, location information and the like in the voice may be acquired, so that the operation corresponding to the result command word may be performed at the time indicated by the detected time information and at the location indicated by the detected location information.
An example is described here to illustrate how command word detection for voice data can be implemented. Referring to fig. 4, fig. 4 is a schematic flow chart of another data processing method according to an embodiment of this application. First, voice data can be received and the target time window corresponding to the current voice frame in the received voice data is determined (i.e. step S401). It is then determined whether the target time window hits a command word in the command word set (i.e. step S402), specifically based on the audio features of the voice data of each voice frame in the target time window. If the target time window does not hit any command word in the command word set, no operation may be performed, and the target time window of a new current speech frame is determined (i.e. step S403). If the target time window hits a command word in the command word set, the secondary verification is performed: the verification time window associated with the current speech frame is determined (i.e. step S404), and it is then determined whether the verification time window hits a command word in the command word set (i.e. step S405). If the verification time window does not hit a command word in the command word set, no operation is performed and the target time window of a new current speech frame is determined (i.e. step S406); if the verification time window hits a command word in the command word set, the operation indicated by the hit command word is performed (i.e. step S407). In this way, the accuracy of command word detection in speech data can be improved by determining a verification time window to realize the secondary verification.
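A compact sketch of the flow of fig. 4 (steps S401-S407); `first_level`, `second_level` and `execute` are hypothetical callables, and the cache reuses the `last` method of the feature-cache sketch above.

```python
TARGET_WINDOW_SIZE = 100       # K, number of frames in the target time window (illustrative)
VERIFY_WINDOW_SIZE = 150       # first number of frames in the verification window (illustrative)

def process_frame(feature_cache, first_level, second_level, execute):
    """One pass of the two-stage detection for the current speech frame (fig. 4).

    `feature_cache` holds per-frame audio features of the most recent frames;
    `first_level` and `second_level` return a hit command word or None, and
    `execute` performs the operation associated with a command word.
    """
    target_features = feature_cache.last(TARGET_WINDOW_SIZE)    # S401: target time window
    hit = first_level(target_features)                          # S402: primary detection
    if hit is None:
        return None                                             # S403: wait for the next frame
    verify_features = feature_cache.last(VERIFY_WINDOW_SIZE)    # S404: verification time window
    result = second_level(verify_features)                      # S405: secondary verification
    if result is None:
        return None                                             # S406: wait for the next frame
    execute(result)                                             # S407: perform the operation
    return result
```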
In one possible scenario, this application may be applied to detecting whether received voice data hits a command word in the case where the electronic device has already been woken up, i.e. after the electronic device is woken up by the voice-initiating object through a wake word, the hit command word is detected based on the received voice data.
In one possible scenario, this application may also be applied to a scenario in which the electronic device does not need to be woken up first, that is, the electronic device directly determines whether a command word is hit according to the received voice data without being woken up by a wake word, which is equivalent to waking up the electronic device and executing the operation indicated by the command word when it is detected that the received voice data hits a command word in the command word set. Because the command words in the command word set are preset, the electronic device is triggered to execute the corresponding operation only when the voice data contains a command word, and the accuracy of command word detection is high, the voice-initiating object can instruct the electronic device to perform the corresponding operation through a voice instruction more quickly, without first waking up the device and then issuing the instruction. It can be understood that, in order to reduce the false recognition rate of command words, some less commonly used words can be chosen as the preset command words, or less commonly used word groups can be added to the command words, which can greatly improve the interaction experience.
The embodiments of this application provide a data processing scheme that performs command word detection in two stages: primary detection and secondary verification. For example, according to the audio features corresponding to the voice data of the K voice frames in the target time window, it can be determined whether the voice data of the target time window hits a command word in the command word set. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined; the first confidence levels between the voice data in the verification time window and each command word in the command word set are determined, together with the associated features corresponding to the verification time window; and the result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence level corresponding to each command word and the associated features. Optionally, after the result command word is determined, the operation indicated by the result command word may also be performed. In this way, after the primary detection preliminarily determines, based on the target time window, that the voice data hits a command word, a secondary detection is performed: a new verification time window is determined to check whether the voice data contains the command word, and the associated features are added during this secondary verification, so that the command word hit by the verification time window is determined on the basis of more information, which improves the accuracy of command word detection in voice data.
Referring to fig. 5, fig. 5 is a flowchart of another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S501, determining a target time window corresponding to the current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window.
S502, determining whether the voice data of the target time window hits a command word in the command word set according to the audio features respectively corresponding to the voice data of the K voice frames.
In one possible implementation, any command word in the command word set may have one or more syllables. A syllable is the most natural phonetic unit perceived by hearing, and is formed by combining one or more phonemes according to certain rules. In Mandarin, with a few exceptions, one Chinese character corresponds to one syllable; for example, the command word "turn on air conditioner" contains 4 syllables. Each command word in the command word set has a corresponding syllable identification sequence, that is, a sequence consisting of the syllable identifications of the syllables that the command word has, where a syllable identification can be used to characterize a syllable. In one possible implementation manner, the syllable identification sequence of each command word can be determined through a pronunciation dictionary, where the pronunciation dictionary is obtained by preprocessing and may include the mapping relationship between each word in a command word and the syllable identification of its syllable, so that the syllable identification of each syllable of each command word, that is, the syllables of the command word, can be determined according to the pronunciation dictionary. It will be appreciated that different words may have the same syllable; for example, the command words "play song" and "cancel heating" both include the syllable "qu".
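For illustration only, the pronunciation dictionary can be thought of as a mapping from the words of a command word to syllable identifications. The toy mapping, pinyin keys and syllable identifiers in the sketch below are assumptions, not part of the embodiment:

```python
# A minimal sketch, assuming a toy pronunciation dictionary; the pinyin keys and
# syllable identifiers are made up for illustration only.
PRONUNCIATION_DICT = {
    "da": "s1", "kai": "s2", "kong": "s3", "tiao": "s4",   # "turn on air conditioner"
    "bo": "s5", "fang": "s6", "ge": "s7", "qu": "s8",      # "play song"
}

def syllable_id_sequence(word_pronunciations):
    """Look up the syllable identification sequence of a command word from the
    pronunciations of its words."""
    return [PRONUNCIATION_DICT[w] for w in word_pronunciations]

# the four syllables of "turn on air conditioner"
print(syllable_id_sequence(["da", "kai", "kong", "tiao"]))  # ['s1', 's2', 's3', 's4']
```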
In one possible implementation, determining whether the speech data of the target time window hits a command word in the command word set may be accomplished by determining the probability that the speech data of each speech frame of the target time window corresponds to each syllable and then determining a second confidence level for each command word; it may also be accomplished through a Keyword/Filler HMM Model (a wake word detection model); or whether the voice data of the target time window hits a command word in the command word set may be determined by other methods, which is not limited herein.
In a possible implementation manner, as described above, the command word set includes at least one command word, each command word has a plurality of syllables, and if determining whether a command word is hit is implemented by calculating the probability that the speech data of each speech frame of the target time window corresponds to each syllable and determining the second confidence level of each command word, step S502 may include the following steps:
(1) According to the audio features respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set; the syllable output unit set is determined based on the plurality of syllables of each command word, and different syllable output units correspond to different syllables. The syllable output unit set is a set of classification items capable of classifying the syllable corresponding to the voice data of each voice frame, and includes a plurality of output units. For example, if the syllable output unit set includes syllable output units A, B and C, the voice data of each voice frame can be classified as A, B or C, so that the probabilities that the K voice frames respectively correspond to syllable output units A, B and C can be determined. The syllable output unit set determined based on the syllables of each command word may be determined based on the syllable identifications of the syllables of each command word; specifically, the union of the syllable identifications of the syllables of each command word is determined, and each syllable identification in the union corresponds to one syllable output unit. In one embodiment, the syllable output unit set further includes a garbage syllable output unit, so that syllables not belonging to any command word in the command word set can be classified into the garbage syllable output unit in the subsequent classification process. For example, when the command word set includes command words 1, 2 and 3, the syllable identifications of the syllables of command word 1 are S1, S2, S3 and S4, the syllable identifications of the syllables of command word 2 are S1, S4, S5 and S5, and the syllable identifications of the syllables of command word 3 are S7, S2, S3 and S1, then the union of the syllable identifications of the syllables of command words 1-3 is S1, S2, S3, S4, S5 and S7; the syllable output units corresponding to S1, S2, S3, S4, S5 and S7 are obtained respectively, and the syllable output units corresponding to each of these syllables together with the garbage syllable output unit are determined as the syllable output unit set (see the illustrative sketch after step (3) below).
(2) Determining the second confidence level of the voice data of the target time window corresponding to each command word according to the probabilities that the K voice frames respectively correspond to each syllable output unit. The second confidence level of any command word can be determined by finding, for each syllable of the command word, the maximum probability over the K voice frames, that is, the second confidence level is determined according to the product of the maximum probabilities corresponding to the syllables of the command word.
(3) If there is a command word in the command word set whose second confidence level is greater than or equal to the first threshold, determining the command word whose second confidence level is greater than or equal to the first threshold as the command word hit by the voice data of the target time window in the command word set. This step may refer to the above description and is not repeated here.
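The sketch below illustrates step (1)'s construction of the syllable output unit set as the union of the syllable identifications of all command words plus one garbage output unit, reusing the identifiers of the example above; the function name and representation are assumptions for illustration only.

```python
# A minimal sketch of building the syllable output unit set: the union of the
# syllable identifiers of all command words, plus one garbage output unit.
def build_output_units(command_word_id_sequences):
    units = []
    for seq in command_word_id_sequences:
        for syllable_id in seq:
            if syllable_id not in units:      # different units correspond to different syllables
                units.append(syllable_id)
    units.append("<garbage>")                 # garbage syllable output unit
    return units

command_words = {
    "command word 1": ["S1", "S2", "S3", "S4"],
    "command word 2": ["S1", "S4", "S5", "S5"],
    "command word 3": ["S7", "S2", "S3", "S1"],
}
print(build_output_units(command_words.values()))
# ['S1', 'S2', 'S3', 'S4', 'S5', 'S7', '<garbage>']
```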
In one possible implementation manner, denoting any command word in the command word set as a target command word, determining, according to the probabilities that the K speech frames respectively correspond to each syllable output unit, the second confidence level of the speech data of the target time window corresponding to each command word may specifically include the following steps:
(1) Determining the syllable output unit corresponding to each syllable of the target command word as a target syllable output unit, so as to obtain a plurality of target syllable output units corresponding to the target command word. A target syllable output unit is a syllable output unit corresponding to a syllable of the target command word, and the target syllable output units can be determined through the syllable identification sequence of the target command word: because each syllable output unit has a corresponding syllable, the target syllable output units can be determined from the plurality of syllable output units through the syllable identifications in the syllable identification sequence. For example, if the target command word is "turn on heating", the syllable identifications s1, s2, s3 and s4 of its syllables (which may also be referred to as the syllable identification sequence of the target command word) can be determined from the pronunciation dictionary, the syllable output units corresponding to s1, s2, s3 and s4 can be determined from the syllable output unit set through the syllable identification sequence, and these syllable output units are taken as the target syllable output units.
(2) From the probabilities that the K voice frames respectively correspond to each syllable output unit, determining the probabilities that the K voice frames respectively correspond to each target syllable output unit, so as to obtain K candidate probabilities corresponding to each target syllable output unit. A candidate probability is the probability that a target syllable output unit corresponds to one of the voice frames. For example, if the target syllable output units are the syllable output units corresponding to s1, s2, s3 and s4 (denoted here as syllable output units s1, s2, s3 and s4), the probabilities of s1 for the K speech frames, the probabilities of s2 for the K speech frames, the probabilities of s3 for the K speech frames and the probabilities of s4 for the K speech frames can be determined, and the total number of obtained candidate probabilities is K×4.
(3) Determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities corresponding to that target syllable output unit, and determining the second confidence level of the voice data of the target time window corresponding to the target command word according to the maximum candidate probability corresponding to each target syllable output unit. The second confidence level of the voice data of the target time window corresponding to the target command word is determined according to the maximum candidate probabilities corresponding to the target syllable output units, specifically according to the product of those maximum candidate probabilities: the product of the maximum candidate probabilities may be directly determined as the second confidence level, or the second confidence level may be obtained through other mathematical calculations, which is not limited herein. For example, the probabilities of s1 for the K speech frames are {G1_1, G1_2, G1_3, ..., G1_K}, and the maximum is the probability G1_10 corresponding to the 10th speech frame in the target time window; the probabilities of s2 for the K speech frames are {G2_1, G2_2, G2_3, ..., G2_K}, and the maximum is the probability G2_25 corresponding to the 25th speech frame; the probabilities of s3 for the K speech frames are {G3_1, G3_2, G3_3, ..., G3_K}, and the maximum is the probability G3_34 corresponding to the 34th speech frame; the probabilities of s4 for the K speech frames are {G4_1, G4_2, G4_3, ..., G4_K}, and the maximum is the probability G4_39 corresponding to the 39th speech frame. The second confidence level of the voice data of the target time window corresponding to the target command word is then determined according to the product of G1_10, G2_25, G3_34 and G4_39. It will be appreciated that performing the above operations on each command word in the command word set determines the second confidence level of each command word.
In one possible implementation manner, the second confidence level of the voice data of the target time window corresponding to the target command word, determined according to the maximum candidate probability corresponding to each target syllable output unit, may be calculated by the following formula (formula 5):

C = \left( \prod_{i=1}^{n-1} \max_{1 \le j \le K} p_{ij} \right)^{\frac{1}{n-1}}

where C represents the second confidence level of the voice data of the target time window corresponding to the target command word; n-1 represents the number of target syllable output units corresponding to the target command word, and n represents the total number of target syllable output units and garbage syllable output units; i represents the i-th target syllable output unit and j represents the j-th speech frame of the target time window, so that p_{ij} represents the probability of the i-th target syllable output unit for the j-th speech frame, \max_{j} p_{ij} represents the maximum candidate probability of the i-th target syllable output unit over the speech frames, and \prod_{i=1}^{n-1} \max_{j} p_{ij} represents the product of the maximum candidate probabilities corresponding to the target syllable output units. The second confidence level of the voice data of the target time window corresponding to the target command word can thus be obtained based on formula 5.
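The sketch below computes the second confidence level of one command word from a K×n matrix of per-frame probabilities, following the geometric-mean form of formula 5 as reconstructed above (that form, the variable names and the random toy data are assumptions, not the embodiment's exact definition):

```python
import numpy as np

# A minimal sketch of the second confidence computation, assuming `probs` is a
# K x n matrix of per-frame posteriors over the syllable output unit set (last
# column = garbage unit) and `target_units` holds the column indices of the
# target syllable output units of one command word.
def second_confidence(probs: np.ndarray, target_units: list[int]) -> float:
    # maximum candidate probability for each target syllable output unit
    max_per_unit = probs[:, target_units].max(axis=0)       # shape: (n-1,)
    # geometric mean of the maxima, i.e. the (n-1)-th root of their product
    return float(np.exp(np.log(max_per_unit).mean()))

# usage: K = 40 frames, 7 output units, 4 target units for one command word
probs = np.random.dirichlet(np.ones(7), size=40)             # toy posteriors
print(second_confidence(probs, target_units=[0, 1, 2, 3]))
```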
In one possible implementation, whether the target time window hits a command word is determined by a trained primary detection network. In one implementation, the trained primary detection network may be divided into an acoustic model and a confidence generation module. The acoustic model is used to perform the above step of determining, according to the audio features respectively corresponding to the voice data of the K voice frames, the probabilities that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set. The acoustic model is typically implemented using a deep neural network, such as a DNN model, a CNN model or an LSTM model (each a neural network model), which is not limited herein. The confidence generation module may be used to perform the above step of determining, based on the probabilities that the K voice frames respectively correspond to each syllable output unit, the second confidence level of the voice data of the target time window corresponding to each command word, which is not described in detail here. Optionally, the dimension of the result output by the primary detection network is the number of command words in the command word set, and each dimension corresponds to the second confidence level of one command word.
For example, referring to fig. 6, fig. 6 is a schematic diagram of a framework of a primary detection network according to an embodiment of the present application. As shown in fig. 6, the voice data of the K voice frames in the target time window (601 in fig. 6) may first be acquired, and the audio feature of each voice frame (602 in fig. 6) is determined based on 601. The audio features of the voice frames are then input into the acoustic model of the trained primary detection network (603 in fig. 6), and the result obtained from the acoustic model is input into the confidence generation module (604 in fig. 6), so that the confidence generation module, in combination with the pronunciation dictionary (605 in fig. 6), determines the target syllable output units corresponding to the syllables of each command word and further determines the second confidence level corresponding to each command word, such as the command word 1 confidence level, the command word 2 confidence level, ..., the command word m confidence level, thereby determining whether the voice data of the target time window hits a command word and which first-level command word is hit. It will be appreciated that if no command word has a second confidence level greater than or equal to the first threshold, the voice data of the target time window has no hit first-level command word.
In one possible implementation manner, before the first-level command word is determined through the trained primary detection network, the primary detection network needs to be trained, which may specifically include the following steps: (1) Acquiring first sample voice data carrying syllable output unit labels. The first sample voice data is voice data used for training the primary detection network; it may be voice data containing command words, that is, positive sample data, or voice data not containing command words, that is, negative sample data, so that training with both positive and negative sample data yields a better training effect. The syllable output unit labels are used for labeling the syllable output unit actually corresponding to each voice frame in the first sample voice data. It can be understood that if a voice frame in the first sample voice data actually corresponds to a syllable of a command word in the command word set, the syllable output unit actually corresponding to that voice frame is the syllable output unit corresponding to that syllable; if the voice frame does not correspond to any syllable of the command words in the command word set, the syllable output unit actually corresponding to that voice frame is the garbage syllable output unit.
(2) Calling the initial primary detection network to determine the predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data. The initial primary detection network also includes an acoustic model; the predicted syllable output units may be determined by the acoustic model in the initial primary detection network, which determines, according to the audio feature corresponding to the voice data of each voice frame in the first sample voice data, the probability that each voice frame corresponds to each syllable output unit in the syllable output unit set, and then determines the predicted syllable output unit based on those probabilities. The audio feature corresponding to the voice data of each voice frame in the first sample voice data is calculated in the same manner as the audio feature corresponding to each voice frame in the target time window, which is not described herein.
(3) Training based on the predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data and the syllable output unit labels, so as to obtain the trained primary detection network. During training, the network parameters of the initial primary detection network are adjusted so that the predicted syllable output unit corresponding to each voice frame gradually approaches the actual syllable output unit marked by the syllable output unit label, so that the trained primary detection network can accurately predict the probability of each voice frame corresponding to each syllable output unit. It will be appreciated that the predicted syllable output unit here is determined by the acoustic model in the primary detection network, that is, training the primary detection network primarily adjusts the model parameters of the acoustic model in the primary detection network.
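A minimal sketch of this training step is given below, assuming per-frame audio features and per-frame syllable output unit labels; the network architecture, layer sizes and optimizer are illustrative assumptions rather than the embodiment's exact configuration.

```python
import torch
import torch.nn as nn

# A minimal sketch of training the acoustic model of the primary detection
# network, assuming `x` holds per-frame audio features (batch, frames, feat_dim)
# and `y` holds per-frame syllable output unit labels (batch, frames), as LongTensor.
class AcousticModel(nn.Module):
    def __init__(self, feat_dim: int, num_units: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_units),        # one logit per syllable output unit (incl. garbage)
        )

    def forward(self, x):                      # (batch, frames, feat_dim)
        return self.net(x)                     # (batch, frames, num_units)

def train_step(model, optimizer, x, y):
    logits = model(x)
    # predicted syllable output unit vs. syllable output unit label, per frame
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), y.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```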
In one possible implementation manner, if determining whether a command word is hit is implemented by a Keyword/Filler HMM Model, the above primary detection network may be the Keyword/Filler HMM Model. The probabilities that the K speech frames respectively correspond to each syllable output unit in the syllable output unit set can be determined according to the audio features respectively corresponding to the speech data of the K speech frames, the optimal decoding path is then determined based on the probabilities corresponding to the syllable output units, and whether a command word is hit is determined by checking whether the optimal decoding path passes through the HMM path (hidden Markov path) of a command word; alternatively, the confidence level corresponding to each HMM path is determined based on the probabilities corresponding to the syllable output units so as to determine whether a command word is hit, which is not limited herein. It is understood that an HMM path may be a command word HMM path or a filler HMM path, where each command word HMM path may be composed of the HMM states corresponding to the syllables of a command word connected in series, and the filler HMM path is composed of the HMM states corresponding to a set of carefully designed non-command-word pronunciation units. The confidence level corresponding to each HMM path can thus be determined based on the probability corresponding to each syllable output unit, thereby determining whether a command word is hit and which command word is hit.
S503, when the voice data of the target time window hits a command word in the command word set, determining a verification time window associated with the current voice frame.
In one possible implementation, the first number may be determined according to the length of the command word hit in the target time window; for example, the first number may be determined based on the command word length and a target preset value, so that the verification time window is determined according to the first number of voice frames before the current voice frame. The command word length refers to the number of syllables in the command word. For a typical Chinese command word, one word corresponds to one syllable; for example, the command word "turn on air conditioner" includes four words and correspondingly 4 syllables, that is, its command word length is 4. Specifically, the verification time window may be determined according to the command word length of the first-level command word and the target preset value, which may include the following steps:
(1) Determining the first number according to the command word length of the first-level command word and the target preset value. The target preset value may be a preset value because, in general, the pronunciation of one word (one syllable) may span a plurality of speech frames depending on the speaking speed and the like, and the number of speech frames spanned by the syllables of a command word is greater than or equal to the number of syllables of the command word; therefore, the first number is determined with the target preset value so that the verification time window covers the speech frames involved in the first-level command word as far as possible. In one possible implementation manner, the first number may be obtained by multiplying the command word length of the first-level command word by the target preset value, so that the number of voice frames contained in the obtained verification time window is the first number. For example, if the command word length of the first-level command word is 4 and the target preset value is 25, the first number may be 4×25=100, that is, 100 speech frames are included in the verification time window.
(2) Determining the verification time window associated with the current speech frame according to the first number of speech frames before the current speech frame. The first number of speech frames before the current speech frame includes the current speech frame, and the verification time window associated with the current speech frame is determined according to these speech frames, with the current speech frame as the last frame of the verification time window. For example, for continuously input voice data comprising the 1st, 2nd, 3rd and subsequent speech frames, the verification time window consists of the first number of consecutive speech frames ending at the current speech frame.
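A minimal sketch of steps (1)-(2) follows: the first number is the product of the command word length of the first-level command word and the target preset value, and the verification time window is the first number of frames ending at the current speech frame. The value 25 is taken from the example above; frame indices and function names are illustrative assumptions.

```python
TARGET_PRESET_VALUE = 25   # assumed preset value (frames per syllable) from the example above

def verification_window(current_frame_index: int, command_word_length: int):
    first_number = command_word_length * TARGET_PRESET_VALUE
    start = max(0, current_frame_index - first_number + 1)
    # the window ends at the current speech frame (inclusive)
    return list(range(start, current_frame_index + 1))

# e.g. a 4-syllable first-level command word hit at the current frame 300
window = verification_window(current_frame_index=300, command_word_length=4)
print(len(window), window[0], window[-1])   # 100 201 300
```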
In one possible embodiment, as mentioned above, the first number may also be a preset number, and the preset number should cover the first-level command word as far as possible; the preset number may be set based on the longest command word length in the command word set. Specifically, the preset number may be determined based on the longest command word length and the target preset value, the preset number is determined as the first number, and the verification time window is then determined according to the first number of voice frames before the current voice frame.
In one possible implementation manner, the first number may also be determined according to the earliest occurrence time of the first-level command word hit in the target time window, and determining the verification time window may then specifically include the following steps: (1) Acquiring the syllable output unit set, where the syllable output unit set is determined based on the plurality of syllables of each command word, and different syllable output units correspond to different syllables. (2) Determining, according to the audio features respectively corresponding to the voice data of the K voice frames, the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set. The relevant descriptions of (1)-(2) here refer to the above descriptions and are not repeated. (3) Determining the syllable output units corresponding to the syllables of the command word hit by the voice data of the target time window as verification syllable output units, and determining, for each verification syllable output unit, the voice frame with the highest probability for that unit among the K voice frames as a target voice frame. A target voice frame corresponds to the voice frame in which a syllable of the first-level command word is detected among the K voice frames, and can therefore be used to determine when the first-level command word occurred. (4) Determining the verification time window associated with the current voice frame according to the voice frames between the target voice frame and the current voice frame. Specifically, the verification time window may be determined according to the target voice frame that is separated from the current voice frame by the largest number of voice frames: that target voice frame represents the earliest occurrence time of the first-level command word in the target time window, the first number is the number of voice frames between it and the current voice frame, and the voice frames between the current voice frame and that target voice frame are determined as the voice frames in the verification time window. It is understood that the voice frames between the current voice frame and the target voice frame include the current voice frame and the target voice frame. In this way, a more accurate verification time window can be determined, so that command word detection on the voice data in the verification time window is more accurate. For example, for continuously input voice data comprising the 1st, 2nd, 3rd and subsequent speech frames, the verification time window consists of the speech frames from the earliest target voice frame up to the current voice frame.
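A minimal sketch of steps (3)-(4) is given below: for each verification syllable output unit, the frame with the highest probability is taken as a target voice frame, and the verification time window runs from the earliest such frame up to the current speech frame. Variable names and shapes are assumptions for illustration only.

```python
import numpy as np

# A minimal sketch, assuming `probs` is the K x n matrix of per-frame posteriors
# for the target time window, `window_frame_indices` holds the absolute indices
# of those K frames, and `verification_units` holds the column indices of the
# hit command word's verification syllable output units.
def window_from_earliest_occurrence(probs: np.ndarray,
                                    window_frame_indices: list[int],
                                    verification_units: list[int],
                                    current_frame_index: int):
    # for each verification syllable output unit, the frame with the highest
    # probability is a target voice frame
    target_frames = [window_frame_indices[int(probs[:, u].argmax())]
                     for u in verification_units]
    earliest = min(target_frames)   # the target frame farthest from the current frame
    # the verification time window spans from that target voice frame up to and
    # including the current speech frame
    return list(range(earliest, current_frame_index + 1))
```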
In a possible implementation manner, the command word set may include command words of different lengths, and there may be command words with identical prefixes or easily confused similar words; for example, "turn on heating" and "turn on heating mode" are two command words with the same prefix but different indicated operations. In the actual processing, since voice data is input frame by frame, it is possible that, when the current voice frame is the speech frame immediately after the input of "turn on heating" is completed, the hit "turn on heating" command word is detected based on the target time window corresponding to the current voice frame, while the command word actually intended is "turn on heating mode". Therefore, a section of speech frames after "turn on heating" is also included in the verification time window, so as to perform more accurate command word detection. Taking the current voice frame being the last frame of the target time window as an example, when determining the verification time window associated with the current voice frame, a section of speech frames after the current voice frame may also be determined as voice frames in the verification time window; that is, a delay waiting strategy is introduced when determining the verification time window. Even if an early misrecognition occurs when determining the command word through the target time window, the determined verification time window can, owing to the delay waiting strategy, cover a larger time window, and the correct command word can still be accurately identified during the secondary verification based on the verification time window, thereby improving the command word recognition accuracy.
Specifically, determining the verification time window associated with the current speech frame may include the following step: determining the verification time window associated with the current speech frame according to a first number of speech frames before the current speech frame and a second number of speech frames after the current speech frame. The first number of speech frames before the current speech frame includes the current speech frame, and the second number of speech frames after the current speech frame also includes the current speech frame, but the plurality of speech frames in the verification time window includes only one instance of the current speech frame. The second number may be a preset value, which may be an empirical value, or may be determined according to the command word lengths of the longest command word in the command word set and of the first-level command word; specifically, the length difference may be obtained by subtracting the command word length of the first-level command word from the command word length of the longest command word, and the length difference is then multiplied by the target preset value. For example, if the command word length of the longest command word is 8 and the command word length of the first-level command word is 5, the length difference is 8-5=3; if the target preset value is 25, 3×25=75 is obtained, and the second number is 75. As an illustration of how the verification time window is determined in this case, for continuously input voice data comprising the 1st, 2nd, 3rd and subsequent speech frames, the verification time window consists of the first number of speech frames ending at the current speech frame followed by the second number of speech frames counted from the current speech frame.
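A minimal sketch of this delay waiting strategy is shown below: the verification time window covers a first number of frames ending at the current speech frame plus a second number of frames counted from the current speech frame onwards. The preset value of 25 is carried over from the earlier example and, like the helper name, is an assumption.

```python
TARGET_PRESET_VALUE = 25   # assumed preset value (frames per syllable)

def verification_window_with_lookahead(current_frame_index: int,
                                       hit_command_word_length: int,
                                       longest_command_word_length: int):
    first_number = hit_command_word_length * TARGET_PRESET_VALUE
    length_difference = longest_command_word_length - hit_command_word_length
    second_number = length_difference * TARGET_PRESET_VALUE
    start = max(0, current_frame_index - first_number + 1)
    # the second number counts from the current frame, which appears only once
    end = current_frame_index + second_number - 1
    return list(range(start, end + 1))

# e.g. longest command word has 8 syllables, the hit first-level command word 5
w = verification_window_with_lookahead(500, 5, 8)
print(len(w))   # 125 + 75 - 1 = 199
```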
S504, according to the audio features respectively corresponding to the voice data of the voice frames in the verification time window, determining first confidence levels respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining associated features corresponding to the verification time window based on the voice data in the verification time window.
The description of step S504 may refer to the description of step S204, which is not described herein.
S505, determining a third confidence coefficient of the voice data of the verification time window and corresponding to each command word based on the first confidence coefficient and the associated feature corresponding to each command word.
The third confidence level may represent the likelihood that the voice data of the verification time window is each command word, and each command word has a corresponding third confidence level. It can be appreciated that the third confidence level is equivalent to a calibration of the first confidence level; owing to the addition of the associated features, the obtained third confidence level can more accurately represent the likelihood that the voice data of the verification time window is each command word, and determining the hit command word based on the third confidence level is more accurate than determining it directly from the first confidence level.
Determining the third confidence level of the voice data of the verification time window corresponding to each command word based on the first confidence level corresponding to each command word and the associated features may specifically be: performing splicing processing on the first confidence level corresponding to each command word and the associated features to obtain a verification feature, and determining the third confidence level of the voice data of the verification time window corresponding to each command word based on the verification feature. The verification feature is a feature obtained by splicing the first confidence levels corresponding to the command words with other information features, where the other information features may be the associated features.
S506, if there is a command word in the command word set whose third confidence level is greater than or equal to the second threshold, determining the command word whose third confidence level is greater than or equal to the second threshold and is the largest as the result command word hit by the voice data of the verification time window in the command word set, and executing the operation indicated by the result command word.
The second threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable second threshold may be set for determining the result command word. It can be appreciated that if no command word in the command word set has a third confidence level greater than or equal to the second threshold, it is determined that the voice data of the verification time window has no hit result command word in the command word set. Optionally, after the result command word is determined, the operation indicated by the result command word may be performed.
In one possible embodiment, if, when the first confidence levels are determined, the first confidence level of the voice data of the verification time window corresponding to the garbage class is also determined, then the third confidence level of the voice data of the verification time window corresponding to the garbage class may also be determined when the third confidence levels are determined. In that case, the maximum third confidence level is determined among the third confidence levels other than the third confidence level corresponding to the garbage class; if the maximum third confidence level is greater than or equal to the second threshold, the command word corresponding to the maximum third confidence level is determined as the hit result command word, and if the maximum third confidence level is less than the second threshold, the voice data of the verification time window is classified into the garbage class, that is, the voice data of the verification time window has no hit result command word in the command word set.
For example, the command word set includes command word 1, command word 2, command word 3 and command word 4, and the third confidence level corresponding to each command word is obtained based on the first confidence level corresponding to each command word and the associated features: the third confidence level corresponding to command word 1 is 0.3, the third confidence level corresponding to command word 2 is 0.73, the third confidence level corresponding to command word 3 is 0.42, the third confidence level corresponding to command word 4 is 0.58, and the third confidence level corresponding to the garbage class is 0.61. If the preset second threshold is 0.6, there is a command word in the command word set whose third confidence level is greater than or equal to the second threshold, namely command word 2, and command word 2 is the result command word hit by the voice data of the verification time window in the command word set; that is, the input voice data hits command word 2, so the operation indicated by command word 2 can be executed. If the preset second threshold is 0.75, no command word in the command word set has a third confidence level greater than or equal to the second threshold, and it is determined that the voice data of the verification time window has no hit result command word in the command word set; a new current voice frame is then determined so as to repeat the above steps and continue command word detection.
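A minimal sketch of this selection step is given below, reusing the numbers of the example above; the garbage class is excluded from the maximum and the second threshold is applied as described (function and key names are illustrative).

```python
# A minimal sketch of selecting the result command word from the third confidences.
def select_result_command_word(third_confidences: dict, second_threshold: float):
    candidates = {w: c for w, c in third_confidences.items() if w != "garbage"}
    best_word = max(candidates, key=candidates.get)
    if candidates[best_word] >= second_threshold:
        return best_word                 # result command word hit
    return None                          # classified as garbage, no hit

scores = {"command word 1": 0.3, "command word 2": 0.73,
          "command word 3": 0.42, "command word 4": 0.58, "garbage": 0.61}
print(select_result_command_word(scores, 0.6))    # command word 2
print(select_result_command_word(scores, 0.75))   # None
```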
In one possible implementation, the result command word is determined by a trained secondary detection network, which may include a first confidence generation network and a confidence calibration network. The first confidence generation network is used to perform the above step of determining, according to the audio features respectively corresponding to the voice data of the plurality of voice frames in the verification time window, the first confidence levels of the voice data of the verification time window corresponding to each command word in the command word set; the first confidence generation network may be a deep neural network, for example a CLDNN model (a neural network model). Optionally, the dimension of the result output by the first confidence generation network is the number of command words in the command word set plus 1, where the added 1 is the dimension of the first confidence level corresponding to the garbage class. The confidence calibration network is used to perform the above step of determining, based on the first confidence level corresponding to each command word and the associated features, the result command word hit by the voice data of the verification time window in the command word set; the confidence calibration network may be a simple multi-layer neural network, such as a multi-layer DNN network (a neural network model). How the secondary detection network determines the result command word may refer to the related descriptions of steps S504-S505, which are not repeated here. In one implementation manner, when the secondary detection network is called to determine the hit result command word according to the audio features respectively corresponding to the voice data of the plurality of voice frames in the verification time window, the voice data of the plurality of voice frames in the verification time window may be input sequentially, so as to obtain the first confidence level of the voice data of the verification time window corresponding to each command word, and the third confidence levels are then determined according to the first confidence levels and the associated features, so as to obtain the hit result command word. Optionally, the dimension of the result output by the secondary detection network is the number of command words in the command word set plus 1, where the added 1 is the dimension of the confidence level corresponding to the garbage class.
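A minimal sketch of such a secondary detection network follows, with a first confidence generation network over the frames of the verification time window and a small calibration network over the concatenation of the first confidences and the associated features. The use of a GRU (rather than the CLDNN mentioned above), the layer sizes and the associated-feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch of the secondary detection network, assuming m command words
# plus a garbage class.
class SecondaryDetectionNetwork(nn.Module):
    def __init__(self, feat_dim: int, num_command_words: int, assoc_dim: int):
        super().__init__()
        out_dim = num_command_words + 1                  # + garbage class
        self.first_confidence_net = nn.GRU(feat_dim, 64, batch_first=True)
        self.first_confidence_head = nn.Linear(64, out_dim)
        self.calibration_net = nn.Sequential(            # confidence calibration network
            nn.Linear(out_dim + assoc_dim, 32), nn.ReLU(),
            nn.Linear(32, out_dim),
        )

    def forward(self, frames, assoc_features):
        # frames: (batch, num_frames, feat_dim); assoc_features: (batch, assoc_dim)
        _, h = self.first_confidence_net(frames)
        first_conf = torch.softmax(self.first_confidence_head(h[-1]), dim=-1)
        verification_feature = torch.cat([first_conf, assoc_features], dim=-1)
        third_conf = torch.softmax(self.calibration_net(verification_feature), dim=-1)
        return first_conf, third_conf
```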
In one possible implementation manner, before the result command word is determined through the trained secondary detection network, the secondary detection network needs to be trained, which may specifically include the following steps: (1) Acquiring second sample voice data, where the second sample voice data carries a command word label. The second sample voice data refers to voice data used for training the secondary detection network, and may be positive sample data or negative sample data. The positive sample data may be audio data in verification time windows determined based on the trained primary detection network. The negative sample data may be voice data containing various non-command words. The negative sample data may also be audio data with interference noise, such as synthesized or real audio data with noise such as music or television added in various far-field environments, so that the accuracy of command word detection in far-field or noisy environments can be improved. It can be understood that in the training process of the primary detection network, the negative data adopted does not include audio data with various kinds of interference noise, because training the primary detection network with such audio data degrades its classification of syllable output units; therefore, the audio data with interference noise is introduced when training the secondary detection network, which improves the accuracy of command word detection under interference, effectively compensates for the shortcoming of the primary detection network, and gives the secondary detection network good complementarity to the primary detection network. The command word label marks the command word actually corresponding to the second sample voice data; it can be understood that if the second sample voice data actually has a corresponding command word, the command word label marks that command word, and if the second sample voice data actually has no corresponding command word, the command word label marks that the second sample voice data actually belongs to the garbage class.
(2) Calling the secondary detection network to determine the predicted command word corresponding to the second sample voice data. The predicted command word may be determined by the initial secondary detection network, specifically by determining, according to the audio features corresponding to the voice data of each voice frame in the second sample voice data, the first confidence level of the second sample voice data corresponding to each command word, and then determining the predicted command word corresponding to the second sample voice data based on the first confidence level corresponding to each command word and the associated features. It can be understood that if the trained secondary detection network determines the predicted command word corresponding to the second sample voice data based on the first confidence level corresponding to each command word, the associated features and the second confidence level corresponding to each command word, then when training the secondary detection network, it also needs to be trained with the first confidence level corresponding to each command word, the associated features and the second confidence level corresponding to each command word determined based on the second sample voice data. The audio feature corresponding to the voice data of each voice frame in the second sample voice data is calculated in the same manner as the audio feature corresponding to each voice frame in the target time window, which is not described herein.
(3) Training based on the predicted command word and the command word label to obtain the trained secondary detection network. During training, the network parameters of the initial secondary detection network are adjusted so that the predicted command word corresponding to the second sample voice data gradually approaches the actual command word marked by the command word label, so that the trained secondary detection network can accurately predict the command word corresponding to the voice data in each verification time window.
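Continuing the sketch above, a possible training step for the secondary detection network with command word labels is shown below; the loss formulation and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch of one training step, assuming `model` is an instance of the
# SecondaryDetectionNetwork sketched above and `label` holds the command word
# label indices (last index = garbage class) for a batch of second sample voice data.
def secondary_train_step(model, optimizer, frames, assoc_features, label):
    _, third_conf = model(frames, assoc_features)          # predicted command word distribution
    loss = nn.functional.nll_loss(torch.log(third_conf + 1e-8), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```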
It can be understood that, because the hardware configuration, such as the CPU (central processing unit), memory and flash memory, of electronic devices that need to detect instructions in voice data is generally low, strict requirements are placed on the resource occupation of each function. In the present application, command word detection in voice data is mainly performed through the trained primary detection network and secondary detection network; the network structures are relatively simple, occupy few resources of the electronic device, and can effectively improve command word detection performance. Compared with recognizing the content of the received voice data based on speech recognition technology, which achieves a good recognition effect only with a relatively large acoustic model and language model, the present application can accurately detect hit command words while occupying few resources, and can therefore be applied to various scenarios with limited device resources, expanding the application scenarios of the scheme, such as offline, resource-limited scenarios like smart speakers and smart home appliances.
An example of how command word detection of voice data is achieved through two-level verification is described here; please refer to fig. 7, which is a schematic diagram of a data processing method according to an embodiment of the present application. As shown in fig. 7, the flow of the whole data processing method may be abstracted into first-level verification and second-level verification. For the first-level verification, voice data may be input (701 in fig. 7); specifically, the audio features of the voice data in the target time window corresponding to the current voice frame are determined (702 in fig. 7), the second confidence level of each command word is determined based on the trained primary detection network (703 and 704 in fig. 7), and threshold judgment is then performed to determine whether the target time window hits a target command word. If the target time window hits a target command word, the second-level verification is entered; specifically, the voice data of the verification time window is determined (705 in fig. 7), and the audio features of the voice data in the verification time window associated with the current voice frame are acquired, where it can be understood that the audio features in the verification time window may be acquired from the cached audio features of each voice frame. The audio features corresponding to the verification time window are then input into the trained secondary detection network, in which the first confidence level of each command word is obtained based on the voice data of the verification time window (707 in fig. 7), and the first confidence levels of the command words are spliced with the associated features determined based on the voice data of the verification time window (706 in fig. 7), so as to determine the third confidence level of each command word (708 in fig. 7) and thus the final result command word (709 in fig. 7). By performing secondary verification on the voice data and adding more feature information during the secondary verification, the accuracy of command word detection can be improved.
The embodiment of the application provides a data processing scheme that realizes command word detection based on primary detection (first-level verification) and secondary detection (second-level verification). For example, according to the audio features corresponding to the voice data of the K voice frames in the target time window, it may be determined whether the voice data of the target time window hits a command word in the command word set; when the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined, the first confidence levels of the voice data in the verification time window corresponding to each command word in the command word set are determined, and the associated features of the voice data of the verification time window are determined, so that the result command word hit by the voice data of the verification time window in the command word set is determined based on the first confidence level corresponding to each command word and the associated features. Optionally, after the result command word is determined, the operation indicated by the result command word may also be performed. In this way, after a command word is preliminarily determined through the primary detection, that is, after the command word hit by the voice data is preliminarily determined based on the target time window, the secondary detection is performed: a new verification time window is determined to decide whether the voice data contains a command word, and the associated features are added during the secondary verification, so that whether the voice data of the verification time window hits a command word is determined based on more information, and the accuracy of command word detection on voice data can be improved.
Referring to fig. 8, fig. 8 is a flowchart of another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S801, determining a target time window corresponding to the current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window.
S802, determining whether the voice data of the target time window hits a command word in the command word set according to the audio features respectively corresponding to the voice data of the K voice frames.
S803, when the voice data of the target time window hits a command word in the command word set, determining a verification time window associated with the current voice frame.
S804, according to the audio features respectively corresponding to the voice data of the voice frames in the verification time window, determining the first confidence coefficient respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the associated features corresponding to the verification time window based on the voice data in the verification time window.
The relevant descriptions of steps S801 to S804 may refer to steps S201 to S204, and will not be described here.
S805, performing splicing processing based on the second confidence level corresponding to each command word, the first confidence level corresponding to each command word and the associated features, to obtain the verification feature.
The second confidence level may be a confidence level determined based on the voice data of the target time window. As described above, the verification feature is a feature obtained by splicing the first confidence levels corresponding to the command words with other information features, where the other information features may be the associated features and the second confidence levels corresponding to the command words.
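A minimal sketch of assembling the verification feature of step S805 by concatenation is shown below; all dimensions are illustrative assumptions.

```python
import torch

# The second confidences from the primary detection, the first confidences from
# the secondary detection (command words plus garbage) and the associated
# features are concatenated before the confidence calibration network.
second_conf = torch.rand(1, 4)       # second confidence per command word
first_conf = torch.rand(1, 5)        # first confidence per command word + garbage
assoc_features = torch.rand(1, 8)    # associated features of the verification time window
verification_feature = torch.cat([second_conf, first_conf, assoc_features], dim=-1)
print(verification_feature.shape)    # torch.Size([1, 17])
```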
S806, determining third confidence of voice data of the verification time window corresponding to each command word based on the verification features.
The determining of the third confidence coefficient based on the verification feature may refer to the above description of determining the third confidence coefficient corresponding to each command word and the voice data of the verification time window based on the correlation feature and the first confidence coefficient corresponding to each command word, which is not described herein.
S807, if there is a command word in the command word set whose third confidence level is greater than or equal to the second threshold, determining the command word whose third confidence level is greater than or equal to the second threshold and is the largest as the result command word hit by the voice data of the verification time window in the command word set, and executing the operation indicated by the result command word.
Step S807 may refer to the related description of step S506, which is not described herein.
Another example of how command word detection of voice data is achieved through two-level verification is described here; please refer to fig. 9, which is a schematic diagram of another data processing method according to an embodiment of the present application. As shown in fig. 9, the flow of the whole data processing method may be abstracted into first-level verification and second-level verification. For the first-level verification, voice data may be input (901 in fig. 9); specifically, the audio features of the voice data in the target time window corresponding to the current voice frame are determined (902 in fig. 9), the second confidence level of each command word is determined based on the trained primary detection network (903 and 904 in fig. 9), and threshold judgment is then performed to determine whether the target time window hits a target command word. If the target time window hits a target command word, the second-level verification is entered; specifically, the voice data of the verification time window is determined (905 in fig. 9), and the audio features of the voice data in the verification time window associated with the current voice frame are acquired, where it can be understood that the audio features in the verification time window may be acquired from the cached audio features of each voice frame. The audio features corresponding to the verification time window are then input into the trained secondary detection network, in which the first confidence level of each command word is obtained based on the voice data of the verification time window (907 in fig. 9), and the first confidence levels of the command words, the associated features determined based on the voice data of the verification time window (906 in fig. 9) and the second confidence levels of the command words (908 in fig. 9) are spliced, so as to determine the third confidence level of each command word (909 in fig. 9) and thus the final result command word (910 in fig. 9). By adding more feature information during the secondary verification of the voice data, the accuracy of command word detection is improved.
In one possible implementation manner, the splicing processing may also be performed based only on the second confidence level corresponding to each command word and the first confidence level corresponding to each command word to obtain the verification feature, and the third confidence level of the voice data of the verification time window corresponding to each command word is then determined based on the verification feature. It can be understood that, when determining the verification feature, one or both of the second confidence levels corresponding to the command words and the associated features can be spliced with the first confidence levels corresponding to the command words; the verification feature can also be obtained by splicing the first confidence levels corresponding to the command words with other associated features, so that the result command word is determined by introducing more feature information of the voice data, and the accuracy of command word detection is greatly improved.
Optionally, after determining the result command word, the operation indicated by the result command word may be performed.
The embodiment of the application provides a data processing scheme that realizes command word detection based on primary detection (first-level verification) and secondary detection (second-level verification). For example, according to the audio features corresponding to the voice data of the K voice frames in the target time window, it may be determined whether the voice data of the target time window hits a command word in the command word set; when the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined, the first confidence levels of the voice data in the verification time window corresponding to each command word in the command word set are determined, and the associated features corresponding to the verification time window are determined, so that the result command word hit by the voice data of the verification time window in the command word set is determined based on the first confidence level corresponding to each command word and the associated features. Optionally, after the result command word is determined, the operation indicated by the result command word may also be performed. In this way, after a command word is preliminarily determined through the primary detection, that is, after the command word hit by the voice data is preliminarily determined based on the target time window, the secondary detection is performed: a new verification time window is determined to decide whether the voice data contains a command word, and the associated features are added during the secondary verification, so that whether the voice data of the verification time window hits a command word is determined based on more information, and the accuracy of command word detection on voice data can be improved.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. Optionally, the data processing apparatus may be provided in the electronic device described above. As shown in fig. 10, the data processing apparatus described in the present embodiment may include:
an obtaining unit 1001, configured to determine a target time window corresponding to a current speech frame, and obtain audio features corresponding to speech data of K speech frames in the target time window, where K is a positive integer;
a processing unit 1002, configured to determine, according to audio features respectively corresponding to the voice data of the K voice frames, whether the voice data of the target time window hits a command word in a command word set, where the command word set includes at least one command word;
the processing unit 1002 is further configured to determine a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the command word set;
the processing unit 1002 is further configured to determine, according to audio features corresponding to the voice data of the plurality of voice frames in the verification time window, a first confidence level that the voice data in the verification time window corresponds to each command word in the command word set, and determine, based on the voice data in the verification time window, an associated feature corresponding to the verification time window;
The processing unit 1002 is further configured to determine a result command word that the voice data of the verification time window hits in the command word set based on the first confidence coefficient corresponding to each command word and the association feature.
In one implementation, the processing unit 1002 is specifically configured to:
determining a second confidence coefficient of the voice data of the target time window corresponding to each command word in the command word set according to the audio characteristics respectively corresponding to the voice data of the K voice frames;
if the command words with the second confidence coefficient being greater than or equal to the first threshold value exist in the command word set, determining that the voice data of the target time window hits the command words in the command word set;
and if the command words with the second confidence coefficient being greater than or equal to the first threshold value do not exist in the command word set, determining that the voice data of the target time window does not hit the command words in the command word set.
In one implementation, the processing unit 1002 is specifically configured to:
performing splicing processing based on the second confidence coefficient corresponding to each command word, the first confidence coefficient corresponding to each command word and the associated feature to obtain a verification feature;
Determining a third confidence level of the voice data of the verification time window corresponding to each command word based on the verification features;
and if the command words with the third confidence coefficient being greater than or equal to the second threshold value exist in the command word set, determining the command words with the third confidence coefficient being greater than or equal to the second threshold value and the third confidence coefficient being the largest as the command words with the result that the voice data of the verification time window hit in the command word set.
In one implementation, the processing unit 1002 is specifically configured to:
determining a third confidence coefficient of the voice data of the verification time window corresponding to each command word based on the first confidence coefficient corresponding to each command word and the associated feature;
and if the command words with the third confidence coefficient being greater than or equal to the second threshold value exist in the command word set, determining the command words with the third confidence coefficient being greater than or equal to the second threshold value and the third confidence coefficient being the largest as the command words with the result that the voice data of the verification time window hit in the command word set.
In one implementation, said each command word in said set of command words has a plurality of syllables; the processing unit 1002 is specifically configured to:
Acquiring a syllable output unit set, wherein the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set;
determining syllable output units corresponding to syllables of command words hit by the voice data of the target time window as verification syllable output units, and determining a voice frame with the highest probability corresponding to the verification syllable output units in the K voice frames as a target voice frame;
and determining a verification time window associated with the current voice frame according to the voice frame between the target voice frame and the current voice frame.
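For illustration only, the selection of the verification time window described above might look like the following sketch. The aggregation of the probabilities of the verification syllable output units by a per-frame maximum is an assumption of this sketch, as are the function name and the array layout.

```python
import numpy as np

def verification_window(frame_probs, hit_word_syllable_units, current_idx):
    """Determine the verification time window associated with the current frame.

    frame_probs: (K, U) probabilities of the K frames in the target time window
                 over the U syllable output units (one unit per distinct syllable).
    hit_word_syllable_units: indices of the verification syllable output units,
                 i.e. the units corresponding to the syllables of the hit command word.
    current_idx: index of the current voice frame within the K frames.

    Returns (start, end) frame indices, i.e. the frames between the target
    voice frame and the current voice frame.
    """
    # Probability of the verification syllable output units in every frame.
    verify_probs = frame_probs[:, hit_word_syllable_units].max(axis=1)

    # Target voice frame: the frame with the highest probability for these units.
    target_idx = int(np.argmax(verify_probs[:current_idx + 1]))

    return target_idx, current_idx

# Hypothetical example: 5 frames, 4 syllable output units, hit word uses units 0 and 2.
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.1, 0.6, 0.1],
                  [0.1, 0.2, 0.2, 0.5],
                  [0.3, 0.3, 0.2, 0.2]])
print(verification_window(probs, [0, 2], current_idx=4))   # (1, 4)
```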
In one implementation, the association features include at least one of: the first average energy of the voice data in the verification time window, the effective voice duty cycle of the voice data in the verification time window, the signal to noise ratio of the voice data in the verification time window, and the number of voice frames in the verification time window.
In one implementation, the processing unit 1002 is further configured to:
determining a first average energy of the speech data of the verification time window based on the energy of the speech data of each speech frame in the verification time window;
determining an effective voice duty ratio of voice data of the verification time window according to the number of effective voice frames in the verification time window, wherein the effective voice frames are voice frames with energy larger than or equal to the first average energy;
and determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy of the effective voice frames in the verification time window and the first average energy.
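The associated features listed above can be illustrated with a short sketch. The per-frame energies are assumed to be available as a one-dimensional array, and the decibel form of the signal-to-noise ratio is an assumption of this sketch; the embodiment only requires that the signal-to-noise ratio be determined from the second average energy of the effective voice frames and the first average energy.

```python
import numpy as np

def associated_features(frame_energies):
    """Compute the associated features of a verification time window.

    frame_energies: 1-D array with the energy of the voice data of each frame
    in the verification time window. Returns the first average energy, the
    effective voice duty cycle, the signal-to-noise ratio and the frame count.
    """
    frame_energies = np.asarray(frame_energies, dtype=np.float64)
    num_frames = frame_energies.size

    # First average energy over all frames of the verification window.
    first_avg_energy = frame_energies.mean()

    # Effective voice frames: frames whose energy is >= the first average energy.
    effective = frame_energies >= first_avg_energy
    effective_ratio = effective.sum() / num_frames

    # Second average energy over the effective frames, used for the SNR
    # (expressed here in dB as an illustrative choice).
    second_avg_energy = frame_energies[effective].mean()
    snr = 10.0 * np.log10(second_avg_energy / max(first_avg_energy, 1e-12))

    return np.array([first_avg_energy, effective_ratio, snr, num_frames])

# Hypothetical example with six frame energies.
print(associated_features([0.2, 0.9, 1.1, 0.1, 0.8, 0.3]))
```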
Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device described in the present embodiment includes: processor 1101, memory 1102. Optionally, the electronic device may further include a network interface or a power module. Data may be exchanged between the processor 1101 and the memory 1102.
The processor 1101 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The network interface may include input devices, such as a control panel, microphone, receiver, etc., and/or output devices, such as a display screen, transmitter, etc., which are not shown.
The memory 1102 may include read-only memory and random access memory, and provides program instructions and data to the processor 1101. A portion of memory 1102 may also include non-volatile random access memory. Wherein the processor 1101, when calling the program instructions, is configured to perform:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
determining a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the set of command words;
according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the verification time window, determining a first confidence level respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the associated features corresponding to the verification time window based on the voice data in the verification time window;
And determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence coefficient corresponding to each command word and the association feature.
In one implementation, the processor 1101 is specifically configured to:
determining a second confidence coefficient of the voice data of the target time window corresponding to each command word in the command word set according to the audio characteristics respectively corresponding to the voice data of the K voice frames;
if the command words with the second confidence coefficient being greater than or equal to the first threshold value exist in the command word set, determining that the voice data of the target time window hits the command words in the command word set;
and if the command words with the second confidence coefficient being greater than or equal to the first threshold value do not exist in the command word set, determining that the voice data of the target time window does not hit the command words in the command word set.
In one implementation, the processor 1101 is specifically configured to:
performing splicing processing based on the second confidence coefficient corresponding to each command word, the first confidence coefficient corresponding to each command word and the associated feature to obtain a verification feature;
determining a third confidence level of the voice data of the verification time window corresponding to each command word based on the verification features;
And if the command words with the third confidence coefficient being greater than or equal to the second threshold value exist in the command word set, determining the command words with the third confidence coefficient being greater than or equal to the second threshold value and the third confidence coefficient being the largest as the command words with the result that the voice data of the verification time window hit in the command word set.
In one implementation, the processor 1101 is specifically configured to:
determining a third confidence coefficient of the voice data of the verification time window corresponding to each command word based on the first confidence coefficient corresponding to each command word and the associated feature;
and if the command words with the third confidence coefficient being greater than or equal to the second threshold value exist in the command word set, determining the command words with the third confidence coefficient being greater than or equal to the second threshold value and the third confidence coefficient being the largest as the command words with the result that the voice data of the verification time window hit in the command word set.
In one implementation, said each command word in said set of command words has a plurality of syllables; the processor 1101 is specifically configured to:
acquiring a syllable output unit set, wherein the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
According to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set;
determining syllable output units corresponding to syllables of command words hit by the voice data of the target time window as verification syllable output units, and determining a voice frame with the highest probability corresponding to the verification syllable output units in the K voice frames as a target voice frame;
and determining a verification time window associated with the current voice frame according to the voice frame between the target voice frame and the current voice frame.
In one implementation, the association features include at least one of: the first average energy of the voice data in the verification time window, the effective voice duty cycle of the voice data in the verification time window, the signal to noise ratio of the voice data in the verification time window, and the number of voice frames in the verification time window.
In one implementation, the processor 1101 is further configured to:
determining a first average energy of the speech data of the verification time window based on the energy of the speech data of each speech frame in the verification time window;
Determining an effective voice duty ratio of voice data of the verification time window according to the number of effective voice frames in the verification time window, wherein the effective voice frames are voice frames with energy larger than or equal to the first average energy;
and determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy of the effective voice frames in the verification time window and the first average energy.
Optionally, the program instructions, when executed by the processor, may further implement other steps of the method in the above embodiments, which are not described herein again.
The present application also provides a computer readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the above method, such as the method performed by the above electronic device, which is not described herein in detail.
Optionally, the storage medium, such as the computer-readable storage medium, involved in the present application may be non-volatile or volatile.
Optionally, the computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created from the use of blockchain nodes, and the like. The blockchain referred to in the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralised database: a chain of data blocks generated in association by cryptographic means, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations; however, those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions which, when executed by a processor, implement some or all of the steps of the above-described method. For example, the computer instructions are stored in a computer-readable storage medium. A processor of a computer device (i.e., the electronic device described above) reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps performed in the above method embodiments. For example, the computer device may be a terminal or a server.

The foregoing has described in detail a data processing method, apparatus, electronic device, program product and medium provided in the embodiments of the present application, and specific examples have been applied herein to illustrate the principles and embodiments of the present application. The above examples are only used to help understand the method and core idea of the present application; meanwhile, those skilled in the art may make modifications to the specific embodiments and the application scope in accordance with the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.
Claims (9)
1. A method of data processing, the method comprising:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
determining a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the set of command words;
according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the verification time window, determining a first confidence level respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the associated features corresponding to the verification time window based on the voice data in the verification time window;
performing splicing processing based on the first confidence coefficient corresponding to each command word and the associated feature to obtain a verification feature, and determining a third confidence coefficient corresponding to the voice data of the verification time window and each command word based on the verification feature;
And if the command words with the third confidence coefficient being greater than or equal to the second threshold value exist in the command word set, determining the command words with the third confidence coefficient being greater than or equal to the second threshold value and the third confidence coefficient being the largest as the command words with the result that the voice data of the verification time window hit in the command word set.
2. The method according to claim 1, wherein determining whether the voice data of the target time window hits a command word in the command word set according to the audio features respectively corresponding to the voice data of the K voice frames comprises:
determining a second confidence coefficient of the voice data of the target time window corresponding to each command word in the command word set according to the audio characteristics respectively corresponding to the voice data of the K voice frames;
if the command words with the second confidence coefficient being greater than or equal to the first threshold value exist in the command word set, determining that the voice data of the target time window hits the command words in the command word set;
and if the command words with the second confidence coefficient being greater than or equal to the first threshold value do not exist in the command word set, determining that the voice data of the target time window does not hit the command words in the command word set.
3. The method of claim 2, wherein the performing a stitching process based on the first confidence level corresponding to each command word and the associated feature to obtain a verification feature includes:
and performing splicing processing based on the second confidence coefficient corresponding to each command word, the first confidence coefficient corresponding to each command word and the associated feature to obtain the verification feature.
4. The method of claim 1, wherein said each command word in said set of command words has a plurality of syllables; the determining a verification time window associated with the current speech frame includes:
acquiring a syllable output unit set, wherein the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set;
determining syllable output units corresponding to syllables of command words hit by the voice data of the target time window as verification syllable output units, and determining a voice frame with the highest probability corresponding to the verification syllable output units in the K voice frames as a target voice frame;
And determining a verification time window associated with the current voice frame according to the voice frame between the target voice frame and the current voice frame.
5. The method of claim 1, wherein the associated features comprise at least one of: the first average energy of the voice data in the verification time window, the effective voice duty cycle of the voice data in the verification time window, the signal to noise ratio of the voice data in the verification time window, and the number of voice frames in the verification time window.
6. The method of claim 5, wherein the method further comprises:
determining a first average energy of the speech data of the verification time window based on the energy of the speech data of each speech frame in the verification time window;
determining an effective voice duty ratio of voice data of the verification time window according to the number of effective voice frames in the verification time window, wherein the effective voice frames are voice frames with energy larger than or equal to the first average energy;
and determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy of the effective voice frames in the verification time window and the first average energy.
7. A data processing apparatus, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for determining a target time window corresponding to a current voice frame and acquiring audio characteristics respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
the processing unit is used for determining whether the voice data of the target time window hit command words in a command word set according to the audio features respectively corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
the processing unit is further configured to determine a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the command word set;
the processing unit is further configured to determine, according to audio features respectively corresponding to the voice data of the plurality of voice frames in the verification time window, a first confidence level of the voice data in the verification time window and each command word in the command word set, and determine, based on the voice data of the plurality of voice frames in the verification time window, associated features corresponding to the verification time window;
the processing unit is further configured to perform a stitching process based on the first confidence coefficient corresponding to each command word and the associated feature, so as to obtain a verification feature, and determine a third confidence coefficient corresponding to each command word and the voice data of the verification time window based on the verification feature; and if the command words with the third confidence coefficient being greater than or equal to the second threshold value exist in the command word set, determining the command words with the third confidence coefficient being greater than or equal to the second threshold value and the third confidence coefficient being the largest as the command words with the result that the voice data of the verification time window hit in the command word set.
8. An electronic device comprising a processor, a memory, wherein the memory is for storing a computer program, the computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-6.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210597334.6A CN115132197B (en) | 2022-05-27 | 2022-05-27 | Data processing method, device, electronic equipment, program product and medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115132197A CN115132197A (en) | 2022-09-30 |
| CN115132197B true CN115132197B (en) | 2024-04-09 |
Family
ID=83378657
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210597334.6A Active CN115132197B (en) | 2022-05-27 | 2022-05-27 | Data processing method, device, electronic equipment, program product and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115132197B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115132198B (en) * | 2022-05-27 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Data processing method, device, electronic equipment, program product and medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR100655491B1 (en) * | 2004-12-21 | 2006-12-11 | 한국전자통신연구원 | Method and device for verifying two-stage speech in speech recognition system |
| US11443734B2 (en) * | 2019-08-26 | 2022-09-13 | Nice Ltd. | System and method for combining phonetic and automatic speech recognition search |
| US11580959B2 (en) * | 2020-09-28 | 2023-02-14 | International Business Machines Corporation | Improving speech recognition transcriptions |
Patent Citations (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6697782B1 (en) * | 1999-01-18 | 2004-02-24 | Nokia Mobile Phones, Ltd. | Method in the recognition of speech and a wireless communication device to be controlled by speech |
| CN103065631A (en) * | 2013-01-24 | 2013-04-24 | 华为终端有限公司 | Voice identification method and device |
| US11024291B2 (en) * | 2018-11-21 | 2021-06-01 | Sri International | Real-time class recognition for an audio stream |
| CN110534099A (en) * | 2019-09-03 | 2019-12-03 | 腾讯科技(深圳)有限公司 | Voice wakes up processing method, device, storage medium and electronic equipment |
| CN110570840A (en) * | 2019-09-12 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Intelligent device awakening method and device based on artificial intelligence |
| CN110706691A (en) * | 2019-10-12 | 2020-01-17 | 出门问问信息科技有限公司 | Voice verification method and device, electronic equipment and computer readable storage medium |
| CN110718212A (en) * | 2019-10-12 | 2020-01-21 | 出门问问信息科技有限公司 | Voice wake-up method, device and system, terminal and computer readable storage medium |
| CN110890093A (en) * | 2019-11-22 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Intelligent device awakening method and device based on artificial intelligence |
| CN111128182A (en) * | 2019-12-16 | 2020-05-08 | 中国银行股份有限公司 | Intelligent voice recording method and device |
| CN111369980A (en) * | 2020-02-27 | 2020-07-03 | 网易有道信息技术(北京)有限公司江苏分公司 | Voice detection method and device, electronic equipment and storage medium |
| CN111739521A (en) * | 2020-06-19 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Electronic equipment awakening method and device, electronic equipment and storage medium |
| CN111933112A (en) * | 2020-09-21 | 2020-11-13 | 北京声智科技有限公司 | Awakening voice determination method, device, equipment and medium |
| CN112530424A (en) * | 2020-11-23 | 2021-03-19 | 北京小米移动软件有限公司 | Voice processing method and device, electronic equipment and storage medium |
| CN112599127A (en) * | 2020-12-04 | 2021-04-02 | 腾讯科技(深圳)有限公司 | Voice instruction processing method, device, equipment and storage medium |
| CN113314099A (en) * | 2021-07-28 | 2021-08-27 | 北京世纪好未来教育科技有限公司 | Method and device for determining confidence coefficient of speech recognition |
Non-Patent Citations (1)
| Title |
|---|
| Research on speaker-independent command word recognition algorithm based on Chinese speech phonemes; Zhang Qiuyu; Zhao Yanmin; Li Jianhai; Science Technology and Engineering; 2008-04-15 (08); full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115132197A (en) | 2022-09-30 |
Similar Documents
| Publication | Title |
|---|---|
| CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment |
| CN107316643B (en) | Voice interaction method and device |
| US10332507B2 (en) | Method and device for waking up via speech based on artificial intelligence |
| US10861480B2 (en) | Method and device for generating far-field speech data, computer device and computer readable storage medium |
| CN112599127B (en) | Voice instruction processing method, device, equipment and storage medium |
| CN110660201B (en) | Arrival reminder method, device, terminal and storage medium |
| CN112259101B (en) | Voice keyword recognition method and device, computer equipment and storage medium |
| US20060053009A1 (en) | Distributed speech recognition system and method |
| CN113628612A (en) | Voice recognition method and device, electronic equipment and computer readable storage medium |
| CN112233651B (en) | Dialect type determining method, device, equipment and storage medium |
| CN110097870A (en) | Method of speech processing, device, equipment and storage medium |
| CN110428854B (en) | Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment |
| CN109036471B (en) | Voice endpoint detection method and device |
| CN114333790B (en) | Data processing method, device, equipment, storage medium and program product |
| CN111696580A (en) | Voice detection method and device, electronic equipment and storage medium |
| CN112037772A (en) | Multi-mode-based response obligation detection method, system and device |
| CN115132170A (en) | Language classification method, device and computer-readable storage medium |
| CN113889091A (en) | Voice recognition method and device, computer readable storage medium and electronic equipment |
| CN112735381B (en) | Model updating method and device |
| CN112185425A (en) | Audio signal processing method, device, equipment and storage medium |
| CN115132197B (en) | Data processing method, device, electronic equipment, program product and medium |
| CN115394318B (en) | AUDIO DETECTION METHOD AND DEVICE - Japan Patent Office |
| CN112133291B (en) | Language identification model training and language identification method and related device |
| CN111009261A (en) | Arrival reminding method, device, terminal and storage medium |
| CN115148211A (en) | Audio-sensitive content detection method, computer equipment and computer program product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |