CN115101063B - Low-computation-power voice recognition method, device, equipment and medium
- Publication number
- CN115101063B (application CN202211014435.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- window
- speech
- windows
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The low-computation-power speech recognition method obtains a plurality of speech signals by framing and windowing the input speech, combines those signals into speech windows, and feeds the windows into a trained deep neural network wake-up model, which outputs for each speech window a column of posterior probabilities over phoneme label types. The posterior probabilities of all speech windows are combined into a decoding matrix, and the score of a specified keyword (the wake-up word) is calculated over the decoding matrix to judge whether the input speech contains the wake-up word.
Description
Technical Field
The present application relates to the field of speech recognition and voice interaction, and in particular to a low-computation-power speech recognition method, apparatus, device, and medium.
Background
With the popularization of computer technology, daily life has gradually entered the intelligent era. Beyond computers, mobile phones, and tablets, emerging intelligent technologies such as smart televisions, intelligent navigation, and smart homes now reach into clothing, food, housing, and travel. More and more intelligent devices include a voice interaction control system: instead of operating a device by hand or with a remote control, a user issues a specific voice wake-up instruction that switches the device into voice control mode, after which the device can be operated by voice.
Voice wake-up usually relies on convolution to recognize the wake-up word, so most voice wake-up algorithms contain convolutional computation, and accelerating convolution depends on parallelism. However, some AIoT devices (artificial intelligence Internet of Things devices) use chips whose architecture does not support parallel execution and therefore cannot perform convolution in real time, which makes deploying voice wake-up algorithms on such chips a challenge.
Disclosure of Invention
The present application provides a low-computation-power voice recognition method, apparatus, device, and medium, aiming to solve the prior-art problem that voice wake-up algorithms cannot be deployed on chips used by some artificial intelligence Internet of Things devices because the chip architecture does not support parallelism.
In order to solve the above technical problem, in a first aspect, the present application provides a low-computation speech recognition method, including:
performing frame windowing on input voice to obtain a plurality of voice signals;
combining the plurality of voice signals to form a plurality of voice windows;
inputting the plurality of voice windows into a trained deep neural network wake-up model for processing, to obtain, for each voice window, a column of posterior probabilities over phoneme label types;
combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, judging that the decoding matrix contains the specified keyword.
Preferably, the step of performing frame-division windowing on the input speech to obtain a plurality of speech signals includes:
acquiring input voice;
and performing windowing and framing on the voice with a speech window of preset window length at a preset time interval, to obtain a plurality of voice signals.
Preferably, the step of combining the plurality of speech signals to form a plurality of speech windows comprises:
and combining the plurality of voice signals in sequence according to a preset window step length by taking the first voice signal as an initial voice signal and the preset window length as a window length to form a plurality of voice windows.
Preferably, the deep neural network wake-up model includes a feature input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer includes at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function.
Preferably, the step of inputting the plurality of voice windows into a trained deep neural network wake-up model for processing to obtain, for each voice window, a column of posterior probabilities over phoneme label types includes:
inputting the plurality of voice windows into the trained deep neural network wake-up model in a preset order for processing, wherein the processing includes the following steps:
taking the voice window currently being processed as the current voice window;
acquiring a preset number of voice windows closest to the current voice window as historical voice windows;
splicing the current voice window and the historical voice windows at the fully-connected layer to generate a spliced voice window;
multiplying the spliced voice window by a preset weight matrix, and summing in the window direction of the spliced voice window to obtain a summation result;
and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current voice window.
Preferably, the training step of the deep neural network wake-up model includes:
training an initial deep-neural-network model under a temporal classification criterion using a general corpus until the network converges, to obtain a preliminary convergence model based on the deep neural network;
and training the preliminary convergence model a second time using the general corpus and a specific wake-up word corpus until it converges a second time, to obtain the deep neural network wake-up model.
Preferably, the step of calculating the score of the specified keyword in the decoding matrix includes:
acquiring the length of the decoding matrix;
and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix according to a preset step length.
In a second aspect, the present application further provides a low-computation speech recognition apparatus, comprising:
the frame windowing module is used for performing frame windowing processing on input voice to obtain a plurality of voice signals;
the voice window generating module is used for combining the voice signals to form a plurality of voice windows;
the posterior probability calculation module is used for inputting the plurality of voice windows into a trained deep neural network wake-up model for processing, to obtain, for each voice window, a column of posterior probabilities over phoneme label types;
a decoding matrix generating module, configured to combine the posterior probabilities of the multiple speech windows to form a decoding matrix;
and the judging module is used for calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold value, judging that the decoding matrix contains the specified keyword.
In a third aspect, the present application further provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the low-computation speech recognition method according to any one of the above items when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the low-computation speech recognition method of any of the above.
In the low-computation-power speech recognition method of the present application, the input speech is framed and windowed to obtain a plurality of speech signals, the speech signals are combined into a plurality of speech windows, and the speech windows are input into a trained deep neural network wake-up model for processing, yielding for each speech window a column of posterior probabilities over phoneme label types. The posterior probabilities of the speech windows are combined into a decoding matrix, and the score of the specified keyword (the wake-up word) is calculated over the decoding matrix to judge whether the input speech contains the wake-up word.
Drawings
FIG. 1 is a flow diagram of a low-computation speech recognition method according to an embodiment;
FIG. 2 is a schematic diagram of a low computational power speech recognition apparatus according to an embodiment;
FIG. 3 is a block diagram illustrating a computer device according to an embodiment.
The implementation, functional features and advantages of the object of the present application will be further explained with reference to the embodiments, and with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, units, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, units, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, a low-computation-power speech recognition method provided in an embodiment of the present application includes:
S1: performing framing and windowing on input voice to obtain a plurality of voice signals;
S2: combining the plurality of voice signals to form a plurality of voice windows;
S3: inputting the plurality of voice windows into a trained deep neural network wake-up model for processing, to obtain, for each voice window, a column of posterior probabilities over phoneme label types;
S4: combining the posterior probabilities of the plurality of voice windows to form a decoding matrix;
S5: calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, judging that the decoding matrix contains the specified keyword.
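Before each step is discussed in detail, the sketch below ties S1 through S5 together in Python. It is a minimal illustration only: the helper names (frame_signal, group_windows, keyword_score) and the single-argument model interface are assumptions for readability, not names taken from the patent, and the per-step sketches later in this description fill in plausible bodies for them.

```python
import numpy as np

def detect_wake_word(wave, wake_model, keyword_phone_ids, threshold):
    """High-level sketch of steps S1-S5; all helper names are illustrative."""
    frames = frame_signal(wave)                    # S1: framing + windowing
    windows = group_windows(frames)                # S2: combine into speech windows
    posteriors = [wake_model(w) for w in windows]  # S3: per-window phoneme posteriors
    decode_matrix = np.stack(posteriors)           # S4: decoding matrix
    score = keyword_score(decode_matrix, keyword_phone_ids)  # S5: keyword score
    return score > threshold                       # wake only above the preset threshold
```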
As described in step S1, before feature extraction is performed on the speech input to the chip, the speech must be windowed and framed. Specifically, a frame (window) length of 25 milliseconds with a frame shift of 10 milliseconds between adjacent frames may be preset, so that a speech signal is produced every 10 milliseconds. Further, because phoneme durations differ between languages, the frame shift may be set according to the language of the input speech so that it covers the phoneme duration of the pronunciation units;
As described in step S2, the input speech has already been divided into a plurality of speech signals in step S1, and these signals are combined to form a plurality of speech windows. Both the window length and the moving step of a speech window can be chosen freely; the moving step may be half the window length, in some embodiments one third of the window length or another ratio, but it is generally smaller than the window length. This embodiment does not limit the moving step, the window length, or the length of a speech signal. Specifically, windowing proceeds along the time frames of the speech signal, starting from the first frame obtained by division: if each speech signal is 10 milliseconds, the window length is 30 milliseconds, and the window step is 10 milliseconds, then the first speech window contains the first, second and third speech signals, and after the window moves right by 10 milliseconds, the second speech window contains the second, third and fourth speech signals, as the sketch below illustrates;
specifically, a voice window formed after performing windowing and framing operation on a voice recorded on line is an extracted voice feature, and the voice feature may be a filter bank (hereinafter, referred to as fbank) feature or other voice features, such as a Mel Frequency Cepstrum Coefficient (MFCC) feature, which is not limited in this embodiment, and in the embodiment of the present invention, the voice feature is an fbank feature, and 40-dimensional fbank is used, but the dimension of fbank may be set according to an actual usage scenario;
as described in the step S3, after the step S2, a plurality of speech windows including speech features can be obtained, all the speech windows are sequentially input into the trained deep neural network wake-up model, the deep neural network wake-up model can correspond to a phoneme label type preset in the deep neural network wake-up model according to phonemes in the speech windows, after weighting and averaging of the deep neural network wake-up model, a posterior probability of a phoneme label type corresponding to each speech window can be calculated, and a posterior probability of a list of phoneme label types can be obtained after a speech window is calculated by the deep neural network wake-up model; the posterior probability of the phoneme mark types in a column can be obtained after the processing of the deep neural network awakening model;
as described in step S4 above, after step S3, a plurality of a-row posterior probabilities of the phoneme symbol types are obtained, and the posterior probabilities of the phoneme symbol types in a row are combined to form a decoding matrix;
as described in step S5, after step S4, the score of the specified keyword may be calculated in the decoding matrix, and if the score of the specified keyword in the decoding matrix exceeds a certain threshold, it is determined that the decoding matrix contains the specified keyword, that is, it indicates that the speech contains the wakeup word;
therefore, in the scheme, a plurality of speech windows are formed after the input speech is subjected to framing and windowing processing, then the plurality of speech windows are placed into a trained deep neural network wake-up model for processing, the posterior probabilities of a column of phoneme mark types corresponding to each speech window are obtained, then the posterior probabilities of the plurality of speech windows are combined, a decoding matrix is formed, and then scores of specified keywords (wake-up words) in the decoding matrix are calculated to judge whether the input speech has the wake-up words.
In an embodiment, the step S1 of performing framing and windowing on the input speech to obtain a plurality of speech signals includes:
acquiring input voice;
and performing windowing and framing on the voice with a speech window of preset window length at a preset time interval, to obtain a plurality of speech signals.
As described above, when performing feature extraction on speech input to the chip, the speech must first be windowed and framed. Specifically, a window length of 25 milliseconds with a frame shift of 10 milliseconds between adjacent frames may be preset, so that a frame is taken every 10 milliseconds and each such frame becomes one speech signal.
In one embodiment, the step S2 of combining the plurality of speech signals to form a plurality of speech windows includes:
and combining the plurality of voice signals in sequence according to a preset window step length by taking the first voice signal as an initial voice signal and the preset window length as a window length to form a plurality of voice windows.
As described above, after the input speech is framed, the plurality of speech signals are combined to form a plurality of speech windows, each carrying the speech features extracted from its signals. Both the window length and the moving step of a speech window can be chosen freely; the moving step may be half the window length, in some embodiments one third of the window length or another ratio, and is generally smaller than the window length. This embodiment does not limit the moving step, the window length, or the length of a speech signal. Specifically, windowing proceeds along the time frames of the speech signal, starting from the first frame: if each speech signal is 10 ms, the window length is 30 ms, and the window step is 10 ms, then the first speech window contains the first, second and third speech signals, and after the window moves right by 10 ms, the second speech window contains the second, third and fourth speech signals.
In one embodiment, the deep neural network wake-up model comprises a feature input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer comprises at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function.
As described above, the deep neural network wake-up model includes the feature input layer, the hidden layer, the output layer and the attention layer. The feature input layer receives a speech window containing speech features. The hidden layer is fully connected; it weights and averages the speech features input by the feature input layer, then sends the result toward the output layer. The attention layer lets the model focus on specific parts of the input so it completes its task more effectively, that is, it guides the model to extract the phonemes of specific words in the input speech. There may be one or several hidden layers, each fully-connected (fc) layer being followed by a ReLU nonlinear activation function; the exact number of layers can depend on the computing power of the chip and the definition of the deep neural network (hereinafter DNN) model. Backpropagation through a fully-connected network is effective up to about 5 layers, so in this embodiment 3 hidden layers are preferred; too many layers can cause vanishing gradients and actually reduce accuracy. A sketch of this layer stack follows.
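The sketch below, in PyTorch, shows one plausible reading of this layer stack. The 10 x 40 fbank input and 64-dimensional hidden size are example values taken from later in this description, while n_phones is a placeholder for the number of phoneme label types; the attention over history windows is sketched separately further below.

```python
import torch.nn as nn

n_phones = 64  # placeholder: the number of phoneme label types is not fixed here

# Three fully-connected hidden layers, each followed by ReLU (the preferred depth),
# between the feature input and the output layer over phoneme label types.
hidden_stack = nn.Sequential(
    nn.Flatten(),                    # (B, 10, 40) fbank window -> (B, 400)
    nn.Linear(10 * 40, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_phones),         # output layer; softmax gives the posteriors
)
```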
In one embodiment, the step S3 of inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types includes:
inputting the plurality of voice windows into the trained deep neural network wake-up model in a preset order for processing, wherein the processing includes the following steps:
taking the voice window currently being processed as the current voice window;
acquiring a preset number of voice windows closest to the current voice window as historical voice windows;
splicing the current voice window and the historical voice windows at the fully-connected layer to generate a spliced voice window;
multiplying the spliced voice window by a preset weight matrix, and summing in the window direction of the spliced voice window to obtain a summation result;
and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current voice window.
As mentioned above, the plurality of voice windows are fed into the trained deep neural network wake-up model in a preset order, namely the order in which they were generated along the moving direction of the window: the first voice window is processed first, then the second, and so on up to the Nth. Meanwhile, the server marks the voice window currently being processed by the model as the current voice window, and marks a preset number of voice windows processed before it as historical voice windows. A historical voice window here is the feature produced at an activation (fully-connected) layer just before the phoneme classification probabilities are output, after the window has passed through the neural network. Specifically, if the model is currently processing the fifth voice window, the server may mark a preset number of preceding windows, for example the second, third and fourth, as historical voice windows (here the preset number is 3; the specific number can be set freely and is not limited). When the model processes the fifth voice window, the features of the second, third and fourth windows are spliced with those of the fifth at the fully-connected layer close to the output layer, before the phoneme classification probabilities are output, generating a spliced voice window; the features extracted from the 3 historical windows thus serve as auxiliary information for the current voice window. This enlarges the receptive field and improves accuracy, and because the spliced features are already compressed, the computation and memory footprint are greatly reduced. Specifically, in this embodiment 10 frames are used for one prediction; 10 frames are 100 ms, i.e. 0.1 s, enough to cover the duration of one phoneme. The fbank features of the 10 frames are spliced, and with 40 dimensions per frame the input speech window is a 10 x 40 input feature. The number of frames may be set per language so as to cover the phoneme durations of its pronunciation units, and during streaming processing the step size may be set to stride = 5 to reduce computation while preserving recognition accuracy.
The step is generally half the window length. After the historical and current voice windows are spliced, a weight is learned: the spliced windows are multiplied by a 4 x 64 weight matrix, which is a learnable matrix, and summed in the window direction. The summation result is then input into a fully-connected layer for feature processing, compressing the features to 64 dimensions, from which the column of posterior probabilities over phoneme label types for the current voice window is computed; this reduces the amount of calculation. The sketch below illustrates the splice-weight-sum step.
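A minimal PyTorch sketch of the splice, learned-weight multiplication, and window-direction summation, assuming each window has already been reduced to a 64-dimensional hidden feature by the fully-connected stack; the function name and tensor layout are illustrative.

```python
import torch

def attend_history(cur, history, weight, out_fc):
    """Combine the current window with its 3 history windows.

    cur:     (B, 64)    hidden feature of the current speech window
    history: (B, 3, 64) hidden features of the 3 preceding windows
    weight:  (4, 64)    the learnable 4 x 64 weight matrix
    out_fc:  torch.nn.Linear mapping 64 -> number of phoneme label types
    """
    spliced = torch.cat([history, cur.unsqueeze(1)], dim=1)  # (B, 4, 64) spliced window
    pooled = (spliced * weight).sum(dim=1)                   # sum in the window direction
    return torch.softmax(out_fc(pooled), dim=-1)             # column of posteriors
```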
In one embodiment, the training step of the deep neural network wake-up model includes:
training an initial deep-neural-network model under a temporal classification criterion using a general corpus until the network converges, to obtain a preliminary convergence model based on the deep neural network;
and training the preliminary convergence model a second time using the general corpus and a specific wake-up word corpus until it converges a second time, to obtain the deep neural network wake-up model.
As described above, when training the deep neural network wake-up model, the initial deep-neural-network model is first trained by gradient descent under the connectionist temporal classification (CTC) criterion until the network converges, where convergence means the loss value gradually decreases and levels off. The conditions for stopping training may be, for example, that the word error rate (WER) targeted by the speech recognition training no longer decreases, or that a certain number of epochs (iterations) has been trained. This yields the preliminary convergence model based on the deep neural network. The training corpus is then switched to the general corpus plus the specific wake-up word corpus: within each batch the corpora are combined in a certain proportion, for example 30% general corpus and 70% specific corpus, and the preliminary convergence model is trained until it converges a second time, giving the final deep neural network wake-up model. In this way the model keeps its ability to recognize the general corpus while its recognition of the specific wake-up word improves, and overfitting of the model is also prevented. A sketch of this two-stage procedure follows.
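The sketch below shows the two-stage procedure with PyTorch's CTC loss. The model, data loaders, optimizer, and the 30/70 batch mix are placeholders; the patent fixes only the CTC criterion, gradient descent, and the two convergence stages, not these details.

```python
import torch.nn as nn
import torch.optim as optim

ctc_loss = nn.CTCLoss(blank=0)  # CTC criterion over phoneme label types

def train_stage(model, batches, optimizer):
    for feats, feat_lens, labels, label_lens in batches:
        log_probs = model(feats)  # (T, B, n_phones), log-softmax outputs
        loss = ctc_loss(log_probs, labels, feat_lens, label_lens)
        optimizer.zero_grad()
        loss.backward()           # gradient descent on the CTC loss
        optimizer.step()

# Stage 1: general corpus only, until the network converges (e.g. WER stops
# improving or a fixed epoch budget is reached):
#   train_stage(model, general_batches, optim.SGD(model.parameters(), lr=1e-3))
# Stage 2: mixed batches, e.g. 30% general corpus and 70% wake-word corpus,
# until the model converges a second time:
#   train_stage(model, mixed_batches, optim.SGD(model.parameters(), lr=1e-4))
```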
In one embodiment, the step of calculating the score of the specified keyword in the decoding matrix comprises:
acquiring the length of the decoding matrix;
and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix according to a preset step length.
As described above, the columns of posterior probabilities over phoneme label types predicted for the speech windows are combined to form the decoding matrix. A 2-second stretch of the decoding matrix may be processed at a time, with the processing step set to stride = 100 ms; for each such stretch, the score of the specified keyword is calculated over the decoding matrix, and if the score exceeds a certain threshold, the decoding matrix is deemed to contain the wake-up word (i.e., the specified keyword), as sketched below.
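A sketch of one plausible scoring scheme: the patent prescribes the 2-second stretch and 100 ms stride but not a concrete scoring formula, so the length-normalized best monotonic alignment used here is an illustrative choice, and the frame counts assume one matrix row per 10 ms.

```python
import numpy as np

def keyword_score(decode_mat, phone_ids):
    """Best monotonic alignment of the keyword's phoneme sequence through the
    decoding matrix (rows = speech windows, columns = phoneme label types),
    returned as a length-normalized log score."""
    T, K = decode_mat.shape[0], len(phone_ids)
    logp = np.log(decode_mat + 1e-10)
    dp = np.full((T + 1, K + 1), -np.inf)
    dp[:, 0] = 0.0                      # the keyword may start at any row
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            stay = dp[t - 1, k]         # phone k already matched earlier
            advance = dp[t - 1, k - 1] + logp[t - 1, phone_ids[k - 1]]
            dp[t, k] = max(stay, advance)
    return dp[T, K] / K

def detect(decode_mat, phone_ids, threshold, win=200, stride=10):
    """Slide a 2 s stretch (200 rows at 10 ms) in 100 ms (10-row) steps."""
    for start in range(0, max(1, decode_mat.shape[0] - win + 1), stride):
        if keyword_score(decode_mat[start:start + win], phone_ids) > threshold:
            return True
    return False
```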
Referring to fig. 2, in a second aspect, the present application further provides a low-computation speech recognition apparatus comprising:
a framing and windowing module 100, configured to perform framing and windowing processing on input speech to obtain multiple speech signals;
a speech window generating module 200, configured to combine the multiple speech signals to form multiple speech windows;
a posterior probability calculation module 300, configured to input the plurality of speech windows into a trained deep neural network wake-up model for processing, to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
a decoding matrix generating module 400, configured to combine the posterior probabilities of the multiple speech windows to form a decoding matrix;
the determining module 500 is configured to calculate a score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determine that the decoding matrix contains the specified keyword.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in an embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The database of the computer device is used for storing low-computation speech recognition data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the low-computation speech recognition method.
It will be understood by those skilled in the art that the structure shown in fig. 3 is only a block diagram of a part of the structure related to the present application, and does not constitute a limitation to the computer device to which the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the low-computation speech recognition method. It is to be understood that the computer-readable storage medium in the present embodiment may be a volatile or a non-volatile readable storage medium.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all the equivalent structures or equivalent processes that can be directly or indirectly applied to other related technical fields by using the contents of the specification and the drawings of the present application are also included in the scope of the present application.
Claims (8)
1. A low-computational speech recognition method, comprising:
performing framing and windowing processing on input voice to obtain a plurality of voice signals;
combining the plurality of voice signals to form a plurality of voice windows;
inputting the plurality of voice windows into a trained deep neural network wake-up model for processing to obtain, for each voice window, a column of posterior probabilities over phoneme label types, wherein the deep neural network wake-up model comprises a feature input layer, a hidden layer, an output layer and an attention layer, the hidden layer comprises at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function, and wherein the processing comprises the following steps: taking the voice window currently being processed as the current voice window; acquiring a preset number of voice windows closest to the current voice window as historical voice windows; splicing the current voice window and the historical voice windows at the fully-connected layer to generate a spliced voice window; multiplying the spliced voice window by a preset weight matrix, and summing in the window direction of the spliced voice window to obtain a summation result; and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current voice window;
combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, judging that the decoding matrix contains the specified keyword.
2. The low-computational-power speech recognition method of claim 1, wherein the step of performing frame windowing on the input speech to obtain the plurality of speech signals comprises:
acquiring input voice;
and carrying out windowing and framing processing on the voice according to a voice window with a preset window length and a preset time interval to obtain a plurality of voice signals.
3. The low-computational speech recognition method of claim 1 wherein the step of combining the plurality of speech signals to form a plurality of speech windows comprises:
and combining the plurality of voice signals in sequence according to a preset window step length by taking the first voice signal as an initial voice signal and the preset window length as a window length to form a plurality of voice windows.
4. The low-computational speech recognition method of claim 1, wherein the training step of the deep neural network wake-up model comprises:
training an initial deep-neural-network model under a temporal classification criterion using a general corpus until the network converges, to obtain a preliminary convergence model based on the deep neural network;
and training the preliminary convergence model a second time using the general corpus and a specific wake-up word corpus until it converges a second time, to obtain the deep neural network wake-up model.
5. The low-computation voice recognition method of claim 1, wherein the step of calculating a score for a given keyword in the decoding matrix comprises:
acquiring the length of the decoding matrix;
and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix according to a preset step length.
6. A low-computation speech recognition apparatus, comprising:
the frame windowing module is used for performing frame windowing processing on input voice to obtain a plurality of voice signals;
the voice window generating module is used for combining the voice signals to form a plurality of voice windows;
a posterior probability calculation module, configured to input the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types, wherein the deep neural network wake-up model includes a feature input layer, a hidden layer, an output layer and an attention layer, the hidden layer includes at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function, and wherein the processing includes: taking the voice window currently being processed as the current voice window; acquiring a preset number of voice windows closest to the current voice window as historical voice windows; splicing the current voice window and the historical voice windows at the fully-connected layer to generate a spliced voice window; multiplying the spliced voice window by a preset weight matrix, and summing in the window direction of the spliced voice window to obtain a summation result; and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current voice window;
a decoding matrix generating module for combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and the judging module is used for calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold value, judging that the specified keyword is contained in the decoding matrix.
7. A computer arrangement comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the low-computation speech recognition method of any of claims 1 to 5.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the low-computation speech recognition method of any one of claims 1 to 5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211014435.2A CN115101063B (en) | 2022-08-23 | 2022-08-23 | Low-computation-power voice recognition method, device, equipment and medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115101063A CN115101063A (en) | 2022-09-23 |
| CN115101063B true CN115101063B (en) | 2023-01-06 |
Family
ID=83301766
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211014435.2A Active CN115101063B (en) | 2022-08-23 | 2022-08-23 | Low-computation-power voice recognition method, device, equipment and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115101063B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117275484B (en) * | 2023-11-17 | 2024-02-20 | 深圳市友杰智新科技有限公司 | Command word recognition method, device, equipment and medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB0027178D0 (en) * | 2000-11-07 | 2000-12-27 | Canon Kk | Speech processing system |
| CN110767231A (en) * | 2019-09-19 | 2020-02-07 | 平安科技(深圳)有限公司 | A wake-up word recognition method and device for voice-controlled equipment based on time-delay neural network |
| CN114360500B (en) * | 2021-09-14 | 2024-08-13 | 腾讯科技(深圳)有限公司 | Speech recognition method and device, electronic equipment and storage medium |
- 2022-08-23: application CN202211014435.2A filed in China; granted as patent CN115101063B (status: Active)
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5608841A (en) * | 1992-06-03 | 1997-03-04 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus for pattern recognition employing the hidden Markov model |
| WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
| CN110619871A (en) * | 2018-06-20 | 2019-12-27 | 阿里巴巴集团控股有限公司 | Voice wake-up detection method, device, equipment and storage medium |
| CN110782882A (en) * | 2019-11-04 | 2020-02-11 | 科大讯飞股份有限公司 | Voice recognition method and device, electronic equipment and storage medium |
| CN113823273A (en) * | 2021-07-23 | 2021-12-21 | 腾讯科技(深圳)有限公司 | Audio signal processing method, audio signal processing device, electronic equipment and storage medium |
| CN113963688A (en) * | 2021-12-23 | 2022-01-21 | 深圳市友杰智新科技有限公司 | Training method of voice awakening model, awakening word detection method and related equipment |
| CN114120979A (en) * | 2022-01-25 | 2022-03-01 | 荣耀终端有限公司 | Optimization method, training method, device and medium of voice recognition model |
| CN114783438A (en) * | 2022-06-17 | 2022-07-22 | 深圳市友杰智新科技有限公司 | Adaptive decoding method, apparatus, computer device and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| Research on the application of end-to-end speech recognition based on the attention mechanism; Liu Baiji; China Master's Theses Full-text Database (Signal and Information Processing); 2021-02-21; full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115101063A (en) | 2022-09-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: Low computational power speech recognition methods, devices, equipment, and media; Granted publication date: 20230106; Pledgee: Shenzhen Shunshui Incubation Management Co., Ltd.; Pledgor: SHENZHEN YOUJIE ZHIXIN TECHNOLOGY Co., Ltd.; Registration number: Y2024980029366 |