Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present invention. The method of this embodiment is applicable to intelligent question-answering scenarios and may be performed by the voice processing apparatus according to an embodiment of the present invention, which may be implemented in software and/or hardware. In a specific embodiment, the apparatus may be integrated in an electronic device such as a computer or a server. The following embodiments are described taking the integration of the apparatus in an electronic device as an example. Referring to Fig. 1, the method may specifically include the following steps:
Step 101, after playing the guiding information, acquiring first voice information input by a user, wherein the guiding information is generated based on second voice information input by the user.
The second voice information input by the user may be a voice instruction, a question, chitchat, or other voice information issued by the user according to the user's own needs. After the second voice information is received, it may be converted into text information by a speech recognition system to facilitate subsequent processing. Of course, the user may also input the second voice information directly in text form through the device. After the second voice information is received, semantic analysis may be performed on it through a semantic understanding model, and the guiding information is then generated based on the second voice information input by the user. The guiding information is played for the user, and text information corresponding to the guiding information may be displayed for the user at the same time. After receiving the guiding information, the user may input the first voice information according to the guiding information, and the first voice information input by the user is thereby obtained.
For example, suppose the user wants to close the windows through a vehicle-mounted voice assistant, and the second voice information input by the user is "window is closed". After receiving the second voice information, the voice assistant performs semantic analysis on it and generates guiding information according to the analysis result. Assuming the generated guiding information is "close all windows", a prompt such as "You can say 'close all windows'" or "Do you want to close all the windows?" is played for the user. The user may then input first voice information, such as "close all windows" or "yes", based on the guidance, and the first voice information input by the user is acquired.
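The flow of step 101 can be pictured with the minimal Python sketch below. The helper functions (speech_to_text, generate_guidance, play, listen) are illustrative stubs introduced here for the example, not components specified by this embodiment:

```python
# Illustrative stubs; a real system would use an ASR engine and TTS.
def speech_to_text(audio):            # hypothetical ASR call
    return audio                      # stub: input already given as text

def generate_guidance(second_text):   # hypothetical guidance generator
    # e.g. map "window is closed" to the parseable "close all windows"
    return "close all windows" if "window" in second_text else second_text

def play(message):                    # stands in for audio playback
    print(message)

def listen():                         # stands in for audio capture
    return input("user> ")

def step_101(second_voice_input):
    """Step 101: play guidance generated from the second voice
    information, then acquire the first voice information."""
    second_text = speech_to_text(second_voice_input)
    guidance = generate_guidance(second_text)
    play(f'You can say "{guidance}"')
    first_text = speech_to_text(listen())
    return second_text, guidance, first_text
```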
Step 102, if the first voice information is matched with the guiding information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result.
Specifically, after the first voice information input by the user is received, it is matched against the guiding information. If the first voice information matches the guiding information, a first voice analysis result is obtained according to the first voice information. The analysis result of the first voice information is the analysis result of the guiding information, and the second voice information is associated with the first voice analysis result. For example, the second voice information input by the user is "window is closed", the generated guiding information is "close all windows", and the first voice information input by the user is "close all windows". A semantic matching model determines that the first voice information matches the guiding information, and the first voice analysis result, namely that all of the user's windows are to be closed, is obtained from the guiding information and the first voice information. It is thereby learned that when the user inputs "window is closed", the intended result is the analysis result of "close all windows".
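The matching in step 102 might be sketched as follows; a plain string-similarity check (difflib) plus a short affirmation list stands in for the semantic matching model, which is an assumption made purely for illustration:

```python
from difflib import SequenceMatcher

AFFIRMATIONS = {"yes", "ok", "sure"}   # assumed affirmative replies

def matches_guidance(first_text: str, guidance: str,
                     threshold: float = 0.8) -> bool:
    """Return True if the first voice information matches the guidance."""
    reply = first_text.strip().lower()
    if reply in AFFIRMATIONS:          # the user confirmed the guidance
        return True
    # Surface string similarity stands in for the semantic matching model.
    return SequenceMatcher(None, reply, guidance.lower()).ratio() >= threshold

print(matches_guidance("Close all windows", "close all windows"))  # True
print(matches_guidance("yes", "close all windows"))                # True
```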
Step 103, playing the first feedback information according to the first voice analysis result.
Specifically, after the user inputs the second voice information, guiding information is generated for the user according to the second voice information, and the device waits for the user to input the first voice information. After the user inputs the first voice information, if it matches the guiding information, the first voice analysis result is generated according to the analysis results of the guiding information and the first voice information, and the first feedback information is played or displayed for the user according to the first voice analysis result. The second voice information is associated with the first voice analysis result.
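The association itself can be as simple as a key-value record. In the sketch below, a plain dict stands in for the associative store (illustrative only), so that a previously un-parseable utterance resolves directly on its next occurrence:

```python
# In-memory association: second voice information -> first voice
# analysis result, so the same utterance is understood directly later.
associations: dict[str, dict] = {}

def remember(second_text: str, first_result: dict) -> None:
    """Associate the second voice information with the first voice
    analysis result obtained through the guidance path."""
    associations[second_text] = first_result

def parse_with_associations(text: str, semantic_parse):
    """Try the learned association first, then fall back to the model."""
    return associations.get(text) or semantic_parse(text)

remember("window is closed",
         {"intent": "window", "action": "close", "scope": "all"})
print(parse_with_associations("window is closed", lambda t: None))
```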
According to the technical solution of this embodiment, after the guiding information is played, the first voice information input by the user is obtained; if the first voice information matches the guiding information, a first voice analysis result is obtained according to the first voice information, the second voice information is associated with the first voice analysis result, and the first feedback information is played according to the first voice analysis result. The method dynamically guides the user in inputting voice information, which improves the working efficiency of voice processing, and associates voice information that cannot be understood with an understandable voice analysis result, which realizes online learning, improves the accuracy of voice processing results, and further improves the user experience.
Fig. 2 is another schematic flowchart of a voice processing method according to an embodiment of the present invention, in which the steps of the voice processing method are refined on the basis of the above embodiment. As shown in Fig. 2, the method of this embodiment specifically includes the following steps:
Step 201, obtaining second voice information input by a user, and parsing the second voice information.
Specifically, the second voice information input by the user is received. When the second voice information is voice, it is converted into text information; of course, the user may also directly input text as the second voice information. After the second voice information is received, semantic analysis is performed on it through the semantic understanding model.
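One way to signal a parsing failure to the next step is sketched below; the exception-based failure signal and the toy model are assumptions made for the example, not the semantic understanding model of this embodiment:

```python
def step_201(second_text: str, semantic_parse):
    """Step 201: parse the second voice information; None signals that
    parsing failed and step 202 (database lookup) should follow."""
    try:
        return semantic_parse(second_text)   # hypothetical semantic model
    except ValueError:                       # assumed failure signal
        return None

# A toy model that only understands one utterance:
def toy_parse(text):
    if text == "close all windows":
        return {"intent": "window", "action": "close", "scope": "all"}
    raise ValueError("cannot parse")

print(step_201("window is closed", toy_parse))   # None -> go to step 202
```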
Step 202, if parsing of the second voice information fails, determining whether third voice information exists in a first database, wherein the similarity between the third voice information and the second voice information is greater than or equal to a preset threshold.

Specifically, if parsing of the second voice information fails, it is determined whether third voice information exists in the first database. The first database may be constructed in real time based on the user interaction log, and it stores guidance information that can be correctly parsed by the semantic understanding system, as shown in Table 1 below:
TABLE 1
| Guidance information | Analysis result |
| --- | --- |
| Close the sunshade curtain | Intent: close the sunshade curtain |
| Close the sunroof halfway | Intent: sunroof control; action: close; value: 50% |
| Beijing weather tomorrow | Intent: weather; place: Beijing; time: tomorrow |
It should be noted that the guidance information and analysis results shown in Table 1 are merely examples and do not constitute a limitation on the actual guidance information and analysis results; in practical applications, the relevant data may be adjusted according to actual needs, which is not specifically limited here.
Further, it is determined whether third voice information exists in the first database, wherein the similarity between the third voice information and the second voice information is greater than or equal to the preset threshold, as shown in Table 2 below. It should be noted that the data shown in Table 2 are only examples and do not constitute a limitation on the actual data; in practical applications, the relevant data may be adjusted according to actual needs, which is not specifically limited here.
TABLE 2
| Second voice information | Third voice information | Similarity score |
| --- | --- | --- |
| Screen window close | Close the sunshade curtain | 0.9800629019737244 |
| Screen window close | Close the window well | 0.5422488117218018 |
For example, when the second voice information is "screen window close" and the similarity threshold is set to 0.8, the third voice information is voice information whose similarity with the second voice information is 0.8 or more. It is then determined whether such third voice information exists in the first database. Assuming that the first database contains "close the sunshade curtain" and that its similarity to "screen window close" is greater than 0.8, it can be determined that third voice information exists in the first database. If several pieces of third voice information exist in the first database, the one with the highest similarity is selected as the third voice information.
Different thresholds may be set for different types of second voice information, and information whose similarity with the second voice information is greater than or equal to the preset threshold is taken as the third voice information. When the second voice information input by the user is of a type such as a place query, the similarity threshold may be adaptively raised. For example, if the second voice information is "navigate to Technology Road" and the first database stores a similar navigation instruction for a different destination, setting the similarity threshold too low would cause that different destination to be taken as the third voice information, which affects the user experience.
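A minimal sketch of the lookup in step 202, with per-type thresholds as just described, is given below. difflib string similarity stands in for whatever similarity model produced the scores in Table 2, and the threshold values are illustrative only:

```python
from difflib import SequenceMatcher

# Per-type thresholds: place queries such as navigation use a stricter
# threshold, as described above. The values are illustrative.
THRESHOLDS = {"place": 0.95, "default": 0.8}

def find_third_voice_info(second_text: str, first_db: dict,
                          info_type: str = "default"):
    """Return the first-database entry most similar to the second voice
    information, provided it clears the threshold for its type."""
    threshold = THRESHOLDS.get(info_type, THRESHOLDS["default"])
    best, best_score = None, 0.0
    for guidance_text in first_db:   # keys are parseable guidance texts
        score = SequenceMatcher(None, second_text, guidance_text).ratio()
        if score > best_score:
            best, best_score = guidance_text, score
    # If several candidates clear the threshold, the highest-scoring one
    # is selected as the third voice information.
    return best if best_score >= threshold else None

first_db = {"close the sunshade curtain": {"intent": "sunshade", "action": "close"}}
print(find_third_voice_info("close the sunshade", first_db))
```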
Step 203, if the third voice information exists in the first database, generating guiding information according to the third voice information.
After it is determined that third voice information exists in the first database, the third voice information with the highest similarity value is read from the first database and parsed through the semantic understanding model, and guiding information is generated according to the third voice information. The guiding information is then played for the user to guide the user.
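Building on the lookup sketch above (find_third_voice_info), step 203 might read as follows; printing stands in for audio playback, and the direct reuse of the third voice information as the guidance text is an assumption for illustration:

```python
def step_203(second_text: str, first_db: dict):
    """Step 203: if third voice information exists, use it to generate
    the guiding information that is played for the user."""
    third = find_third_voice_info(second_text, first_db)  # best match
    if third is None:
        return None   # fall through to the exit strategy (steps 207-209)
    # The third voice information is itself parseable, so it can serve
    # directly as the guidance the user is asked to confirm or repeat.
    guidance = third
    print(f'You can say "{guidance}"')   # stands in for audio playback
    return guidance
```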
Step 204, after playing the guiding information, obtaining the first voice information input by the user.
After receiving the guiding information, the user can input the first voice information according to the guiding information, and further, the first voice information input by the user is obtained.
Step 205, if the first voice information matches the guiding information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result.
The analysis result of the first voice information is the analysis result of the guiding information, and the second voice information is associated with the first voice analysis result.
Specifically, after the user inputs the second voice information, generating guide information for the user according to the second voice information, and waiting for the user to input the first voice information. After the user inputs the first voice information, if the first voice information is matched with the guide information, a first voice analysis result is generated according to the analysis results of the guide information and the first voice information, and the first voice analysis result is played or displayed for the user.
In this embodiment, optionally, the second voice information is associated with the first voice analysis result and stored in a second database.
Specifically, when the first voice information matches the guiding information and the first voice analysis result has been obtained according to the first voice information, the first voice analysis result is associated with the second voice information as its voice analysis result. The second voice information and the first voice analysis result may then be stored in the second database, which can serve as a training corpus for subsequent upgrade iterations of the semantic understanding model.
Associating the second voice information with the first voice analysis result and storing them in the second database facilitates subsequent upgrade iterations of the semantic understanding model, which can improve the accuracy of semantic analysis results and further improve the user experience.
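As a sketch of the second database, the snippet below persists the (second voice information, first voice analysis result) pairs with Python's built-in sqlite3; the schema and file name are assumptions chosen only to illustrate the idea of accumulating a training corpus:

```python
import sqlite3

# The second database stores (second voice information, first voice
# analysis result) pairs as a training corpus for model iteration.
conn = sqlite3.connect("second_db.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS corpus (second_text TEXT, parse_result TEXT)"
)

def store_training_pair(second_text: str, parse_result: str) -> None:
    """Persist one (second voice information, analysis result) pair."""
    conn.execute("INSERT INTO corpus VALUES (?, ?)",
                 (second_text, parse_result))
    conn.commit()

store_training_pair("window is closed",
                    '{"intent": "window", "action": "close", "scope": "all"}')
```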
If the first voice information does not match the guiding information, the second voice information input by the user is acquired, and semantic analysis is performed on the second voice information through the semantic understanding model (step 201 is executed).
Step 206, playing the first feedback information according to the first voice analysis result.
By determining the third voice information through the similarity threshold and generating the guiding information according to the third voice information, the accuracy of the results fed back to the user can be improved, further improving the user experience.
Step 207, if the third voice information does not exist in the first database, determining whether the second voice information is chitchat information.
After it is determined that no third voice information exists in the first database, it is further determined whether the second voice information is chitchat information. For example, the second voice information input by the user is "I'm bored", and the first database contains no third voice information whose similarity with "I'm bored" is greater than the preset threshold; it is then determined whether "I'm bored" is chitchat information.
Step 208, if the second voice information is chitchat information, playing third feedback information, wherein the third feedback information is a chitchat response.
If the second voice information input by the user is chitchat information, a chitchat response is played for the user, such as "Then let's chat".
Step 209, if the second voice information is not chitchat information, playing fourth feedback information, wherein the fourth feedback information is used to indicate that voice understanding has failed.
If it is determined that the second voice information input by the user is not chitchat information and no third voice information exists in the first database, information indicating a voice understanding failure is fed back to the user, such as "Sorry, I didn't understand".
Through the above steps, when no suitable guiding information can be provided for the user's input, an exit strategy is adopted to address the user's needs, a chitchat service is provided for the user, and the user experience is improved.
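The exit strategy of steps 207 to 209 can be condensed into the sketch below; the keyword-based is_chitchat check is a toy stand-in for a real chitchat classifier, which this embodiment does not specify:

```python
CHITCHAT_MARKERS = {"bored", "chat", "hello"}   # toy marker list

def is_chitchat(text: str) -> bool:
    """Stand-in for a real chitchat classifier (an assumption here)."""
    return any(marker in text.lower() for marker in CHITCHAT_MARKERS)

def exit_strategy(second_text: str) -> str:
    """Steps 207-209: no third voice information was found, so either
    chat with the user or report an understanding failure."""
    if is_chitchat(second_text):
        return "Then let's chat!"                 # third feedback information
    return "Sorry, I didn't understand that."     # fourth feedback information

print(exit_strategy("I'm bored"))        # chitchat branch
print(exit_strategy("frobnicate it"))    # understanding-failure branch
```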
Step 210, if the second voice information is parsed successfully, obtaining a second voice analysis result, and playing second feedback information according to the second voice analysis result.
For example, the second voice information input by the user is "all windows are closed", the analysis result of the second voice information, which is intended to be "all windows are closed", is analyzed through the semantic understanding model ("all windows are closed" stored in the first database), and the semantic understanding model can obtain the analysis result through voice analysis). The feedback "good" is played for the user, all windows are closed for you.
Step 211, associating the second voice information with the second voice analysis result and storing them in the first database.
For example, after the second voice information ("close all windows") is parsed successfully, the second voice information is associated with the second voice analysis result and stored in the first database, so that the first database then holds the analysis result of "close all windows" for subsequent lookups.
In this embodiment, optionally, the first database and the second database are constructed in real time based on the user interaction log.
By constructing the first database and the second database in real time based on the user interaction log, no manual labeling is needed, human resources are saved, and the working efficiency of the voice processing method can be further improved.
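Constructing the first database from the interaction log might look like the sketch below; the JSON-lines log format and its field names are assumptions, since the embodiment does not specify a log format:

```python
import json

def build_first_db(log_path: str) -> dict:
    """Build the first database in real time from the user interaction
    log. Each line is assumed to be one JSON record such as
    {"text": ..., "result": ..., "parse_ok": true}."""
    first_db = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record.get("parse_ok"):   # keep only correctly parsed turns
                first_db[record["text"]] = record["result"]
    return first_db
```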
In this embodiment, the second voice information input by the user is acquired and parsed; if parsing fails, guiding information is generated according to the second voice information and played; if parsing succeeds, a second voice analysis result is obtained, second feedback information is played according to it, and the second voice information is associated with the second voice analysis result and stored in the first database. By matching second voice information that cannot be understood with third voice information that can be understood, the user is dynamically guided in inputting voice information, the learning efficiency of the voice processing method is improved, iterative learning of the voice processing method is realized, and the accuracy of voice processing results is improved.
Fig. 3 is a block diagram of a speech processing device according to an embodiment of the present invention, where the device is adapted to execute the speech processing method according to the embodiment of the present invention. As shown in fig. 3, the apparatus may specifically include:
The obtaining module 301 is configured to obtain first voice information input by a user after playing guide information, where the guide information is generated based on second voice information input by the user;
The parsing module 302 is configured to obtain a first voice parsing result according to the first voice information if the first voice information is matched with the guiding information, where the second voice information is associated with the first voice parsing result;
and the feedback module 303 is configured to play the first feedback information according to the first voice analysis result.
Optionally, the obtaining module 301 is further configured to obtain the second voice information input by the user, and parse the second voice information;
The parsing module 302 is further configured to generate and play the guiding information according to the second voice information if the parsing of the second voice information fails.
Optionally, the parsing module 302 is specifically configured to determine whether third voice information exists in the first database, where a similarity between the third voice information and the second voice information is greater than or equal to a preset threshold, and if the third voice information exists in the first database, generate the guiding information according to the third voice information.
Optionally, the parsing module 302 is further configured to obtain a second voice parsing result if the second voice information is parsed successfully, play second feedback information according to the second voice parsing result, and associate the second voice information with the second voice parsing result and store the second voice information in the first database.
Optionally, the parsing module 302 is further configured to determine whether the second voice information is chitchat information if the third voice information does not exist in the first database; play third feedback information if the second voice information is chitchat information, where the third feedback information is a chitchat response; and play fourth feedback information if the second voice information is not chitchat information, where the fourth feedback information is used to indicate that voice understanding has failed.
Optionally, the parsing module 302 is further configured to associate and store the second voice information with the first voice parsing result to a second database.
Optionally, the first database and the second database are constructed in real time based on user interaction logs.
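For orientation, the module layout of Fig. 3 could be mirrored in code roughly as follows; this class is a stub sketch under the assumptions above, not the apparatus's actual implementation:

```python
class VoiceProcessingDevice:
    """Module layout mirroring Fig. 3; the method bodies are stubs."""

    def obtain(self, guidance: str) -> str:
        """Obtaining module 301: acquire the first voice information
        after the guidance has been played."""
        return input(f'You can say "{guidance}"\nuser> ')

    def parse(self, first_text: str, guidance: str):
        """Parsing module 302: return the first voice analysis result
        when the first voice information matches the guidance."""
        if first_text.strip().lower() == guidance.lower():
            return {"guidance": guidance, "matched": True}
        return None

    def feed_back(self, result) -> None:
        """Feedback module 303: play the first feedback information."""
        print("OK, done." if result else "Sorry, I didn't understand that.")
```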
The voice processing device provided by the embodiment of the invention can execute the voice processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Reference is made to the description of any method embodiment of the invention for details not described in this embodiment.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 4 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 4, the electronic device 12 is in the form of a general purpose computing device. The components of the electronic device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. In the electronic device 12 of this embodiment, the display 24 is not provided as a separate body but is embedded in a mirror surface, and the display surface of the display 24 and the mirror surface are visually integrated when the display 24 is not displaying. Also, the electronic device 12 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 12, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running a program stored in the system memory 28, for example, to implement a voice processing method provided in an embodiment of the present invention, wherein after playing guidance information, first voice information input by a user is obtained, wherein the guidance information is generated based on second voice information input by the user, if the first voice information matches with the guidance information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result, and first feedback information is played according to the first voice analysis result.
The embodiment of the invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements a voice processing method as provided by all the embodiments of the invention, wherein after playing guide information, first voice information input by a user is obtained, and the guide information is generated based on second voice information input by the user; if the first voice information is matched with the guide information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result, and first feedback information is played according to the first voice analysis result. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.