Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present invention. The method of this embodiment is applicable to intelligent question-answering scenarios and may be performed by the voice processing apparatus according to an embodiment of the present invention, which may be implemented in software and/or hardware. In a specific embodiment, the apparatus may be integrated in an electronic device such as a computer or a server. The following embodiments are described taking the integration of the apparatus in an electronic device as an example. Referring to Fig. 1, the method may specifically include the following steps:
Step 101, after playing the guiding information, acquiring first voice information input by a user, wherein the guiding information is generated based on second voice information input by the user.
The second voice information input by the user may be a voice instruction, a question, chitchat, or other voice information issued by the user according to the user's own needs. After the second voice information is received, it may be converted into text information by a speech recognition system to facilitate subsequent processing. Of course, the user may also input the second voice information directly in text form through the device. After the second voice information is received, semantic analysis may be performed on it through a semantic understanding model, and the guiding information is then generated based on the second voice information input by the user. The guiding information is played for the user, and text information corresponding to the guiding information may be displayed for the user at the same time. After receiving the guiding information, the user may input the first voice information according to the guiding information, and the first voice information input by the user is thereby obtained.
For example, suppose the user wants to close the windows through a vehicle-mounted voice assistant, and the second voice information input by the user is "window is closed". After receiving the second voice information, the voice assistant performs semantic analysis on it and generates guiding information according to the analysis result. Assuming the generated guiding information is "close all windows", a prompt such as "You can say 'close all windows'" or "Do you want to close all the windows?" is played for the user. The user may then input first voice information, such as "close all windows" or "yes", based on the guidance, and the first voice information input by the user is acquired.
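The flow of step 101 can be pictured with the minimal Python sketch below. The helper functions (speech_to_text, generate_guidance, play, listen) are illustrative stubs introduced here for the example, not components specified by this embodiment:

```python
# Illustrative stubs; a real system would use an ASR engine and TTS.
def speech_to_text(audio):            # hypothetical ASR call
    return audio                      # stub: input already given as text

def generate_guidance(second_text):   # hypothetical guidance generator
    # e.g. map "window is closed" to the parseable "close all windows"
    return "close all windows" if "window" in second_text else second_text

def play(message):                    # stands in for audio playback
    print(message)

def listen():                         # stands in for audio capture
    return input("user> ")

def step_101(second_voice_input):
    """Step 101: play guidance generated from the second voice
    information, then acquire the first voice information."""
    second_text = speech_to_text(second_voice_input)
    guidance = generate_guidance(second_text)
    play(f'You can say "{guidance}"')
    first_text = speech_to_text(listen())
    return second_text, guidance, first_text
```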
Step 102, if the first voice information is matched with the guiding information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result.
Specifically, after the first voice information input by the user is received, it is matched against the guiding information. If the first voice information matches the guiding information, a first voice analysis result is obtained according to the first voice information. The analysis result of the first voice information is the analysis result of the guiding information, and the second voice information is associated with the first voice analysis result. For example, the second voice information input by the user is "window is closed", the generated guiding information is "close all windows", and the first voice information input by the user is "close all windows". A semantic matching model determines that the first voice information matches the guiding information, and the first voice analysis result, namely that all of the user's windows are to be closed, is obtained from the guiding information and the first voice information. It is thereby learned that when the user inputs "window is closed", the intended result is the analysis result of "close all windows".
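The matching in step 102 might be sketched as follows; a plain string-similarity check (difflib) plus a short affirmation list stands in for the semantic matching model, which is an assumption made purely for illustration:

```python
from difflib import SequenceMatcher

AFFIRMATIONS = {"yes", "ok", "sure"}   # assumed affirmative replies

def matches_guidance(first_text: str, guidance: str,
                     threshold: float = 0.8) -> bool:
    """Return True if the first voice information matches the guidance."""
    reply = first_text.strip().lower()
    if reply in AFFIRMATIONS:          # the user confirmed the guidance
        return True
    # Surface string similarity stands in for the semantic matching model.
    return SequenceMatcher(None, reply, guidance.lower()).ratio() >= threshold

print(matches_guidance("Close all windows", "close all windows"))  # True
print(matches_guidance("yes", "close all windows"))                # True
```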
Step 103, playing the first feedback information according to the first voice analysis result.
Specifically, after the user inputs the second voice information, guiding information is generated for the user according to the second voice information, and the device waits for the user to input the first voice information. After the user inputs the first voice information, if it matches the guiding information, the first voice analysis result is generated according to the analysis results of the guiding information and the first voice information, and the first feedback information is played or displayed for the user according to the first voice analysis result. The second voice information is associated with the first voice analysis result.
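The association itself can be as simple as a key-value record. In the sketch below, a plain dict stands in for the associative store (illustrative only), so that a previously un-parseable utterance resolves directly on its next occurrence:

```python
# In-memory association: second voice information -> first voice
# analysis result, so the same utterance is understood directly later.
associations: dict[str, dict] = {}

def remember(second_text: str, first_result: dict) -> None:
    """Associate the second voice information with the first voice
    analysis result obtained through the guidance path."""
    associations[second_text] = first_result

def parse_with_associations(text: str, semantic_parse):
    """Try the learned association first, then fall back to the model."""
    return associations.get(text) or semantic_parse(text)

remember("window is closed",
         {"intent": "window", "action": "close", "scope": "all"})
print(parse_with_associations("window is closed", lambda t: None))
```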
According to the technical solution of this embodiment, after the guiding information is played, the first voice information input by the user is obtained; if the first voice information matches the guiding information, a first voice analysis result is obtained according to the first voice information, the second voice information is associated with the first voice analysis result, and the first feedback information is played according to the first voice analysis result. The method dynamically guides the user in inputting voice information, which improves the working efficiency of voice processing, and associates voice information that cannot be understood with an understandable voice analysis result, which realizes online learning, improves the accuracy of voice processing results, and further improves the user experience.
Fig. 2 is another schematic flowchart of a voice processing method according to an embodiment of the present invention, in which the steps of the voice processing method are refined on the basis of the above embodiment. As shown in Fig. 2, the method of this embodiment specifically includes the following steps:
Step 201, obtaining second voice information input by a user, and parsing the second voice information.
Specifically, the second voice information input by the user is received. When the second voice information is voice, it is converted into text information; of course, the user may also directly input text as the second voice information. After the second voice information is received, semantic analysis is performed on it through the semantic understanding model.
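One way to signal a parsing failure to the next step is sketched below; the exception-based failure signal and the toy model are assumptions made for the example, not the semantic understanding model of this embodiment:

```python
def step_201(second_text: str, semantic_parse):
    """Step 201: parse the second voice information; None signals that
    parsing failed and step 202 (database lookup) should follow."""
    try:
        return semantic_parse(second_text)   # hypothetical semantic model
    except ValueError:                       # assumed failure signal
        return None

# A toy model that only understands one utterance:
def toy_parse(text):
    if text == "close all windows":
        return {"intent": "window", "action": "close", "scope": "all"}
    raise ValueError("cannot parse")

print(step_201("window is closed", toy_parse))   # None -> go to step 202
```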
Step 202, if parsing of the second voice information fails, determining whether third voice information exists in a first database, wherein the similarity between the third voice information and the second voice information is greater than or equal to a preset threshold.

Specifically, if parsing of the second voice information fails, it is determined whether third voice information exists in the first database. The first database may be constructed in real time based on the user interaction log, and it stores guidance information that can be correctly parsed by the semantic understanding system, as shown in Table 1 below:
TABLE 1
| Guidance information | Analysis result |
| --- | --- |
| Close the sunshade curtain | Intent: close the sunshade curtain |
| Close the sunroof halfway | Intent: sunroof control; action: close; value: 50% |
| Beijing weather tomorrow | Intent: weather; place: Beijing; time: tomorrow |
It should be noted that the guidance information and analysis results shown in Table 1 are merely examples and do not constitute a limitation on the actual guidance information and analysis results; in practical applications, the relevant data may be adjusted according to actual needs, which is not specifically limited here.
Further, it is determined whether third voice information exists in the first database, wherein the similarity between the third voice information and the second voice information is greater than or equal to the preset threshold, as shown in Table 2 below. It should be noted that the data shown in Table 2 are only examples and do not constitute a limitation on the actual data; in practical applications, the relevant data may be adjusted according to actual needs, which is not specifically limited here.
TABLE 2
| Second voice information | Third voice information | Similarity score |
| --- | --- | --- |
| Screen window close | Close the sunshade curtain | 0.9800629019737244 |
| Screen window close | Close the window well | 0.5422488117218018 |
For example, when the second voice information is "screen window close" and the similarity threshold is set to 0.8, the third voice information is voice information whose similarity with the second voice information is 0.8 or more. It is then determined whether such third voice information exists in the first database. Assuming that the first database contains "close the sunshade curtain" and that its similarity to "screen window close" is greater than 0.8, it can be determined that third voice information exists in the first database. If several pieces of third voice information exist in the first database, the one with the highest similarity is selected as the third voice information.
Different thresholds may be set for different types of second voice information, and information whose similarity with the second voice information is greater than or equal to the preset threshold is taken as the third voice information. When the second voice information input by the user is of a type such as a place query, the similarity threshold may be adaptively raised. For example, if the second voice information is "navigate to Technology Road" and the first database stores a similar navigation instruction for a different destination, setting the similarity threshold too low would cause that different destination to be taken as the third voice information, which affects the user experience.
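A minimal sketch of the lookup in step 202, with per-type thresholds as just described, is given below. difflib string similarity stands in for whatever similarity model produced the scores in Table 2, and the threshold values are illustrative only:

```python
from difflib import SequenceMatcher

# Per-type thresholds: place queries such as navigation use a stricter
# threshold, as described above. The values are illustrative.
THRESHOLDS = {"place": 0.95, "default": 0.8}

def find_third_voice_info(second_text: str, first_db: dict,
                          info_type: str = "default"):
    """Return the first-database entry most similar to the second voice
    information, provided it clears the threshold for its type."""
    threshold = THRESHOLDS.get(info_type, THRESHOLDS["default"])
    best, best_score = None, 0.0
    for guidance_text in first_db:   # keys are parseable guidance texts
        score = SequenceMatcher(None, second_text, guidance_text).ratio()
        if score > best_score:
            best, best_score = guidance_text, score
    # If several candidates clear the threshold, the highest-scoring one
    # is selected as the third voice information.
    return best if best_score >= threshold else None

first_db = {"close the sunshade curtain": {"intent": "sunshade", "action": "close"}}
print(find_third_voice_info("close the sunshade", first_db))
```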
Step 203, if the third voice information exists in the first database, generating guiding information according to the third voice information.
After it is determined that third voice information exists in the first database, the third voice information with the highest similarity value is read from the first database and parsed through the semantic understanding model, and guiding information is generated according to the third voice information. The guiding information is then played for the user to guide the user.
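Building on the lookup sketch above (find_third_voice_info), step 203 might read as follows; printing stands in for audio playback, and the direct reuse of the third voice information as the guidance text is an assumption for illustration:

```python
def step_203(second_text: str, first_db: dict):
    """Step 203: if third voice information exists, use it to generate
    the guiding information that is played for the user."""
    third = find_third_voice_info(second_text, first_db)  # best match
    if third is None:
        return None   # fall through to the exit strategy (steps 207-209)
    # The third voice information is itself parseable, so it can serve
    # directly as the guidance the user is asked to confirm or repeat.
    guidance = third
    print(f'You can say "{guidance}"')   # stands in for audio playback
    return guidance
```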
Step 204, after playing the guiding information, obtaining the first voice information input by the user.
After receiving the guiding information, the user can input the first voice information according to the guiding information, and further, the first voice information input by the user is obtained.
Step 205, if the first voice information matches the guiding information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result.
The analysis result of the first voice information is the analysis result of the guiding information, and the second voice information is associated with the first voice analysis result.
Specifically, after the user inputs the second voice information, generating guide information for the user according to the second voice information, and waiting for the user to input the first voice information. After the user inputs the first voice information, if the first voice information is matched with the guide information, a first voice analysis result is generated according to the analysis results of the guide information and the first voice information, and the first voice analysis result is played or displayed for the user.
In this embodiment, optionally, the second voice information is associated with the first voice analysis result and stored in a second database.
Specifically, when the first voice information matches the guiding information and the first voice analysis result has been obtained according to the first voice information, the first voice analysis result is associated with the second voice information as its voice analysis result. The second voice information and the first voice analysis result may then be stored in the second database, which can serve as a training corpus for subsequent upgrade iterations of the semantic understanding model.
Associating the second voice information with the first voice analysis result and storing them in the second database facilitates subsequent upgrade iterations of the semantic understanding model, which can improve the accuracy of semantic analysis results and further improve the user experience.
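As a sketch of the second database, the snippet below persists the (second voice information, first voice analysis result) pairs with Python's built-in sqlite3; the schema and file name are assumptions chosen only to illustrate the idea of accumulating a training corpus:

```python
import sqlite3

# The second database stores (second voice information, first voice
# analysis result) pairs as a training corpus for model iteration.
conn = sqlite3.connect("second_db.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS corpus (second_text TEXT, parse_result TEXT)"
)

def store_training_pair(second_text: str, parse_result: str) -> None:
    """Persist one (second voice information, analysis result) pair."""
    conn.execute("INSERT INTO corpus VALUES (?, ?)",
                 (second_text, parse_result))
    conn.commit()

store_training_pair("window is closed",
                    '{"intent": "window", "action": "close", "scope": "all"}')
```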
If the first voice information does not match the guiding information, the second voice information input by the user is acquired, and semantic analysis is performed on the second voice information through the semantic understanding model (step 201 is executed).
Step 206, playing the first feedback information according to the first voice analysis result.
By determining the third voice information through the similarity threshold and generating the guiding information according to the third voice information, the accuracy of the results fed back to the user can be improved, further improving the user experience.
Step 207, if the third voice information does not exist in the first database, determining whether the second voice information is chitchat information.
After it is determined that no third voice information exists in the first database, it is further determined whether the second voice information is chitchat information. For example, the second voice information input by the user is "I'm bored", and the first database contains no third voice information whose similarity with "I'm bored" is greater than the preset threshold; it is then determined whether "I'm bored" is chitchat information.
Step 208, if the second voice information is chitchat information, playing third feedback information, wherein the third feedback information is a chitchat response.
If the second voice information input by the user is chitchat information, a chitchat response is played for the user, such as "Then let's chat".
Step 209, if the second voice information is not chitchat information, playing fourth feedback information, wherein the fourth feedback information is used to indicate that voice understanding has failed.
If it is determined that the second voice information input by the user is not chitchat information and no third voice information exists in the first database, information indicating a voice understanding failure is fed back to the user, such as "Sorry, I didn't understand".
Through the above steps, when no suitable guiding information can be provided for the user's input, an exit strategy is adopted to address the user's needs, a chitchat service is provided for the user, and the user experience is improved.
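The exit strategy of steps 207 to 209 can be condensed into the sketch below; the keyword-based is_chitchat check is a toy stand-in for a real chitchat classifier, which this embodiment does not specify:

```python
CHITCHAT_MARKERS = {"bored", "chat", "hello"}   # toy marker list

def is_chitchat(text: str) -> bool:
    """Stand-in for a real chitchat classifier (an assumption here)."""
    return any(marker in text.lower() for marker in CHITCHAT_MARKERS)

def exit_strategy(second_text: str) -> str:
    """Steps 207-209: no third voice information was found, so either
    chat with the user or report an understanding failure."""
    if is_chitchat(second_text):
        return "Then let's chat!"                 # third feedback information
    return "Sorry, I didn't understand that."     # fourth feedback information

print(exit_strategy("I'm bored"))        # chitchat branch
print(exit_strategy("frobnicate it"))    # understanding-failure branch
```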
Step 210, if the second voice information is parsed successfully, obtaining a second voice analysis result, and playing second feedback information according to the second voice analysis result.
For example, the second voice information input by the user is "all windows are closed", the analysis result of the second voice information, which is intended to be "all windows are closed", is analyzed through the semantic understanding model ("all windows are closed" stored in the first database), and the semantic understanding model can obtain the analysis result through voice analysis). The feedback "good" is played for the user, all windows are closed for you.
Step 211, associating the second voice information with the second voice analysis result and storing them in the first database.
For example, after the second voice information ("close all windows") is parsed successfully, the second voice information is associated with the second voice analysis result and stored in the first database, so that the first database then holds the analysis result of "close all windows" for subsequent lookups.
In this embodiment, optionally, the first database and the second database are constructed in real time based on the user interaction log.
By constructing the first database and the second database in real time based on the user interaction log, no manual labeling is needed, human resources are saved, and the working efficiency of the voice processing method can be further improved.
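Constructing the first database from the interaction log might look like the sketch below; the JSON-lines log format and its field names are assumptions, since the embodiment does not specify a log format:

```python
import json

def build_first_db(log_path: str) -> dict:
    """Build the first database in real time from the user interaction
    log. Each line is assumed to be one JSON record such as
    {"text": ..., "result": ..., "parse_ok": true}."""
    first_db = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record.get("parse_ok"):   # keep only correctly parsed turns
                first_db[record["text"]] = record["result"]
    return first_db
```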
In this embodiment, the second voice information input by the user is acquired and parsed; if parsing fails, guiding information is generated according to the second voice information and played; if parsing succeeds, a second voice analysis result is obtained, second feedback information is played according to it, and the second voice information is associated with the second voice analysis result and stored in the first database. By matching second voice information that cannot be understood with third voice information that can be understood, the user is dynamically guided in inputting voice information, the learning efficiency of the voice processing method is improved, iterative learning of the voice processing method is realized, and the accuracy of voice processing results is improved.
Fig. 3 is a block diagram of a speech processing device according to an embodiment of the present invention, where the device is adapted to execute the speech processing method according to the embodiment of the present invention. As shown in fig. 3, the apparatus may specifically include:
The obtaining module 301 is configured to obtain first voice information input by a user after playing guide information, where the guide information is generated based on second voice information input by the user;
The parsing module 302 is configured to obtain a first voice parsing result according to the first voice information if the first voice information is matched with the guiding information, where the second voice information is associated with the first voice parsing result;
and the feedback module 303 is configured to play the first feedback information according to the first voice analysis result.
Optionally, the obtaining module 301 is further configured to obtain the second voice information input by the user, and parse the second voice information;
The parsing module 302 is further configured to generate and play the guiding information according to the second voice information if the parsing of the second voice information fails.
Optionally, the parsing module 302 is specifically configured to determine whether third voice information exists in the first database, where a similarity between the third voice information and the second voice information is greater than or equal to a preset threshold, and if the third voice information exists in the first database, generate the guiding information according to the third voice information.
Optionally, the parsing module 302 is further configured to obtain a second voice parsing result if the second voice information is parsed successfully, play second feedback information according to the second voice parsing result, and associate the second voice information with the second voice parsing result and store the second voice information in the first database.
Optionally, the parsing module 302 is further configured to determine whether the second voice information is chitchat information if the third voice information does not exist in the first database; play third feedback information if the second voice information is chitchat information, where the third feedback information is a chitchat response; and play fourth feedback information if the second voice information is not chitchat information, where the fourth feedback information is used to indicate that voice understanding has failed.
Optionally, the parsing module 302 is further configured to associate and store the second voice information with the first voice parsing result to a second database.
Optionally, the first database and the second database are constructed in real time based on user interaction logs.
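For orientation, the module layout of Fig. 3 could be mirrored in code roughly as follows; this class is a stub sketch under the assumptions above, not the apparatus's actual implementation:

```python
class VoiceProcessingDevice:
    """Module layout mirroring Fig. 3; the method bodies are stubs."""

    def obtain(self, guidance: str) -> str:
        """Obtaining module 301: acquire the first voice information
        after the guidance has been played."""
        return input(f'You can say "{guidance}"\nuser> ')

    def parse(self, first_text: str, guidance: str):
        """Parsing module 302: return the first voice analysis result
        when the first voice information matches the guidance."""
        if first_text.strip().lower() == guidance.lower():
            return {"guidance": guidance, "matched": True}
        return None

    def feed_back(self, result) -> None:
        """Feedback module 303: play the first feedback information."""
        print("OK, done." if result else "Sorry, I didn't understand that.")
```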
The voice processing device provided by the embodiment of the invention can execute the voice processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Reference is made to the description of any method embodiment of the invention for details not described in this embodiment.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 4 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 4, the electronic device 12 is in the form of a general purpose computing device. The components of the electronic device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. In the electronic device 12 of this embodiment, the display 24 is not provided as a separate body but is embedded in a mirror surface, and the display surface of the display 24 and the mirror surface are visually integrated when the display 24 is not displaying. Also, the electronic device 12 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 12, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running a program stored in the system memory 28, for example, to implement a voice processing method provided in an embodiment of the present invention, wherein after playing guidance information, first voice information input by a user is obtained, wherein the guidance information is generated based on second voice information input by the user, if the first voice information matches with the guidance information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result, and first feedback information is played according to the first voice analysis result.
The embodiment of the invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements a voice processing method as provided by all the embodiments of the invention, wherein after playing guide information, first voice information input by a user is obtained, and the guide information is generated based on second voice information input by the user; if the first voice information is matched with the guide information, a first voice analysis result is obtained according to the first voice information, wherein the second voice information is associated with the first voice analysis result, and first feedback information is played according to the first voice analysis result. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.