CN118230768A - Voice quality inspection method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN118230768A (CN202211641003.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- training
- role
- frame
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- 
        - G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
 
- 
        - G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
 
- 
        - G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
 
- 
        - G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
 
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Telephonic Communication Services (AREA)
Abstract
The application provides a voice quality inspection method and device, an electronic device, and a storage medium. The method first identifies the acoustic features of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtains the actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence. It then identifies the acoustic features of each frame in the actual dialogue audio through a voice endpoint detection model to obtain the effective dialogue audio, separates the roles in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and recognizes the call voice of each role through a voice recognition model to obtain the call text of each role. Finally, it performs role recognition on the call text of each role through a text classification model and a regular-expression matching model to obtain the target call text of the target role, and performs quality inspection on the target call text. The application improves the quality of the staff call text and the efficiency of obtaining it.
    Description
Technical Field
      The present application relates to the field of computer technologies, and in particular, to a method and apparatus for voice quality inspection, an electronic device, and a storage medium.
    Background
      Many business scenarios involve telephone communication between staff and customers. To assess the call quality of a staff member, the telephone recordings between the staff member and the customer must be preprocessed: a voice endpoint detection (VAD) model identifies the time points at which speech appears and disappears in a piece of audio, thereby determining the effective speech segments in the original dialogue audio; the effective speech segments are input into a voice recognition system for text recognition to obtain the dialogue text; role separation is then performed to obtain the call text of each role; and finally the call text of the staff member is selected for quality inspection.
      First, current VAD models mainly use one of two processing modes. One is short-time energy analysis, which judges whether audio is effective speech by computing its energy and comparing it with a set threshold; the other is a neural network model. The former requires little computation, but it can only reliably distinguish silence from non-silence and cannot separate high-energy noise from human voice within the non-silent audio, so its overall accuracy is not high. The latter is highly accurate but requires a large amount of computation, and there is still room for optimization in specific application scenarios. The processing modes of current VAD models therefore struggle to obtain effective speech with high accuracy, which affects the quality of the staff call text obtained later.
      Second, in actual voice quality inspection scenarios, a call recording usually also contains speech from before the call is connected. This speech is also detected as effective speech, but most of it consists of work-unrelated remarks by the staff member, which fall outside the scope of quality inspection. A VAD model cannot distinguish it from the effective speech after the call is connected, so useless speech ends up in the text, the quality of the staff call text is lowered, and the quality inspection result is negatively affected.
      Finally, existing voice recognition systems only divide the effective speech into N classes according to speaker role, whereas the actual quality inspection process is concerned only with what the staff member says. It is therefore still necessary to determine manually which class of effective speech belongs to the staff member, which increases the workload and lowers efficiency.
      The current voice quality inspection method therefore has two technical problems that need to be addressed: the quality of the staff call text is low, and manually identifying the staff member's text is inefficient.
    Disclosure of Invention
      The embodiment of the application provides a voice quality inspection method, a voice quality inspection device, electronic equipment and a storage medium, which are used for solving the technical problems that in the current voice quality inspection method, the quality of conversation text of staff is low and the efficiency of manually distinguishing the text of the staff is low.
      In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
       The application provides a voice quality inspection method, which comprises the following steps: 
       identifying acoustic features of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtaining the actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence; 
       identifying acoustic features of each frame in the actual dialogue audio through a voice endpoint detection model to obtain effective speech frames, and obtaining the effective dialogue audio according to the effective speech frame sequence; 
       separating each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and recognizing the call voice of each role through a voice recognition model to obtain the call text of each role; 
       performing role recognition on the call text of each role through a text classification model and a regular-expression matching model to obtain the target call text of the target role; 
       and performing quality inspection on the target call text. 
      Meanwhile, the embodiment of the application also provides a voice quality inspection device, which comprises:
       a first obtaining module, configured to identify acoustic features of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and to obtain the actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence; 
       a second obtaining module, configured to identify acoustic features of each frame in the actual dialogue audio through a voice endpoint detection model to obtain effective speech frames, and to obtain the effective dialogue audio according to the effective speech frame sequence; 
       a third obtaining module, configured to separate each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and to recognize the call voice of each role through a voice recognition model to obtain the call text of each role; 
       a fourth obtaining module, configured to perform role recognition on the call text of each role through a text classification model and a regular-expression matching model to obtain the target call text of the target role; 
       and a quality inspection module, configured to perform quality inspection on the target call text. 
      The application also provides an electronic device, which comprises a memory and a processor; the memory stores an application program, and the processor is configured to run the application program in the memory, so as to execute the steps in the voice quality inspection method described in any one of the above.
      An embodiment of the present application provides a computer readable storage medium, where a plurality of instructions are stored, where the instructions are adapted to be loaded by a processor to perform the steps in the above-mentioned voice quality inspection method.
      Beneficial effects: the application provides a voice quality inspection method and device, an electronic device, and a storage medium. By identifying the acoustic features of each frame in the original dialogue audio through a ringtone discrimination model, the method can accurately identify ringtone frames and non-ringtone frames and remove the ringtone portion from the original dialogue audio, keeping only the actual dialogue audio, so that speech from before the call is connected takes no part in the subsequent quality inspection. By identifying the acoustic features of each frame in the actual dialogue audio through a voice endpoint detection model, the method can accurately identify effective speech frames and thus obtain the effective dialogue audio. In other words, the voice quality inspection method improves both the quality of the staff call text and the efficiency of obtaining it.
    Drawings
      The technical solution and other advantageous effects of the present application will be made apparent by the following detailed description of the specific embodiments of the present application with reference to the accompanying drawings.
      Fig. 1 is an application scenario schematic diagram of a voice quality inspection method provided by an embodiment of the present application.
      Fig. 2 is a flow chart of a voice quality inspection method according to an embodiment of the present application.
      Fig. 3 is an overall architecture diagram of a voice quality inspection method according to an embodiment of the present application.
      Fig. 4 is a graph of a typical ring tone spectrum.
      Fig. 5 is a schematic diagram of a ring segment set according to an embodiment of the present application.
      Fig. 6 is a schematic diagram of a training process of a ringtone discrimination model in an embodiment of the present application.
      Fig. 7 is a schematic diagram of a first training process of the VAD model in an embodiment of the present application.
      Fig. 8 is a schematic illustration of labeling a fourth sound tag according to an embodiment of the present application.
      Fig. 9 is a schematic diagram of an active speech segment set in an embodiment of the present application.
      Fig. 10 is a schematic diagram of a second training process of the VAD model in an embodiment of the present application.
      Fig. 11 is a schematic workflow diagram of a speaker separation model in an embodiment of the present application.
      Fig. 12 is a schematic workflow diagram of a text classification model in an embodiment of the present application.
      Fig. 13 is a schematic diagram of a voice quality inspection device according to an embodiment of the present application.
      Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
    Detailed Description
      The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
      The embodiment of the application provides a voice quality inspection method, a voice quality inspection device, electronic equipment and a computer readable storage medium, wherein the voice quality inspection device can be integrated in the electronic equipment, and the electronic equipment can be a server or a terminal and other equipment.
      Fig. 1 is a schematic diagram of an application scenario of the voice quality inspection method provided by an embodiment of the present application. The scenario may include a local server and/or a remote server, in which a ringtone discrimination model, a voice endpoint detection model, a speaker separation model, a voice recognition model, a text classification model, and a regular-expression matching model are deployed, where:
       After receiving the original dialogue audio, the server first identifies the acoustic features of each frame in the original dialogue audio through the ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtains the actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence. It then identifies the acoustic features of each frame in the actual dialogue audio through the voice endpoint detection model to obtain effective speech frames, and obtains the effective dialogue audio according to the effective speech frame sequence. Next, it separates each role in the effective dialogue audio through the speaker separation model to obtain the call voice of each role, and recognizes the call voice of each role through the voice recognition model to obtain the call text of each role. Finally, it performs role recognition on the call text of each role through the text classification model and the regular-expression matching model to obtain the target call text of the target role, and provides the target call text to the relevant quality inspectors, who inspect it to assess whether the call voice of the target role meets the job requirements. 
      It should be noted that the system scenario shown in fig. 1 is only an example. The servers and scenarios described in the embodiments of the present application are intended to explain the technical solutions more clearly and do not limit them; those skilled in the art will appreciate that, as systems evolve and new service scenarios appear, the technical solutions provided by the embodiments of the present application apply equally to similar technical problems. Detailed descriptions follow. The descriptions of the embodiments below are not intended to limit the preferred embodiments.
      Referring to fig. 2, fig. 2 is a flow chart of a voice quality inspection method according to an embodiment of the present application, where the method specifically includes:
       S1: Identify acoustic features of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtain the actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence. 
      The original dialogue audio is a recording of the entire telephone conversation between a staff member and a customer. The conversation involves at least two roles, staff and customer, and each role may consist of one person or of two or more people (for example, when the customer uses the speakerphone and several customers talk with one staff member). The staff member may be anyone whose job requires telephone contact with customers, such as a human customer service agent, a courier, or a food delivery rider. Because the staff member's work requires communicating with customers by telephone, the company processes the original dialogue audio produced by each call to obtain the staff call text, performs quality inspection on the content of that text, and assesses the staff member's work according to the quality inspection result.
      As shown in fig. 3, each original dialogue audio goes through a preprocessing process, a voice recognition process, and a post-processing process to obtain the staff call text. Specifically, the preprocessing process uses a ringtone discrimination model, a voice endpoint detection model (VAD model) and a speaker separation model; the voice recognition process uses a voice recognition model; and the post-processing process uses a regular-expression matching model and a text classification model. The final staff call text is obtained after this series of processes. The entire flow in fig. 3 is described below step by step.
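      The flow in fig. 3 can be pictured as a chain of stages. The following is only a minimal sketch of that chaining, not the implementation of the application: the class name, the field names and the representation of audio as a list of samples are all hypothetical, and each stage is passed in as a plain callable standing in for the corresponding trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Audio = List[float]  # hypothetical representation: mono samples


@dataclass
class QualityInspectionPipeline:
    """Chains the stages named in the text; every field is a stand-in callable."""
    strip_ringtone: Callable[[Audio], Audio]               # ringtone discrimination
    detect_speech: Callable[[Audio], Audio]                # voice endpoint detection
    separate_speakers: Callable[[Audio], Dict[str, Audio]] # speaker separation
    transcribe: Callable[[Audio], str]                     # voice recognition
    pick_target_role: Callable[[Dict[str, str]], str]      # text classification + regex

    def run(self, original_audio: Audio) -> str:
        actual = self.strip_ringtone(original_audio)       # actual dialogue audio
        effective = self.detect_speech(actual)             # effective dialogue audio
        voices = self.separate_speakers(effective)         # call voice per role
        texts = {role: self.transcribe(v) for role, v in voices.items()}
        target = self.pick_target_role(texts)              # role to be inspected
        return texts[target]                               # target call text
```

      Keeping each stage behind a callable lets any one model be swapped or tested with a stub without touching the others.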
      Recording of the original dialogue audio starts when the staff member dials the number, and ringtone audio is produced before the call is connected. If the staff member grumbles during this period, that speech is also recorded. Because the customer cannot hear it, the final quality inspection only needs to cover the audio after the call is connected, and the audio before connection need not be recognized or inspected. In the preprocessing process, the original dialogue audio is first input into the ringtone discrimination model, which identifies the audio before connection and the audio after connection, removes the former, and retains only the actual dialogue audio after the call is connected.
      Specifically, the ringtone discrimination model may be a tdnn_lstm model or a tdnn_stats model. It identifies the acoustic features of each original dialogue frame in the original dialogue audio, that is, it maps the waveform of each frame to a multidimensional vector containing sound information. Because the acoustic features of ringtones and non-ringtones differ, each original dialogue frame can be judged to be a ringtone frame or a non-ringtone frame from its acoustic features. Several consecutive ringtone frames form a ringtone frame sequence, and several consecutive non-ringtone frames form a non-ringtone frame sequence.
      The ringtone may be continuous or intermittent. For a continuous ringtone, if the staff member makes no sound during ringing, one ringtone frame sequence and one non-ringtone frame sequence are formed; if the staff member makes sounds during ringing, several ringtone frame sequences alternate with several non-ringtone frame sequences. For an intermittent ringtone, several ringtone frame sequences alternate with several non-ringtone frame sequences whether or not the staff member makes a sound. In either case the ringtone must disappear once the call is connected, so the cut-off frame, that is, the latest of all ringtone frames, can be determined from the ringtone frame sequences. All original dialogue frames up to and including the cut-off frame are audio from before the call was connected, and all frames after it are audio from after the call was connected. The former is removed, and the latter is retained and output as the actual dialogue audio. Note that the actual dialogue audio may be long; because of limits on processing duration, it may be split and output as several actual dialogue segments, for example 10 seconds each, where each segment contains several consecutive actual dialogue frames.
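      The cut-off rule above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the per-frame ringtone decisions are assumed to be already available as booleans, and keeping the whole recording when no ringtone frame is found is an assumption the text does not address.

```python
from typing import List, Sequence


def strip_pre_connection(frames: Sequence, is_ring: Sequence[bool]) -> list:
    """Drop every frame up to and including the last ringtone frame."""
    if len(frames) != len(is_ring):
        raise ValueError("frames and is_ring must have the same length")
    cutoff = -1  # assumption: with no ringtone frame at all, keep everything
    for i, ring in enumerate(is_ring):
        if ring:
            cutoff = i  # ends up as the index of the latest ringtone frame
    return list(frames[cutoff + 1:])


def split_into_segments(frames: Sequence, frames_per_segment: int) -> List[list]:
    """Cut the actual dialogue audio into fixed-length segments (last may be shorter)."""
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    return [list(frames[i:i + frames_per_segment])
            for i in range(0, len(frames), frames_per_segment)]
```

      Non-ringtone frames that fall between two ringtone sequences (for example, the staff member talking while the phone rings) lie before the cut-off frame and are therefore discarded as well, which matches the rule described above.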
      Because the actual dialogue audio contains only the conversation after the call is connected, the influence of speech before connection is eliminated. Even if the staff member grumbled before the call was connected, this is not flagged as a quality problem, which safeguards the quality of the inspection. In addition, removing the audio before connection reduces the amount of data to be processed later, which saves computing resources and improves processing efficiency.
      As shown in fig. 4, a telephone ringtone usually has specific spectral characteristics. The left panel of fig. 4 is the spectrogram of a "beep" tone, and the right panel is the spectrogram of a pure-music ringtone. The related art includes methods that identify ringtones by their spectral characteristics, but they work well only for ringtones of the type in the left panel: because the spectral characteristics of such a ringtone are relatively fixed, the similarity between the actual spectrogram and a reference spectrogram can be compared directly to decide whether the audio is a ringtone. For ringtones of the type in the right panel, different pieces of music have very different spectra, and it is difficult to decide from spectral similarity alone whether the audio is a ringtone.
      In the embodiment of the application, the ringtone discrimination model judges whether each frame is a ringtone frame by identifying the acoustic features of that frame. Because the acoustic features of ringtones differ greatly from those of the various non-ringtone sounds, such as speech, coughs, sighs and silence, the judgment obtained in this way is accurate. The approach applies to any type of ringtone and is accurate to the frame, which ensures the integrity of the actual dialogue audio to the greatest extent.
      In one embodiment, before S1, the method further includes: acquiring a plurality of first training dialogue audios, wherein each first training dialogue audio comprises a ringtone training segment and a non-ringtone training segment; extracting acoustic features of each first training frame in each first training dialogue audio, and labeling a first sound tag for each first training frame, wherein the first sound tag is ringtone or non-ringtone; generating a first training data set according to the acoustic features and the first sound tag of each first training frame in each first training dialogue audio; and training the ringtone discrimination model based on the first training data set.
      The ringtone discrimination model must be trained before it is applied. For training, a number of existing original dialogue audios are first obtained. The ringtone segments (if any) in each original dialogue audio are cut out to obtain N ringtone segments, and the non-ringtone audio (including silence, noise, human voice and the like) in each original dialogue audio is cut out to obtain M non-ringtone segments. The N ringtone segments and the M non-ringtone segments are then traversed and combined pairwise to obtain N first training dialogue audios, so that each first training dialogue audio contains a ringtone training segment and a non-ringtone training segment.
      In addition, to make the model more robust, data augmentation may be applied to each first training dialogue audio, for example by adding additive noise and reverberation using the open-source dataset RIRS, so that the trained model can identify ringtone frames and non-ringtone frames in noisy and reverberant environments.
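      As an illustration of the additive-noise part of this augmentation, the sketch below mixes a noise clip into a training clip at a chosen signal-to-noise ratio. It is a generic technique shown under stated assumptions, not the procedure of the application: the function names and the choice of target SNR are hypothetical, the noise is simply looped to cover the clip, and reverberation (which would require convolving with a room impulse response) is left out.

```python
import math
from typing import List, Sequence


def mean_power(x: Sequence[float]) -> float:
    """Average squared amplitude of a clip."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add looped noise to speech so that 10*log10(P_speech / P_noise) == snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it is as long as the speech clip.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech, p_noise = mean_power(speech), mean_power(tiled)
    if p_noise == 0.0:
        return list(speech)  # silent noise clip: nothing to add
    # Scale factor that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]
```

      Augmenting each clip at several SNR values exposes the model to both mildly and heavily corrupted versions of the same labeled frames.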
      The acoustic features of each first training frame in each first training dialogue audio are then extracted, for example with a Linear Predictive Cepstral Coefficient (LPCC) or Mel-frequency cepstral coefficient (MFCC) algorithm, which maps the waveform of each first training frame to a multidimensional vector containing sound information, that is, the acoustic features. At the same time, a first sound tag is added to each first training frame through manual identification and labeling: a ringtone frame receives the ringtone tag, and a non-ringtone frame receives the non-ringtone tag. During manual identification, the ringtone segments in each first training dialogue audio are identified first, and the segments are then framed and labeled to obtain the first sound tag of each first training frame. As shown in fig. 5, all manually identified ringtone segments are first collected into a ringtone segment set. In this set, 000001_01 is one ringtone segment, with manually identified start and end times of 0 seconds and 16.04 seconds; 000002_01 is another ringtone segment, with start and end times of 0 seconds and 3.06 seconds; and so on. Each ringtone segment in the set can be framed and given ringtone tags automatically with a suitable automation tool, and all segments outside the set are treated as non-ringtone segments and are likewise framed and given non-ringtone tags automatically. This improves labeling efficiency.
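      The automatic framing and labeling step can be sketched as follows. This is a hedged illustration rather than the tool used in the application: the function name is hypothetical, the 10 ms frame shift is an assumed value, and the ringtone segment set is represented simply as a list of (start, end) times in seconds such as those shown in fig. 5.

```python
from typing import List, Sequence, Tuple


def label_frames(duration_s: float,
                 ring_spans: Sequence[Tuple[float, float]],
                 frame_shift_s: float = 0.01) -> List[str]:
    """Expand manually marked ringtone spans into one label per frame."""
    n_frames = round(duration_s / frame_shift_s)
    labels = ["non-ringtone"] * n_frames  # everything outside the set is non-ringtone
    for start_s, end_s in ring_spans:
        # Convert span boundaries to frame indices, clamped to the clip length.
        first = max(0, round(start_s / frame_shift_s))
        last = min(n_frames, round(end_s / frame_shift_s))
        for i in range(first, last):
            labels[i] = "ringtone"
    return labels
```

      For a 20-second clip with the span (0.0, 16.04), the first 1604 frames are tagged as ringtone and the remaining 396 as non-ringtone, so an annotator only marks two timestamps instead of two thousand frames.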
      As shown in fig. 6, after extraction and labeling, each acoustic feature corresponds to a definite first sound tag, forming a pair of acoustic feature (training input) and first sound tag (training output). A first training data set is generated from all such pairs, and the ringtone discrimination model is trained on it until the rate at which the model recognizes the first sound tag corresponding to each acoustic feature reaches the expected level. A ringtone discrimination model trained in this way achieves high ringtone discrimination accuracy in later use.
      S2: Identify acoustic features of each frame in the actual dialogue audio through the voice endpoint detection model to obtain effective speech frames, and obtain the effective dialogue audio according to the effective speech frame sequence.
      The voice endpoint detection model, that is, the VAD model, identifies the time points at which the type of sound required by the subsequent voice recognition stage appears and disappears, extracts the audio of that type according to the two time points, and removes the other types of sound. This screens the data effectively, reduces the subsequent processing workload, and improves recognition accuracy. The sound in the actual dialogue audio can be classified under two different schemes: one divides it into effective speech, ineffective speech and silence; the other divides it into effective speech and other sounds. Effective speech is the speech of the normal conversation of each role, such as "Where are you?" or "I will come to collect the parcel at half past ten"; ineffective speech is sound made by a role that is not normal conversation, such as a cough or a sigh; and other sounds are all sounds except effective speech, that is, ineffective speech and silence. In the embodiment of the application, the sound required by the subsequent voice recognition stage is effective speech.
      Specifically, the voice endpoint detection model may be a tdnn_lstm model or a tdnn_stats model. It identifies the acoustic features of each actual dialogue frame in the actual dialogue audio, that is, it maps the waveform of each actual dialogue frame to a multidimensional vector containing sound information. Because effective speech and other sounds (ineffective speech and silence) have different acoustic features, each actual dialogue frame can be judged to be effective speech or not from its acoustic features. Several consecutive effective speech frames form an effective speech frame sequence, and all effective speech frame sequences in the actual dialogue audio form the effective dialogue audio. Note that the effective dialogue audio may be long; because of limits on processing duration, it may be split and output as several effective dialogue segments, for example 8 seconds each, where each segment contains several consecutive effective speech frames.
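      Turning per-frame decisions into effective speech frame sequences and bounded segments can be sketched as below. The function names are hypothetical, the per-frame decisions are assumed to be given as booleans, and splitting each sequence separately (rather than first concatenating all effective audio) is an assumption made for this sketch.

```python
from typing import List, Sequence, Tuple


def effective_runs(is_speech: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of each run of speech frames."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i                      # a speech run begins (appearance point)
        elif not flag and start is not None:
            runs.append((start, i))        # the run ends (disappearance point)
            start = None
    if start is not None:
        runs.append((start, len(is_speech)))
    return runs


def cap_run_length(runs: Sequence[Tuple[int, int]],
                   max_frames: int) -> List[Tuple[int, int]]:
    """Split runs longer than max_frames into consecutive pieces."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    out: List[Tuple[int, int]] = []
    for start, end in runs:
        for s in range(start, end, max_frames):
            out.append((s, min(s + max_frames, end)))
    return out
```

      With a 10 ms frame shift (an assumed value), an 8-second cap corresponds to `max_frames=800`.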
      Current VAD models mainly use one of two processing modes. One is short-time energy analysis, which judges whether audio is effective speech by computing its energy and comparing it with a set threshold; the other is a neural network model. The former requires little computation, but it can only reliably distinguish silence from non-silence and cannot separate high-energy noise from human voice within the non-silent audio, so its overall accuracy is not high. The latter is highly accurate but requires a large amount of computation, and there is still room for optimization in specific application scenarios. The processing modes of current VAD models therefore struggle to obtain effective speech with high accuracy, which affects the quality of the staff call text obtained later.
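      For reference, the short-time energy baseline criticized above can be sketched as follows. This illustrates the prior-art approach, not the VAD model of the application; the function names, frame length, hop size and threshold are all assumed values chosen for the example.

```python
from typing import List, Sequence


def short_time_energy(samples: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Mean squared amplitude of each frame of frame_len samples, stepping by hop."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples: Sequence[float], frame_len: int = 400, hop: int = 160,
               threshold: float = 1e-3) -> List[bool]:
    """Flag a frame as 'speech' when its energy exceeds a fixed threshold."""
    return [e > threshold for e in short_time_energy(samples, frame_len, hop)]
```

      The decision depends on energy alone, so a loud cough or a burst of background noise is flagged exactly like speech, which is the weakness described above.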
      In the embodiment of the application, the VAD model judges whether each frame is an effective voice frame by identifying the acoustic features of that frame. Because the acoustic features of effective voice differ considerably from those of other sounds such as coughing, sighing and silence, the judgment obtained in this way is more accurate and is made frame by frame, which preserves the integrity of the effective dialogue audio as far as possible and helps improve the quality of the staff call text obtained later.
      In one embodiment, before S1, the method further comprises: constructing a decoding graph based on the trained acoustic model, the trained language model and the pronunciation dictionary; acquiring a plurality of second training dialogue audios, wherein each second training dialogue audio comprises an effective voice training segment, an ineffective voice training segment and a silence training segment; extracting acoustic features of each second training frame in each second training dialogue audio, and inputting the acoustic features into the decoding graph to obtain a second sound tag of each second training frame, wherein the second sound tag comprises effective voice, ineffective voice or silence; labeling a third sound tag of each second training frame, wherein the third sound tag comprises silence or non-silence, and correcting the second sound tag according to the third sound tag; generating a second training data set according to the acoustic features of each second training frame in each second training dialogue audio and the corrected second sound tag; and training the voice endpoint detection model based on the second training data set.
      As mentioned in the above embodiment, the sound in the actual dialogue audio may be classified in two ways; this embodiment takes classification into effective voice, ineffective voice and silence as an example. As shown in fig. 7, in this scenario a decoding graph is first constructed and trained. The decoding graph performs a coarse classification of the input acoustic features, that is, it maps the acoustic features to the text of a second sound tag, and during the mapping the best-path search is carried out uniformly by means of Weighted Finite State Transducers (WFST). Specifically, the decoding graph is composed of H.fst, C.fst, L.fst and G.fst, where H.fst maps acoustic features to triphones, C.fst maps triphones to phonemes, L.fst maps phonemes to words, and G.fst maps words to a grammar-constrained word sequence; they are the WFST forms of the acoustic model, the context-dependency model, the pronunciation dictionary and the language model, respectively. The acoustic model and the language model are first trained on original voice files and original annotated text so that each achieves its own mapping function, and H.fst, C.fst, L.fst and G.fst are then fused to obtain the trained decoding graph HCLG.fst. The decoding graph can be constructed and trained with the existing Kaldi framework.
      After the decoding graph is obtained, a plurality of second training dialogue audios are acquired, each containing an effective voice segment, an ineffective voice segment and a silent segment. The acoustic features of each second training frame in each second training dialogue audio are then extracted, for example with a Linear Prediction Cepstrum Coefficient (LPCC) or Mel cepstrum coefficient (MFCC) algorithm, so that the waveform of each second training frame is mapped to a multidimensional vector containing sound information, i.e. the acoustic features. Each acoustic feature is then input into the decoding graph obtained in the previous step, the best path from the acoustic feature to the text of the second sound tag is searched through the four WFSTs, and the second sound tag corresponding to each acoustic feature is output, the second sound tag being effective voice, ineffective voice or silence. After classification, each second training frame has a specific second sound tag.
      Meanwhile, a third sound tag is added to each second training frame by manual identification and labeling: a silence tag if the frame is a silent frame, and a non-silence tag if it is a non-silent frame. Similarly, during manual identification, the silent segments of each second training dialogue audio are identified first, and the segments are then divided into frames and labeled to obtain the third sound tag of each second training frame. All manually identified silent segments form a silent segment set. For each silent segment in this set, an automation tool can be used to divide it into frames and attach silence tags automatically; all segments outside the set are treated as non-silent segments and are likewise divided into frames and given non-silence tags automatically, which improves labeling efficiency.
      The second sound tag of each second training frame is then corrected using its third sound tag. For example, if the second sound tag that the decoding graph assigned to a second training frame is an ineffective voice tag while the manually identified third sound tag is a silence tag, the second sound tag is corrected to a silence tag. Because manual identification of silence is more reliable, correcting with the third sound tag makes the second sound tag corresponding to each acoustic feature more accurate.
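The correction step can be illustrated with a short sketch. The tag strings and the function name are hypothetical. The text only gives the case in which an ineffective voice tag is overridden by a manual silence tag; applying the manual silence tag to every frame, and leaving the decoder's tag unchanged when the manual tag is non-silence, are assumptions made here for illustration.

```python
from typing import List

EFFECTIVE, INEFFECTIVE, SILENCE = "effective", "ineffective", "silence"
NON_SILENCE = "non-silence"


def correct_second_tags(second_tags: List[str],
                        third_tags: List[str]) -> List[str]:
    """Correct decoder tags (second) with manual silence tags (third)."""
    if len(second_tags) != len(third_tags):
        raise ValueError("both tag lists must cover the same frames")
    corrected = []
    for second, third in zip(second_tags, third_tags):
        if third == SILENCE:
            # Manual silence labels are treated as more reliable.
            corrected.append(SILENCE)
        else:
            # Assumption: the text does not say how to resolve a
            # non-silence manual tag, so the decoder's tag is kept.
            corrected.append(second)
    return corrected
```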
      After extraction, labeling and correction, each acoustic feature corresponds to a definite corrected second sound tag, forming a pair of acoustic feature (training input) and corrected second sound tag (training output). A second training data set is then generated from all such pairs, and the VAD model is trained on the second training data set until the recognition rate of the second sound tag corresponding to each acoustic feature reaches the expected level. A VAD model trained in this way subsequently achieves higher effective voice recognition accuracy; experimental data show that the voice recognition accuracy can reach 92% and the recognition word error rate is only 22%.
      In one embodiment, before S1, further comprising: acquiring a plurality of third training dialogue audios, wherein each third training dialogue audio comprises an effective voice training segment and other voice training segments; extracting acoustic characteristics of each third training frame in each third training dialogue audio, and labeling a fourth sound label of each third training frame, wherein the fourth sound label comprises effective voice or other voices; generating a third training data set according to the acoustic characteristics of each third training frame and the fourth acoustic label in each third training dialogue audio; the speech endpoint detection model is trained based on the third training dataset.
      In this embodiment, a plurality of third training dialogue audios are first acquired, each containing effective voice segments and other-voice segments (including ineffective voice and silence). The acoustic features of each third training frame in each third training dialogue audio are then extracted, for example with a Linear Prediction Cepstrum Coefficient (LPCC) or Mel cepstrum coefficient (MFCC) algorithm, so that the waveform of each third training frame is mapped to a multidimensional vector containing sound information, i.e. the acoustic features.
      Meanwhile, a fourth sound tag is added to each third training frame by manual identification and labeling: an effective voice tag if the frame is manually identified as an effective voice frame, and an other-voice tag if it is manually identified as an other-voice frame. Similarly, during manual identification, the effective voice segments of each third training dialogue audio are identified first, and the segments are then divided into frames and labeled to obtain the fourth sound tag of each third training frame. As shown in fig. 8, segments defined as "effective" during labeling are set as effective voice segments and segments defined as "ineffective" are set as other voices, so that a two-class label is constructed. As shown in fig. 9, all effective voice segments can be identified on the basis of the two-class label and collected into an effective voice segment set. In that set, eng_2018_09_24_am_800s_nch_v2-00000000-00001080 is one effective voice segment whose manually identified start and end times are 0 seconds and 10.80 seconds, eng_2018_09_24_am_nch_v2-00002355-00003702 is another whose start and end times are 23.55 seconds and 37.02 seconds, and so on. For each effective voice segment in the set, an automation tool can be used to divide it into frames and attach effective voice tags automatically; all segments outside the set are treated as other-voice segments and are likewise divided into frames and given other-voice tags automatically, which improves labeling efficiency.
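The automatic framing and labeling described above might look like the sketch below. It assumes that the two trailing 8-digit fields of a segment identifier are start and end times in hundredths of a second, which is consistent with the two examples in the text (00001080 for 10.80 seconds, 00002355 for 23.55 seconds); the function names, tag strings and the 10 ms frame length are hypothetical.

```python
from typing import List, Tuple


def parse_segment_id(segment_id: str) -> Tuple[str, int, int]:
    """Split '<recording>-<start>-<end>' into its parts.

    Start and end are returned in centiseconds (assumed unit).
    """
    recording, start, end = segment_id.rsplit("-", 2)
    return recording, int(start), int(end)


def frame_tags(effective_segments: List[Tuple[int, int]],
               total_cs: int,
               frame_cs: int = 1) -> List[str]:
    """Tag each frame 'effective' if it lies in an effective voice
    segment and 'other' otherwise. frame_cs=1 means 10 ms frames."""
    n_frames = total_cs // frame_cs
    tags = ["other"] * n_frames
    for start_cs, end_cs in effective_segments:
        first = start_cs // frame_cs
        last = min(end_cs // frame_cs, n_frames)
        for i in range(first, last):
            tags[i] = "effective"
    return tags


ids = [
    "eng_2018_09_24_am_800s_nch_v2-00000000-00001080",
    "eng_2018_09_24_am_nch_v2-00002355-00003702",
]
segments = [parse_segment_id(s)[1:] for s in ids]
# 40-second recording length chosen only for this illustration.
tags = frame_tags(segments, total_cs=4000)
```

Everything outside the listed segments, here the interval from 10.80 to 23.55 seconds and everything after 37.02 seconds, receives the other-voice tag without any further manual work.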
      As shown in fig. 10, after extraction and labeling, each acoustic feature corresponds to a definite fourth sound tag, forming a pair of acoustic feature (training input) and fourth sound tag (training output). A third training data set can then be generated from all such pairs, and the VAD model is trained on the third training data set until the recognition rate of the fourth sound tag corresponding to each acoustic feature reaches the expected level. A VAD model trained in this way subsequently achieves higher effective voice recognition accuracy; experimental data show that the voice recognition accuracy can reach 97% and the recognition word error rate is only 20%. Compared with the three-tag classification of the previous embodiment, the two-tag classification of this embodiment is better suited to practical scenarios.
      In one embodiment, S2 specifically includes: identifying acoustic characteristics of each actual dialogue frame in the actual dialogue audio through a voice endpoint detection model to obtain initial voice tag probability of each actual dialogue frame, wherein the initial voice tag probability comprises initial effective voice probability and initial other voice probability; adjusting the initial sound tag probability of each actual dialogue frame based on a preset weight matrix to obtain target sound tag probability of each actual dialogue frame, wherein the target sound tag probability comprises target effective speech probability and target other speech probability; and comparing the target effective voice probability of each actual dialogue frame with the target other voice probabilities, and judging whether each actual dialogue frame is an effective voice frame or not according to the comparison result.
      When the voice endpoint detection model is a tdnn_lstm model, its output for each actual dialogue frame in the two-tag classification scenario above is a two-class label [p1, p2] with p1 + p2 = 1, where p1 is the initial effective voice probability, reflecting the probability that the frame is effective voice, and p2 is the initial other-voice probability, reflecting the probability that the frame is other voice. In actual application scenarios the environment is complex; for example, when the background is noisy or the voiceprint features of the voice are not distinct, effective voice is easily recognized as other voice such as silence or ineffective voice, and the output is not accurate enough.
      In the embodiment of the present application, the foregoing two-class labels are fine-tuned by presetting a weight matrix a, where a is a weight matrix of 2×2, which may be expressed as a= [ [ a00, a01], [ a10, a11] ], where a00 represents a probability that the frame is valid voice in the entire actual dialog audio, a01 represents a probability that valid voice exists in other voices, a10 represents a probability that other voices exist in the valid voice, and a11 represents a probability that the frame is other voices in the entire actual dialog audio. After fine tuning, let the target effective speech probability of each actual dialog frame be p1 'and the target other speech probabilities be p2', then the following formula is satisfied:
      p1’=a00*p1+a01*p2
      p2’=a10*p1+a11*p2
       The values in the preset weight matrix A are set according to how much the environment of the training data differs from the actual application environment. When the training data comes from conversations between human customer service agents and customers and the actual application is also conversations between human customer service agents and customers, the two environments are the same. In this case, if n effective voice segments and m other-voice segments took part in training, A = [[n/(n+m), 0], [0, m/(m+n)]], i.e. a01 and a10 are both 0, so p1' depends only on the initial effective voice probability p1 and on a00, the probability that a frame in the whole actual dialogue audio is effective voice, and p2' depends only on the initial other-voice probability p2 and on a11, the probability that a frame in the whole actual dialogue audio is other voice. When the training data comes from conversations between human customer service agents and customers but the actual application is conversations between couriers and customers, the two environments differ, because customer service agents work in a relatively quiet environment whereas couriers are often in a relatively noisy one. In this case a10 can be increased to a non-zero value, so that p2' depends not only on p2 and a11 but also on p1 and on a10, the probability that other voice exists within the effective voice. Compared with the case in which the training and application environments are the same, p2' is then larger, meaning the frame is more likely to be an other-voice frame: the frame may in fact be other voice that the voice endpoint detection model recognized as effective voice because the environment is noisy.
      Through the mode, each probability value of the preset weight matrix A is flexibly set according to the difference between the training environment and the actual application environment, so that the recognition accuracy of the effective voice frame can be further improved. After p1 'and p2' are obtained, the sizes of the two are compared, if p1 'is larger, the frame is judged to be a valid voice frame, and if p2' is larger, the frame is judged to be other voice frames.
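The adjustment and comparison can be written out directly from the two formulas. In this sketch the function names and the numeric values are illustrative only, and the handling of a tie (treated as other voice) is an assumption because the text does not cover it.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def same_environment_matrix(n: int, m: int) -> Matrix:
    """Weight matrix when training and application environments match:
    n effective voice segments and m other-voice segments in training."""
    return [[n / (n + m), 0.0], [0.0, m / (m + n)]]


def adjust(p1: float, p2: float, a: Matrix) -> Tuple[float, float]:
    """Apply p1' = a00*p1 + a01*p2 and p2' = a10*p1 + a11*p2."""
    (a00, a01), (a10, a11) = a
    return a00 * p1 + a01 * p2, a10 * p1 + a11 * p2


def is_effective_frame(p1: float, p2: float, a: Matrix) -> bool:
    """Compare the target probabilities. A tie counts as other voice
    (assumption)."""
    t1, t2 = adjust(p1, p2, a)
    return t1 > t2


# Same environment: 3 effective and 1 other-voice training segments.
quiet = same_environment_matrix(3, 1)      # [[0.75, 0.0], [0.0, 0.25]]
# Noisier application environment: a10 raised above 0 (illustrative value).
noisy = [[0.75, 0.0], [0.6, 0.25]]

decision_quiet = is_effective_frame(0.5, 0.5, quiet)   # 0.375 vs 0.125
decision_noisy = is_effective_frame(0.5, 0.5, noisy)   # 0.375 vs 0.425
```

The same ambiguous frame is accepted as effective voice under the matched-environment matrix and rejected once a10 is raised, which is the effect described for the noisy scenario.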
      S3: separating each role in the effective dialogue audio through a speaker separation model to obtain the talking voice of each role, and identifying the talking voice of each role through a voice identification model to obtain the talking text of each role.
      The speaker separation model uses an i-vector model and a PLDA model and separates roles on the principle of voiceprint segmentation and clustering. As shown in fig. 11, after the effective dialogue audio is obtained it is input into the speaker separation model; voiceprint features are extracted by the i-vector model, scores are computed by the PLDA model, and the call voice of each role is clustered according to the scores. The number of clusters X can be set according to the total number of roles; for a scene containing only one staff member and one client, X can be set to 2, and the call voice of role 1 and the call voice of role 2 are obtained after clustering.
      Through the processing of the speaker separation model, the call voice of each role is obtained, and then various call voices are respectively input into the voice recognition model for voice recognition, so as to obtain the call text of each role.
      S4: and performing role recognition on the call text of each role through the text classification model and the regular matching model to obtain a target call text of the target role.
      The above steps separate the call texts by role, but it is still not known which role is the staff member and which is the client. The text classification model and the regular matching model are therefore combined to perform role recognition on the call texts and determine the target call text of the target role, i.e. the staff member. The regular matching model matches the input text against preset regular expressions, judges whether the input text contains characters or character strings identical to those in the preset regular expressions, and classifies the input text according to whether it does. The text classification model extracts and clusters features of the input text, groups words or sentences with high feature similarity into one class, and outputs the specific class. Conversations between staff and clients usually contain both standard scripted utterances and non-standard utterances. Given the different working principles of the two models, the scripted utterances can be handled by the regular matching model, which determines whether each scripted sentence in each role's call text belongs to the staff or to the client, while the non-standard utterances are clustered by the text classification model, which determines whether each non-standard sentence in each role's call text belongs to the staff or to the client. The two results are then combined to determine which call text is the target call text of the staff member.
      In one embodiment, S4 specifically includes: respectively carrying out regular matching on each call text and preset keywords of all roles through a regular matching model to obtain a role matching result of each call text, and scoring according to the role matching result and a first scoring rule to obtain a first score of each call text hitting each role; performing role classification on each call text through a text classification model to obtain a role classification result of each call text, and scoring according to the role classification result and a second scoring rule to obtain a second score of each call text hitting each role; and according to the first score and the second score, obtaining the comprehensive score of each call text hit in each role, and determining the call text with the highest comprehensive score of the hit target role as the target call text of the target role.
      Each role has its own standard scripts: for staff, for example, "Hello, I am customer service xx", "happy to serve you" and "wish you a happy life"; for clients, for example, "no need", "not interested", "no time" and "not considering it for now". Preset keywords key_word_seat for the staff and preset keywords key_word_cust for the clients can be constructed from these standard scripts. For the call text of role 1 obtained in the above steps, each sentence is matched against the two sets of preset keywords by the regular matching model to obtain a role matching result for each sentence. A first scoring rule is then obtained, for example +3 for a sentence that hits the staff; the scores of all sentences in the call text are added up to give the first score of the call text of role 1 for hitting each role, for example 9 for hitting the staff and -3 for hitting the client. The same series of operations is performed on the call text of role 2 to obtain its first score for hitting each role, for example 0 for hitting the staff and -6 for hitting the client.
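A minimal sketch of the regular matching and first scoring step follows. The keyword patterns are hypothetical English stand-ins for key_word_seat and key_word_cust, and the example sentences are invented for illustration. The +3 for a staff hit comes from the text; the -3 for a client hit is an assumption chosen so that the totals match the example scores quoted above.

```python
import re
from typing import Dict, List

# Hypothetical keyword patterns standing in for key_word_seat / key_word_cust.
KEY_WORD_SEAT = re.compile(r"customer service|happy to serve you", re.IGNORECASE)
KEY_WORD_CUST = re.compile(r"no need|not interested|no time", re.IGNORECASE)


def first_score(sentences: List[str],
                staff_points: int = 3,       # from the text
                client_points: int = -3      # assumed
                ) -> Dict[str, int]:
    """Sum the per-sentence scores of one call text for each role."""
    staff = 0
    client = 0
    for sentence in sentences:
        if KEY_WORD_SEAT.search(sentence):
            staff += staff_points
        if KEY_WORD_CUST.search(sentence):
            client += client_points
    return {"staff": staff, "client": client}


role_1 = ["Hello, I am customer service Li", "Happy to serve you",
          "This is customer service again", "No need"]
role_2 = ["Not interested", "I have no time", "OK"]

score_1 = first_score(role_1)   # {'staff': 9, 'client': -3}
score_2 = first_score(role_2)   # {'staff': 0, 'client': -6}
```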
      For non-standard utterances it is difficult to judge directly which role each sentence belongs to, so the text classification model is needed. It extracts a feature vector for each sentence in the call text and groups feature vectors that are close to one another in space into one class; for a scene containing only one staff member and one client, two classes are obtained after clustering. The text classification model finally outputs whether each sentence was spoken by the staff or by the client, i.e. whether it hits the staff or the client, which gives the role classification result of each sentence in the call text. A second scoring rule is also obtained; to reflect the different credibility of the two results, the scoring mechanisms of the first and second scoring rules may differ, for example +1 for a sentence that hits the staff and -1 for a sentence that hits the client. The call text of role 1 is input into the text classification model. If a sentence such as "I will come to pick up the parcel later" hits the staff after processing by the text classification model, it is scored +1; if a sentence such as "what time of day can it be delivered to the door" hits the client, it is scored -1. The scores of all sentences in the call text are added up to give the second score of the call text of role 1 for hitting each role, for example 10 for hitting the staff and -1 for hitting the client. The same series of operations is performed on the call text of role 2 to obtain its second score for hitting each role, for example 2 for hitting the staff and -8 for hitting the client.
      Finally, for the call text of role 1, the first and second scores are combined to give a comprehensive score of 19 for hitting the staff and -4 for hitting the client; for the call text of role 2, they are combined to give a comprehensive score of 2 for hitting the staff and -14 for hitting the client. The call text of role 1 has the highest comprehensive score for hitting the staff, indicating that role 1 is the staff member, and the call text of role 2 scores most strongly for hitting the client, indicating that role 2 is the client. Through the above process, the call text of role 1 is finally determined to be the target call text.
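The worked example can be checked with a few lines of arithmetic. The text does not state how the two scores are combined; a plain sum is assumed here because it reproduces the comprehensive scores of 19, -4, 2 and -14, taking the second client score of role 1 as -1, the value consistent with the comprehensive score of -4. The dictionary layout and names are illustrative.

```python
from typing import Dict

Scores = Dict[str, int]


def combine(first: Scores, second: Scores) -> Scores:
    """Comprehensive score per hit role, assumed to be a plain sum."""
    return {role: first[role] + second[role] for role in first}


first_scores = {
    "role 1": {"staff": 9, "client": -3},
    "role 2": {"staff": 0, "client": -6},
}
second_scores = {
    "role 1": {"staff": 10, "client": -1},
    "role 2": {"staff": 2, "client": -8},
}

totals = {speaker: combine(first_scores[speaker], second_scores[speaker])
          for speaker in first_scores}

# The target role is the staff member: pick the call text with the
# highest comprehensive score for hitting the staff.
target = max(totals, key=lambda speaker: totals[speaker]["staff"])
```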
      In one embodiment, the step of performing role classification on each call text through the text classification model to obtain a role classification result of each call text includes: respectively carrying out information classification on each call text through an information classification model to obtain useful texts containing useful information and useless texts not containing useful information in each call text; and performing role classification on each useful text through the role classification model to obtain a role classification result of each useful text.
      As shown in fig. 12, the text classification model includes an information classification model and a role classification model. Taking the call text of role 1 as the input text, the information classification model judges, through feature extraction and text clustering, whether each sentence in the input text contains useful information, where useful information is information that points to a single role identity. For example, "I will come to pick up the parcel later" can only have been spoken by the staff, so the sentence contains useful information; a sentence that could have been spoken either by the staff or by the client does not contain useful information, and performing role recognition on such a sentence would reduce accuracy. If the information classification model recognizes that a sentence contains useful information, it outputs 1 and the sentence is passed on as useful text to the role classification model; if the sentence does not contain useful information, it outputs 0 and the sentence is discarded as useless text. The role classification model then identifies, through feature extraction and text clustering, whether each sentence hits the client or the staff, and outputs a classification result of 0 or 1.
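The two-stage flow can be sketched as a small pipeline that takes the two models as callables. The trained models themselves are not reproduced: the toy functions below are hypothetical stand-ins, and mapping the role output 1 to the staff and 0 to the client is an assumption, since the text only says that the result is 0 or 1.

```python
from typing import Callable, List, Tuple


def classify_useful_sentences(
    sentences: List[str],
    has_useful_info: Callable[[str], int],   # information classification model
    classify_role: Callable[[str], int],     # role classification model
) -> List[Tuple[str, int]]:
    """Keep sentences flagged 1 by the first model and classify their
    role with the second; sentences flagged 0 are discarded."""
    results = []
    for sentence in sentences:
        if has_useful_info(sentence) == 1:
            results.append((sentence, classify_role(sentence)))
    return results


# Toy stand-ins for the trained models (illustration only).
def toy_info_model(sentence: str) -> int:
    return 1 if "parcel" in sentence else 0


def toy_role_model(sentence: str) -> int:
    # Assumed convention: 1 = staff, 0 = client.
    return 1 if sentence.startswith("I will") else 0


call_text = ["I will pick up the parcel later", "OK",
             "When will my parcel arrive"]
result = classify_useful_sentences(call_text, toy_info_model, toy_role_model)
```

Passing the models in as callables keeps the filtering logic independent of how each model is trained or served.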
      The information classification model and the character classification model are trained by adopting similar data, specifically, a batch of texts are used as first original training data, useful texts and useless texts are used as information classification labels to mark the first original training data, and then the information classification model is trained based on the marked first original training data until the accuracy of the information classification of the input texts by the model reaches expectations. And then, taking the useful texts as second original training data, marking the second original training data by taking staff and clients as character classification labels, and training a character classification model based on the marked second original training data until the accuracy of the model in classifying the characters of the input texts reaches the expectation.
      The text classification model obtained by the model training mode firstly identifies the useful text, then classifies the useful text by characters, and can further improve the accuracy compared with a mode of directly classifying the call text of each character.
      S5: and performing quality inspection on the target call text.
      After the target call text is obtained, its content can be inspected to determine whether the wording used by the staff member complies with the company's work specifications.
      In the voice quality inspection method of the embodiment of the application, the ringtone discrimination model identifies the acoustic features of each frame in the original dialogue audio, so that ringtone frames and non-ringtone frames are identified accurately. The audio of the ringtone part is then removed from the original dialogue audio and only the actual dialogue audio is kept, so that the ringtone preceding the dialogue does not take part in subsequent quality inspection. The voice endpoint detection model identifies the acoustic features of each frame in the actual dialogue audio, so that effective voice frames are identified accurately and the effective dialogue audio is obtained. Through acoustic feature recognition by these models, the effective dialogue audio of the actual dialogue can thus be obtained in a simpler and more accurate way, which improves the quality of the call text. Role recognition is performed by combining the text classification model and the regular matching model, so that the target call text of the target role is obtained automatically without manual classification, and combining the two models gives higher classification accuracy. That is, the voice quality inspection method improves both the quality of the staff call text and the efficiency with which it is obtained.
      The method according to the above embodiment will be further described in terms of a voice quality inspection device, referring to fig. 13, the voice quality inspection device may include:
       A first obtaining module 10, configured to identify acoustic features of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and to obtain the actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence; 
       A second obtaining module 20, configured to identify acoustic features of each frame in the actual dialogue audio through a voice endpoint detection model to obtain effective voice frames, and to obtain the effective dialogue audio according to the effective voice frame sequence; 
       A third obtaining module 30, configured to separate each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and to recognize the call voice of each role through a voice recognition model to obtain the call text of each role; 
       A fourth obtaining module 40, configured to perform role recognition on the call text of each role through the text classification model and the regular matching model to obtain the target call text of the target role; 
       A quality inspection module 50, configured to perform quality inspection on the target call text. 
      In one embodiment, the voice quality inspection apparatus further comprises:
       A first acquisition module, configured to acquire a plurality of first training dialogue audios, each first training dialogue audio comprising a ringtone training segment and a non-ringtone training segment; 
       A first extraction module, configured to extract acoustic features of each first training frame in each first training dialogue audio and to label a first sound tag of each first training frame, wherein the first sound tag comprises ringtone or non-ringtone; 
       A first generation module, configured to generate a first training data set according to the acoustic features and the first sound tag of each first training frame in each first training dialogue audio; 
       A first training module, configured to train the ringtone discrimination model based on the first training data set. 
      In one embodiment, the voice quality inspection apparatus further comprises:
       The construction module is used for constructing a decoding graph based on the trained acoustic model, the trained language model and the pronunciation dictionary; 
       The second acquisition module is used for acquiring a plurality of second training dialogue audios, and each second training dialogue audio comprises an effective voice training segment, an ineffective voice training segment and a mute training segment; 
       The second extraction module is used for extracting the acoustic characteristics of each second training frame in each second training dialogue audio, inputting the acoustic characteristics into the decoding graph and obtaining second sound labels of each second training frame, wherein the second sound labels comprise effective voice, ineffective voice or silence; 
       the first labeling module is used for labeling third sound labels of the second training frames, wherein the third sound labels comprise silence or non-silence, and the second sound labels are corrected according to the third sound labels; 
       The second generation module is used for generating a second training data set according to the acoustic characteristics of each second training frame in each second training dialogue audio and the corrected second sound label; 
       And the second training module is used for training the voice endpoint detection model based on the second training data set. 
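The correction of the second sound labels by the third sound labels can be sketched as follows. The application does not state the correction rule, so the rule used here is an assumption: a frame annotated as silence is forced to silence, and a frame annotated as non-silence that the decoding graph labelled as silence falls back to a configurable label. The label strings and the function name `correct_labels` are likewise hypothetical.

```python
from typing import List

EFFECTIVE, INEFFECTIVE, SILENCE = "effective_voice", "ineffective_voice", "silence"
NON_SILENCE = "non_silence"


def correct_labels(second: List[str], third: List[str],
                   fallback: str = INEFFECTIVE) -> List[str]:
    """Correct decoder labels (second) using silence/non-silence annotations (third)."""
    if len(second) != len(third):
        raise ValueError("both label sequences need one entry per frame")
    corrected = []
    for decoded, annotated in zip(second, third):
        if annotated == SILENCE:
            # Assumed rule: the annotation wins whenever it says silence.
            corrected.append(SILENCE)
        elif decoded == SILENCE:
            # Annotated as non-silence but decoded as silence: use the fallback.
            corrected.append(fallback)
        else:
            corrected.append(decoded)
    return corrected


second_labels = [EFFECTIVE, SILENCE, INEFFECTIVE, EFFECTIVE]
third_labels = [SILENCE, NON_SILENCE, NON_SILENCE, NON_SILENCE]
corrected = correct_labels(second_labels, third_labels)
```

The corrected labels, together with the per-frame acoustic features, would then form the second training data set.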
      In one embodiment, the voice quality inspection apparatus further comprises:
       The third acquisition module is used for acquiring a plurality of third training dialogue audios, and each third training dialogue audio comprises an effective voice training segment and other voice training segments; 
       The third extraction module is used for extracting acoustic characteristics of each third training frame in each third training dialogue audio and labeling a fourth sound label of each third training frame, wherein the fourth sound label comprises effective voice or other voices; 
       the third generation module is used for generating a third training data set according to the acoustic characteristics and the fourth sound label of each third training frame in each third training dialogue audio; 
       and the third training module is used for training the voice endpoint detection model based on the third training data set. 
      In one embodiment, the second deriving module 20 comprises:
       The recognition sub-module is used for recognizing acoustic characteristics of each actual dialogue frame in the actual dialogue audio through a voice endpoint detection model to obtain an initial sound label probability of each actual dialogue frame, wherein the initial sound label probability comprises an initial effective voice probability and an initial other voice probability; 
       The adjusting sub-module is used for adjusting the initial sound label probability of each actual dialogue frame based on a preset weight matrix to obtain a target sound label probability of each actual dialogue frame, wherein the target sound label probability comprises a target effective voice probability and a target other voice probability; 
       And the judging sub-module is used for comparing the target effective voice probability of each actual dialogue frame with the target other voice probability, and judging whether each actual dialogue frame is an effective voice frame according to the comparison result. 
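A minimal sketch of the adjust-then-compare step is given below. It assumes the preset weight matrix is a 2x2 matrix applied to the probability pair (effective voice, other voice) by ordinary matrix-vector multiplication; the application does not state the form of the matrix or how it is applied, and the name `decide_effective_frames` and the example weights are hypothetical.

```python
from typing import List, Sequence, Tuple


def decide_effective_frames(frame_probs: Sequence[Tuple[float, float]],
                            weights: Sequence[Sequence[float]]) -> List[bool]:
    """Return True for each frame whose adjusted effective-voice score is larger.

    frame_probs holds (initial effective voice probability,
    initial other voice probability) for each actual dialogue frame.
    """
    decisions = []
    for p_eff, p_other in frame_probs:
        # Target probabilities: weight matrix times the initial probability pair.
        t_eff = weights[0][0] * p_eff + weights[0][1] * p_other
        t_other = weights[1][0] * p_eff + weights[1][1] * p_other
        decisions.append(t_eff > t_other)
    return decisions


# Example weights that favour effective voice by scaling its probability by 1.5.
example_weights = [[1.5, 0.0],
                   [0.0, 1.0]]
decisions = decide_effective_frames([(0.45, 0.55), (0.2, 0.8), (0.7, 0.3)],
                                    example_weights)
```

With these example weights, a borderline frame (0.45 against 0.55) is kept as effective voice, which shows how such a matrix could trade missed speech against false alarms.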
      In one embodiment, the fourth deriving module 40 comprises:
       The matching sub-module is used for respectively carrying out regular matching on each call text and preset keywords of all roles through a regular matching model to obtain a role matching result of each call text, and scoring according to the role matching result and a first scoring rule to obtain a first score of each call text hitting each role; 
       The classification sub-module is used for performing role classification on each call text through a text classification model to obtain a role classification result of each call text, and scoring according to the role classification result and a second scoring rule to obtain a second score of each call text hitting each role; 
       And the comprehensive sub-module is used for obtaining the comprehensive score of each call text hitting each role according to the first score and the second score, and determining the call text with the highest comprehensive score for the target role as the target call text of the target role. 
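The combination of regular matching and text classification can be sketched as below. The keyword patterns, the one-point-per-hit first scoring rule, the use of classifier probabilities as the second score, and the simple sum used as the comprehensive score are all assumptions for illustration; the application leaves the scoring rules open. The classifier is passed in as a callable so that no particular model is implied.

```python
import re
from typing import Callable, Dict, List

# Hypothetical preset keywords per role, written as regular expressions.
ROLE_PATTERNS: Dict[str, List[str]] = {
    "agent": [r"how (can|may) I help", r"thank you for calling"],
    "customer": [r"I('d| would) like to", r"my order"],
}


def regex_scores(text: str, points_per_hit: float = 1.0) -> Dict[str, float]:
    """First score: points for every preset pattern of a role found in the text."""
    return {
        role: points_per_hit * sum(
            1 for pattern in patterns if re.search(pattern, text, re.IGNORECASE)
        )
        for role, patterns in ROLE_PATTERNS.items()
    }


def pick_target_text(call_texts: Dict[str, str], target_role: str,
                     classify: Callable[[str], Dict[str, float]]) -> str:
    """Return the speaker id whose call text scores highest for target_role."""
    best_id, best_score = "", float("-inf")
    for speaker_id, text in call_texts.items():
        first = regex_scores(text)[target_role]
        second = classify(text).get(target_role, 0.0)  # second score from the classifier
        combined = first + second  # assumed combination rule: plain sum
        if combined > best_score:
            best_id, best_score = speaker_id, combined
    return best_id


def uniform_classifier(text: str) -> Dict[str, float]:
    """Stand-in for a trained text classification model."""
    return {"agent": 0.5, "customer": 0.5}


texts = {
    "spk0": "Thank you for calling, how may I help you?",
    "spk1": "I would like to check my order.",
}
agent_speaker = pick_target_text(texts, "agent", uniform_classifier)
```

Keeping the two scores separate until the final sum makes it easy to reweight them if one source proves more reliable than the other.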
      In one embodiment, the text classification model includes an information classification model and a role classification model, and the classification submodule includes:
       the first classification unit is used for classifying information of each call text through the information classification model to obtain useful texts containing useful information and useless texts not containing useful information in each call text; 
       And the second classification unit is used for classifying the roles of each useful text through the role classification model respectively to obtain a role classification result of each useful text. 
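The two-stage arrangement, in which texts without useful information are filtered out before role classification, might look like the following sketch. Both models are represented by placeholder callables, and the toy rules used in the example are not taken from the application.

```python
from typing import Callable, List, Sequence, Tuple


def classify_useful_texts(texts: Sequence[str],
                          is_useful: Callable[[str], bool],
                          role_of: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run role classification only on texts the information classifier keeps."""
    useful_texts = [text for text in texts if is_useful(text)]
    return [(text, role_of(text)) for text in useful_texts]


# Toy stand-ins for the information and role classification models.
def toy_is_useful(text: str) -> bool:
    return len(text.split()) >= 3


def toy_role_of(text: str) -> str:
    return "agent" if "help" in text.lower() else "customer"


results = classify_useful_texts(
    ["uh huh", "How may I help you today", "I want a refund please"],
    toy_is_useful,
    toy_role_of,
)
```

Filtering first keeps short fillers such as "uh huh" from influencing the role classification result.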
      Compared with the prior art, the voice quality inspection device provided by the application identifies the acoustic characteristics of each frame in the original dialogue audio through the ringtone discrimination model, removes the ringtone portion from the original dialogue audio, and keeps only the actual dialogue audio, so that audio up to the end of the ringtone does not take part in subsequent quality inspection. The device then identifies the acoustic characteristics of each frame in the actual dialogue audio through the voice endpoint detection model, so that effective voice frames are accurately identified and the effective dialogue audio is obtained. Model-based recognition of acoustic characteristics thus yields the effective dialogue audio of the actual conversation in a simpler and more accurate way, which improves the quality of the call text. In addition, the target call text of the target role is obtained automatically by combining the text classification model with the regular matching model, so no manual classification is needed, and combining the two models gives higher accuracy. That is, the voice quality inspection device improves both the quality of the staff's call text and the efficiency of acquiring it.
      Accordingly, an embodiment of the present application further provides an electronic device, as shown in fig. 14, where the electronic device may include a Radio Frequency (RF) circuit 101, a memory 102 including one or more computer readable storage media, an input unit 103, a display unit 104, a sensor 105, an audio circuit 106, a WiFi module 107, a processor 108 including one or more processing cores, and a power supply 109. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 14 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:
       The radio frequency circuit 101 may be used for receiving and transmitting signals while sending and receiving information or during communication; in particular, after downlink information is received from a base station, it is handed to one or more processors 108 for processing, and uplink data is transmitted to the base station. The memory 102 may be used to store software programs and modules; the processor 108 performs various functional applications and voice quality inspection by running the software programs and modules stored in the memory 102. The input unit 103 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. 
      The display unit 104 may be used to display information entered by a user or provided to a user, as well as the various graphical user interfaces of the server, which may be composed of graphics, text, icons, video, and any combination thereof.
      The electronic device may also include at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. The audio circuit 106 includes a speaker and may provide an audio interface between the user and the electronic device.
      WiFi is a short-range wireless transmission technology. Through the WiFi module 107, the electronic device can help users send and receive emails, browse webpages, access streaming media and the like, providing users with wireless broadband Internet access. Although fig. 14 shows the WiFi module 107, it is understood that the module is not an essential part of the electronic device and can be omitted as required without changing the essence of the application.
      The processor 108 is a control center of the electronic device that uses various interfaces and lines to connect the various parts of the overall handset, performing various functions of the electronic device and processing the data by running or executing software programs and/or modules stored in the memory 102, and invoking data stored in the memory 102, thereby performing overall monitoring of the handset.
      The electronic device further comprises a power supply 109 (e.g. a battery) for powering the various components, which may preferably be logically connected to the processor 108 via a power management system, whereby charging, discharging, and power consumption management functions are performed by the power management system.
      Although not shown, the electronic device may further include a camera, a bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the processor 108 in the server loads executable files corresponding to the processes of one or more application programs into the memory 102 according to the following instructions, and the processor 108 executes the application programs stored in the memory 102, so as to implement the following functions:
       identifying acoustic characteristics of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtaining actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence; 
       identifying acoustic characteristics of each frame in the actual dialogue audio through a voice endpoint detection model to obtain an effective voice frame, and obtaining the effective dialogue audio according to an effective voice frame sequence; 
       Separating each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and identifying the call voice of each role through a voice identification model to obtain the call text of each role; 
       Performing role recognition on the call text of each role through the text classification model and the regular matching model to obtain a target call text of the target role; 
       And performing quality inspection on the target call text. 
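The first two functions above reduce to frame selection once the per-frame decisions are available. The sketch below assumes the cut-off frame is the last frame classified as ringtone and that the actual dialogue audio is everything after it; the application does not define the cut-off frame more precisely, and the function names are hypothetical. Frames are represented by integers for brevity.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def strip_ringtone(frames: Sequence[Frame], is_ring: Sequence[bool]) -> List[Frame]:
    """Drop everything up to and including the last ringtone frame."""
    if len(frames) != len(is_ring):
        raise ValueError("need one ringtone decision per frame")
    cutoff = -1  # stays -1 when no ringtone frame is found, so nothing is dropped
    for index, ring in enumerate(is_ring):
        if ring:
            cutoff = index
    return list(frames[cutoff + 1:])


def keep_effective(frames: Sequence[Frame], is_effective: Sequence[bool]) -> List[Frame]:
    """Keep only the frames judged to be effective voice."""
    if len(frames) != len(is_effective):
        raise ValueError("need one voice decision per frame")
    return [frame for frame, keep in zip(frames, is_effective) if keep]


original = list(range(8))
ring_flags = [False, True, True, False, True, False, False, False]
actual_dialogue = strip_ringtone(original, ring_flags)
effective_dialogue = keep_effective(actual_dialogue, [True, False, True])
```

Taking the last ringtone frame as the cut-off handles intermittent ringback tones, at the cost of discarding more audio if a later frame is misclassified as ringtone.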
      The descriptions of the foregoing embodiments each have their own emphasis; for portions not described in detail in one embodiment, refer to the detailed descriptions above, which are not repeated here.
      Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
      To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the following functions:
       identifying acoustic characteristics of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtaining actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence; 
       identifying acoustic characteristics of each frame in the actual dialogue audio through a voice endpoint detection model to obtain an effective voice frame, and obtaining the effective dialogue audio according to an effective voice frame sequence; 
       Separating each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and identifying the call voice of each role through a voice identification model to obtain the call text of each role; 
       Performing role recognition on the call text of each role through the text classification model and the regular matching model to obtain a target call text of the target role; 
       And performing quality inspection on the target call text. 
      The foregoing describes in detail a voice quality inspection method, apparatus, electronic device and computer readable storage medium provided by the embodiments of the present application, and specific examples are applied to illustrate the principles and embodiments of the present application, where the foregoing examples are only used to help understand the technical solution and core idea of the present application; those of ordinary skill in the art will appreciate that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
    Claims (10)
1. A voice quality inspection method, comprising:
       identifying acoustic characteristics of each frame in the original dialogue audio through a ringtone discrimination model to obtain ringtone frames and non-ringtone frames, and obtaining actual dialogue audio according to the cut-off frame of the ringtone frame sequence and the non-ringtone frame sequence; 
       identifying acoustic characteristics of each frame in the actual dialogue audio through a voice endpoint detection model to obtain an effective voice frame, and obtaining the effective dialogue audio according to an effective voice frame sequence; 
       Separating each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and identifying the call voice of each role through a voice identification model to obtain the call text of each role; 
       Performing role recognition on the call text of each role through the text classification model and the regular matching model to obtain a target call text of the target role; 
       And performing quality inspection on the target call text. 
    2. The method of claim 1, further comprising, prior to the step of identifying acoustic features of frames in the original dialog audio by a ringtone discrimination model:
       acquiring a plurality of first training dialogue audios, wherein each first training dialogue audio comprises a ringtone training fragment and a non-ringtone training fragment; 
       Extracting acoustic characteristics of each first training frame in each first training dialogue audio, and labeling a first sound label of each first training frame, wherein the first sound label comprises a ringtone or a non-ringtone; 
       generating a first training data set according to the acoustic characteristics of each first training frame and the first sound tag in each first training dialogue audio; 
       training the ringtone discrimination model based on the first training data set. 
    3. The method of claim 1, further comprising, prior to the step of identifying acoustic features of frames in the actual dialog audio by a speech endpoint detection model:
       Constructing a decoding graph based on the trained acoustic model, the trained language model and the pronunciation dictionary; 
       Acquiring a plurality of second training dialogue audios, wherein each second training dialogue audio comprises an effective voice training segment, an ineffective voice training segment and a silence training segment; 
       extracting acoustic features of each second training frame in each second training dialogue audio, and inputting the acoustic features into the decoding graph to obtain second sound tags of each second training frame, wherein the second sound tags comprise effective voice, ineffective voice or silence; 
       Labeling a third sound tag of each second training frame, wherein the third sound tag comprises silence or non-silence, and correcting the second sound tag according to the third sound tag; 
       generating a second training data set according to the acoustic characteristics of each second training frame in each second training dialogue audio and the corrected second sound label; 
       Training the speech endpoint detection model based on the second training data set. 
    4. The method of claim 1, further comprising, prior to the step of identifying acoustic features of frames in the actual dialog audio by a speech endpoint detection model:
       Acquiring a plurality of third training dialogue audios, wherein each third training dialogue audio comprises an effective voice training segment and other voice training segments; 
       Extracting acoustic characteristics of each third training frame in each third training dialogue audio, and labeling a fourth sound label of each third training frame, wherein the fourth sound label comprises effective voice or other voices; 
       generating a third training data set according to the acoustic characteristics and the fourth sound label of each third training frame in each third training dialogue audio; 
       Training the speech endpoint detection model based on the third training data set. 
    5. The method of claim 4, wherein the step of identifying acoustic features of frames in the actual dialog audio by a speech endpoint detection model comprises:
       identifying acoustic characteristics of each actual dialogue frame in the actual dialogue audio through a voice endpoint detection model to obtain an initial sound label probability of each actual dialogue frame, wherein the initial sound label probability comprises an initial effective voice probability and an initial other voice probability; 
       adjusting the initial sound label probability of each actual dialogue frame based on a preset weight matrix to obtain a target sound label probability of each actual dialogue frame, wherein the target sound label probability comprises a target effective voice probability and a target other voice probability; 
       And comparing the target effective voice probability of each actual dialogue frame with the target other voice probability, and judging whether each actual dialogue frame is an effective voice frame according to the comparison result. 
    6. The voice quality inspection method according to claim 1, wherein the step of performing role recognition on the call text of each role through a text classification model and a regular matching model to obtain a target call text of a target role comprises:
       Respectively carrying out regular matching on each call text and preset keywords of all roles through a regular matching model to obtain a role matching result of each call text, and scoring according to the role matching result and a first scoring rule to obtain a first score of each call text hitting each role; 
       Performing role classification on each call text through a text classification model to obtain a role classification result of each call text, and scoring according to the role classification result and a second scoring rule to obtain a second score of each call text hitting each role; 
       And according to the first score and the second score, obtaining the comprehensive score of each role hit by each call text, and determining the call text with the highest comprehensive score of the hit target role as the target call text of the target role. 
    7. The method of claim 6, wherein the text classification model includes an information classification model and a role classification model, and the step of performing role classification on each call text through the text classification model to obtain a role classification result of each call text includes:
       Respectively carrying out information classification on each call text through the information classification model to obtain useful texts containing useful information and useless texts not containing useful information in each call text; 
       And carrying out role classification on each useful text through the role classification model to obtain a role classification result of each useful text. 
    8. A voice quality inspection device, comprising:
       the first obtaining module is used for identifying acoustic characteristics of each frame in the original dialogue audio through a ringtone judging model to obtain ringtone frames and non-ringtone frames, and obtaining the actual dialogue audio according to cut-off frames of a ringtone frame sequence and a non-ringtone frame sequence; 
       The second obtaining module is used for identifying acoustic characteristics of each frame in the actual dialogue audio through a voice endpoint detection model to obtain an effective voice frame, and obtaining the effective dialogue audio according to an effective voice frame sequence; 
       The third obtaining module is used for separating each role in the effective dialogue audio through a speaker separation model to obtain the call voice of each role, and identifying the call voice of each role through a voice identification model to obtain the call text of each role; 
       A fourth obtaining module, configured to perform role recognition on the call text of each role through a text classification model and a regular matching model, so as to obtain a target call text of a target role; 
       And the quality inspection module is used for inspecting the quality of the target call text. 
    9. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to run the application program in the memory to perform the steps in the voice quality inspection method according to any one of claims 1 to 7.
    10. A computer readable storage medium having stored thereon a computer program for execution by a processor to perform the steps in the voice quality inspection method of any of claims 1 to 7.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202211641003.4A CN118230768A (en) | 2022-12-19 | 2022-12-19 | Voice quality inspection method and device, electronic equipment and storage medium | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202211641003.4A CN118230768A (en) | 2022-12-19 | 2022-12-19 | Voice quality inspection method and device, electronic equipment and storage medium | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN118230768A (en) | 2024-06-21 | 
Family
ID=91499988
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202211641003.4A Pending CN118230768A (en) | 2022-12-19 | 2022-12-19 | Voice quality inspection method and device, electronic equipment and storage medium | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN118230768A (en) | 
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN119724196A (en) * | 2024-12-03 | 2025-03-28 | 平安科技(深圳)有限公司 | A method, device, equipment and medium for separating roles based on voice | 
- 2022-12-19: CN202211641003.4A (CN118230768A), active, Pending
Similar Documents
| Publication | Title |
|---|---|
| CN112804400B (en) | Customer service call voice quality inspection method and device, electronic equipment and storage medium |
| US20240021202A1 (en) | Method and apparatus for recognizing voice, electronic device and medium |
| WO2020228173A1 (en) | Illegal speech detection method, apparatus and device and computer-readable storage medium |
| WO2024140434A1 (en) | Text classification method based on multi-modal knowledge graph, and device and storage medium |
| CN112102850B (en) | Emotion recognition processing method and device, medium and electronic equipment |
| CN111145733B (en) | Speech recognition method, speech recognition device, computer equipment and computer readable storage medium |
| CN111128223A (en) | Text information-based auxiliary speaker separation method and related device |
| CN108428447A (en) | A method and device for speech intent recognition |
| CN110610707A (en) | Voice keyword recognition method and device, electronic equipment and storage medium |
| CN113744742B (en) | Role identification method, device and system under dialogue scene |
| CN112735385A (en) | Voice endpoint detection method and device, computer equipment and storage medium |
| US9251808B2 (en) | Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof |
| CN112309398B (en) | Method and device for monitoring working time, electronic equipment and storage medium |
| US9799325B1 (en) | Methods and systems for identifying keywords in speech signal |
| US11495234B2 (en) | Data mining apparatus, method and system for speech recognition using the same |
| CN112614510B (en) | Audio quality assessment method and device |
| CN112037772B (en) | Response obligation detection method, system and device based on multiple modes |
| CN112331207B (en) | Service content monitoring method, device, electronic equipment and storage medium |
| JP2015049254A (en) | Speech data recognition system and speech data recognition method |
| CN113129895B (en) | Voice detection processing system |
| CN112201275A (en) | Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium |
| JPWO2020003413A1 (en) | Information processing equipment, control methods, and programs |
| CN110956958A (en) | Searching method, searching device, terminal equipment and storage medium |
| CN110853669B (en) | Audio identification method, device and equipment |
| CN111986675A (en) | Voice dialogue method, device and computer readable storage medium |
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |