
CN118982978B - Intelligent evaluation method, system and medium for call voice - Google Patents


Info

Publication number
CN118982978B
CN118982978B (application number CN202411101107.5A)
Authority
CN
China
Prior art keywords
voice
question
text
answer
response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411101107.5A
Other languages
Chinese (zh)
Other versions
CN118982978A (en)
Inventor
张明全
刘利峰
廖万飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sicheng Technology Co ltd
Original Assignee
Shenzhen Sicheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sicheng Technology Co ltd
Priority to CN202411101107.5A
Publication of CN118982978A
Application granted
Publication of CN118982978B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent evaluation method, system and medium for call voice, belonging to the technical field of voice processing. The method comprises: converting question-answer voice into standard voice based on voice quality and extracting a question-answer text; having a responder output response voice and a corresponding reply text based on the question-answer text; storing the call records to generate a call paragraph and a call recording; respectively extracting the emotion words in the call paragraph and the response-time distribution in the call recording, generating a question-answer score based on the emotion words and a response score based on the response-time distribution; retrieving, in a response database, the reference texts corresponding to the question-understanding character, fusing all the reference texts to generate an abstract text, comparing the coincidence degree between the reply text and the abstract text, and generating a reply score based on the coincidence degree; and summarizing the question-answer score, the response score and the reply score to generate a call evaluation value for the call recording. The invention improves the accuracy and comprehensiveness of the intelligent evaluation of call voice.

Description

Intelligent evaluation method, system and medium for call voice
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to an intelligent evaluation method, system and medium for call voice.
Background
With the development of communication technology, evaluation of call voice quality has become increasingly important. Traditional voice quality assessment mainly relies on manual listening, which is inefficient and highly subjective, and can hardly meet the demands of large-scale, automated assessment; the quality and effectiveness of call voice therefore need to be evaluated automatically, quickly and accurately.
A similar prior art is Chinese patent application CN112735421A, which discloses a real-time quality inspection method and device for voice calls in the technical field of computers. The method collects the voice media stream generated in real time during a voice call, performs voice recognition on the stream to obtain the corresponding text data, performs semantic analysis on the text data, and matches the semantic analysis result against preset quality inspection rules to generate a quality inspection result. By collecting the real-time voice media stream, applying voice recognition and semantic analysis, and then performing rule matching, a quality inspection result can be generated in real time with high efficiency. Another similar prior art, Chinese patent application CN114220419A, discloses a voice evaluation method comprising: obtaining the voice to be evaluated and a reference text; extracting acoustic features from the voice; inputting the acoustic features into the K sub-acoustic models of an acoustic model to obtain a comprehensive state posterior probability; aligning the phonemes of the voice with the reference text according to the comprehensive state posterior probability and a segmentation network constructed from the reference text; obtaining scoring features from the alignment result; and inputting the scoring features into a scoring model to regress the score of the voice. Because the K sub-acoustic models are trained on their own in-set and out-of-set data, and the out-of-set data of each sub-acoustic model differs, the acoustic models are differentiated on out-of-set data, which improves the scoring accuracy for abnormal voices, alleviates the problem of unstable scoring of abnormal voices, and gives the method higher reliability and usability.
However, the above prior art only performs quality inspection on the voice media stream and stabilizes the scoring of abnormal voices; in practice, call voice not only requires voice-quality correction, but the response satisfaction of the call also needs to be evaluated.
Disclosure of Invention
To solve the above problems in the prior art, the invention provides an intelligent evaluation method, system and medium for call voice.
In order to achieve the above object, the present invention provides an intelligent evaluation method for call voice, comprising:
S1, after a call starts, acquiring the question-answer voice of a questioner, recognizing the voice quality of the question-answer voice, converting the question-answer voice into standard voice based on the voice quality, and extracting the text characters of the standard voice and setting them as a question-answer text;
S2, having the responder output response voice based on the question-answer text, acquiring the reply text corresponding to the response voice, and storing the call records between the questioner and the responder so as to generate a call paragraph and a call recording;
S3, respectively extracting the emotion words in the call paragraph and the response-time distribution in the call recording, generating a question-answer score based on the emotion words, and generating a response score based on the response-time distribution;
S4, acquiring the question-understanding character corresponding to the question-answer text, constructing a response database, retrieving the reference texts corresponding to the question-understanding character in the response database, fusing all the reference texts to generate an abstract text, comparing the coincidence degree between the reply text and the abstract text, and generating a reply score based on the coincidence degree;
and S5, summarizing the question-answer score, the response score and the reply score to generate a call evaluation value for the call recording.
Further, converting the question-answer voice into standard voice based on the voice quality comprises the following steps:
Collecting a voice sample set comprising a plurality of normal voices and blurred voices; acquiring the voice waveform set corresponding to the voice sample set; extracting the voice features of all voice waveforms in the set; comparing, based on the voice features, the differences between the normal voice waveforms and the blurred voice waveforms; generating, based on the differences, a voice training model for recognizing blurred voice waveforms; inputting the voice waveform set into the voice training model; and pre-training the voice training model with a self-supervised learning method to generate a voice conversion model;
Inputting the question-answer voice into the voice conversion model, which predicts the pitch and intensity corresponding to the question-answer voice and generates a spectrogram based on them; reconstructing the spectrogram to generate a converted waveform; generating converted voice from the converted waveform by a frequency-domain conversion method; and setting the converted voice as the standard voice.
Further, generating the call evaluation value of the call recording comprises the following steps:
Setting emotion labels comprising pleasure labels, calm labels and irritation labels; establishing an emotion evaluation word bank based on the emotion labels; respectively acquiring the semantic similarity between each phrase in the call paragraph and the emotion evaluation word bank; extracting the emotion words in the call paragraph based on the semantic similarity; and acquiring the emotion label corresponding to each emotion word and setting it as the emotion type;
Building a coordinate system based on the emotion types, accumulating the counts of identical emotion types, and drawing a score histogram in the coordinate system; obtaining the accumulated value of each emotion label in the score histogram; calculating the ratio of the sum of all accumulated values to the accumulated value corresponding to the irritation label; and setting the ratio as the question-answer score;
Dividing the response-time distribution into an accumulated exchange duration and an accumulated silence duration based on the waveform intensity of the call recording; calculating the ratio of the total duration of the response-time distribution to the accumulated silence duration; and setting the ratio as the response score;
Calculating the reply score F by a first formula: $F=\sum_{i=1}^{I}\frac{n_i}{N_i}$, where $i$ is a positive integer from 1 to $I$, $I$ is the total number of reply texts in the call recording, $n_i$ is the number of identical characters shared between the $i$-th reply text and the corresponding $i$-th abstract text, and $N_i$ is the total number of characters in the $i$-th abstract text; and accumulating the question-answer score, the response score and the reply score to generate the call evaluation value.
Further, acquiring the question-understanding character corresponding to the question-answer text comprises the following steps:
Converting the question-answer text into a standard-form text; constructing a knowledge dictionary; comparing the standard-form text with the semantic texts stored in the knowledge dictionary; identifying the entity identifiers and relation identifiers in the standard-form text; setting each relation identifier as a root node and each entity identifier as a child node; connecting the child node corresponding to an entity identifier to the root node corresponding to its associated relation identifier; generating a tree structure based on the root nodes and child nodes; and repeating these steps until all root nodes and child nodes have been incorporated into the tree structure;
And sequentially combining the identifiers corresponding to all nodes in the tree structure based on the arrangement order of the relation identifiers in the standard-form text to generate the question-understanding character, wherein the nodes comprise the root nodes and child nodes, and the identifiers comprise the entity identifiers and relation identifiers.
Further, converting the question-answer text into the standard-form text comprises the following steps:
Setting a sentence standard word structure; acquiring the phrases contained in the question-answer text and their corresponding word types; determining, based on the sentence standard word structure, the missing word types; completing the characters corresponding to the missing word types based on the knowledge dictionary and a semantic matching method, and defining them as completion phrases; ordering and combining the phrases and the completion phrases based on the sentence standard word structure to generate a completion sentence; and setting the completion sentence as the standard-form text.
Further, retrieving the reference texts corresponding to the question-understanding character in the response database comprises:
The response database comprises question-answer information and response information; respectively acquiring the character similarity between each identifier in the question-understanding character and the question-answer information; extracting the question-answer information whose character similarity is greater than or equal to a first preset value and setting it as a key question phrase; acquiring the response information corresponding to the key question phrase in the response database; and combining all such response information and setting it as the reference text.
Further, fusing all the reference texts to generate the abstract text comprises the following steps:
Setting the question-understanding character as a query text; sequentially combining all the reference texts based on the arrangement order of the identifiers in the question-understanding character and setting the result as a source text; and respectively splitting the query text and the source text into a plurality of word sequences;
Setting a binary cross entropy function as the objective function and constructing an abstract generation model, wherein the abstract generation model estimates the importance of each word sequence of the source text based on the word sequences of the query text and outputs a predicted value for each word sequence in the source text based on its importance; setting the word sequences whose predicted values are greater than or equal to a second preset value as abstract sequences; setting an abstract character length; and combining all the abstract sequences, subject to the abstract character length, to generate the abstract text.
Further, whether a question-answer text belongs to the first question-answer turn is judged based on the time sequence; if not, the semantic relatedness between the question-answer text and the preceding question-answer text is calculated, and if the semantic relatedness is greater than or equal to a third preset value, the question-answer text is fused with the abstract text corresponding to the preceding question-answer text to generate a new question-answer text.
The invention also provides an intelligent evaluation system for call voice, which is used to implement the above intelligent evaluation method and mainly comprises:
A voice acquisition module, which acquires the question-answer voice of the questioner after the call starts, recognizes the voice quality of the question-answer voice, converts the question-answer voice into standard voice based on the voice quality, and extracts the text characters of the standard voice and sets them as a question-answer text;
A statistics module, which has the responder output response voice based on the question-answer text, acquires the reply text corresponding to the response voice, and stores the call records between the questioner and the responder so as to generate a call paragraph and a call recording;
An analysis module, which respectively extracts the emotion words in the call paragraph and the response-time distribution in the call recording, generates a question-answer score based on the emotion words, and generates a response score based on the response-time distribution;
An evaluation output module, which acquires the question-understanding character corresponding to the question-answer text, constructs a response database, retrieves the reference texts corresponding to the question-understanding character in the response database, fuses all the reference texts to generate an abstract text, compares the coincidence degree between the reply text and the abstract text, generates a reply score based on the coincidence degree, and summarizes the question-answer score, the response score and the reply score to generate the call evaluation value of the call recording.
The invention also provides a computer storage medium storing program instructions, wherein, when the program instructions run, they control the device on which the computer storage medium is located to execute the above intelligent evaluation method for call voice.
Compared with the prior art, the invention has the following beneficial effects:
The invention first recognizes and corrects the question-answer voice so that normal standard voice can be accurately generated, improving the accuracy of the call voice; it then constructs a response database to intelligently generate an abstract text corresponding to the question-answer text as a reference answer, and evaluates the validity and integrity of the reply text output by the responder against this abstract text; finally, it calculates the call evaluation value of the call recording by separately counting the question-answer score, the response score and the reply score, thereby improving the accuracy of the intelligent evaluation of call voice.
The invention can also improve the speed of call-voice evaluation by using the generated voice conversion model to accurately recognize the voice quality of the question-answer voice.
Drawings
FIG. 1 is a flow chart of steps of an intelligent evaluation method for call voice according to the present invention;
FIG. 2 is an exemplary diagram of generating a tree structure in accordance with the present invention;
Fig. 3 is a block diagram of an intelligent evaluation system for call voice according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another element. For example, a first xx script may be referred to as a second xx script, and similarly, a second xx script may be referred to as a first xx script, without departing from the scope of this disclosure.
As shown in fig. 1, an intelligent evaluation method for call voice includes:
S1, after a call starts, acquiring the question-answer voice of a questioner, recognizing the voice quality of the question-answer voice, converting the question-answer voice into standard voice based on the voice quality, and extracting the text characters of the standard voice and setting them as a question-answer text.
Specifically, in this embodiment, call voice refers to the voice dialogue between a questioner and a responder during a consulting service, and the quality of the consulting service can be improved by intelligently evaluating the call voice. After the call starts, the question-answer voice of the questioner is collected, where the questioner is the consulting party and the question-answer voice is the voice transmitted by the questioner over mobile communication. Voice quality refers to the clarity of the question-answer voice; in general, unclear pronunciation or an unstable communication signal during a voice call may result in poor voice quality, so question-answer voice with poor voice quality needs to be corrected and converted into standard voice, i.e. voice with clear semantics. Text characters are the character information converted from voice to text, and the question-answer text can be generated by a natural language processing model.
S2, having the responder output response voice based on the question-answer text, acquiring the reply text corresponding to the response voice, and storing the call records between the questioner and the responder so as to generate a call paragraph and a call recording.
Specifically, in this embodiment, the responder is a customer service person. All question-answer texts and the corresponding reply texts are combined in order based on the time sequence to generate the call paragraph, and all question-answer voices and the corresponding response voices are combined in order to generate the call recording. The time sequence refers to the temporal feature information; the call paragraph is the text formed in order from all question-answer texts and reply texts in the call voice; and the call recording is all the call voice from the beginning to the end of the call, i.e. all question-answer voices and the corresponding response voices. Summarizing the call paragraph and the call recording of a voice call provides the basis for the subsequent evaluation of the call voice.
S3, respectively extracting the emotion words in the call paragraph and the response-time distribution in the call recording, generating a question-answer score based on the emotion words, and generating a response score based on the response-time distribution.
Specifically, in this embodiment, emotion words are words used to express emotion, including but not limited to negative offensive words, gentle general descriptive words, and positive complimentary words. The response-time distribution refers to the pauses between the question-answer voice and the response voice in the call recording, together with the durations occupied by the question-answer voice and the response voice themselves. The emotion of the questioner can be described by the emotion words to generate the question-answer score, and the response-time distribution describes the timeliness of the responder's responses to generate the response score.
S4, acquiring the question-understanding character corresponding to the question-answer text, constructing a response database, retrieving the reference texts corresponding to the question-understanding character in the response database, fusing all the reference texts to generate an abstract text, comparing the coincidence degree between the reply text and the abstract text, and generating a reply score based on the coincidence degree.
Specifically, in this embodiment, the question-understanding character is the character generated after semantic-understanding analysis of the question-answer text; the response database is a database of reference answers; a reference text is a reference answer retrieved for the question-understanding character in the response database; and the abstract text is the summary character generated by fusing all the reference texts corresponding to the question-understanding character. Simply put, the abstract text is a brief yet complete reference-answer character corresponding to the question-answer text. The coincidence degree is the degree of character similarity between the reply text and the abstract text; all the reply texts in the call recording are counted to generate the reply score, which evaluates the correctness of the responder's response voice.
S5, summarizing the question-answer score, the response score and the reply score to generate a call evaluation value for the call recording.
Specifically, by summarizing the question-answer score, the response score and the reply score, a call evaluation value can be generated, which improves the completeness and accuracy of the intelligent evaluation of call voice.
As a preferred technical solution of the invention, converting the question-answer voice into standard voice based on the voice quality comprises the following steps:
Collecting a voice sample set comprising a plurality of normal voices and blurred voices; acquiring the voice waveform set corresponding to the voice sample set; extracting the voice features of all voice waveforms in the set; comparing, based on the voice features, the differences between the normal voice waveforms and the blurred voice waveforms; generating, based on the differences, a voice training model for recognizing blurred voice waveforms; inputting the voice waveform set into the voice training model; and pre-training the voice training model with a self-supervised learning method to generate a voice conversion model.
Inputting the question-answer voice into the voice conversion model, which predicts the pitch and intensity corresponding to the question-answer voice and generates a spectrogram based on them; reconstructing the spectrogram to generate a converted waveform; generating converted voice from the converted waveform by a frequency-domain conversion method; and setting the converted voice as the standard voice.
Specifically, in this embodiment, the voice sample set is a set of different voices; normal voice is voice with clear, standard pronunciation; blurred voice is voice with relatively unclear pronunciation; and the voice waveform set is the set of voice waveforms, where each voice waveform is the frequency waveform of a voice in the sample set obtained through digital audio processing. A voice feature is a parameter or attribute extracted from a voice waveform that characterizes the voice, and voice feature extraction can be performed with a convolutional neural network. A normal voice waveform is the waveform of a normal voice, and a blurred voice waveform that of a blurred voice; by comparing the voice features of normal and blurred voice waveforms, the differences between them can be derived, and the types of difference include but are not limited to spectral differences, noise differences, sharpness differences and prosodic-feature differences. The voice training model is a pre-learning model: a masked language model can be generated by pre-learning on all the differences, i.e. the masked language model is the voice training model, and all normal and blurred voices in the voice waveform set are input into it for training. Self-supervised learning is a machine learning paradigm in which a model learns from unlabeled data by generating pseudo labels from the structure or properties of the data itself and then training to predict those pseudo labels. The self-supervised learning method allows the voice training model to train itself by predicting the voice features of masked portions, and the pre-trained voice training model is set as the voice conversion model.
The voice conversion model can recognize and predict the voice features of blurred voice and thereby convert blurred voice into normal voice. When the question-answer voice is input into the voice conversion model, a spectrogram can be generated by predicting its pitch and intensity, where pitch relates to the frequency of the voice and intensity to its loudness, both being voice features. The spectrogram is a spectral feature map obtained by Fourier-transforming the voice signal and extracting features with a filter bank, and can be used to visualize and compare the spectral features of the question-answer voice. The converted waveform is a sound waveform generated from the spectrogram by vocoding or spectral regeneration; the frequency-domain conversion method is a signal processing method that converts a time-domain signal into a frequency-domain signal to produce the corresponding sound, i.e. the converted voice. Setting the converted voice as the standard voice yields question-answer voice of higher clarity. The voice conversion model generated by pre-training can effectively recognize and predict the spectral features of the question-answer voice, improving the accuracy and completeness of generating the standard voice.
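By way of illustration only, the following minimal Python sketch shows the spectrogram-and-reconstruction step described above using the third-party librosa library. It is not the patented voice conversion model: a plain magnitude spectrogram stands in for the predicted pitch/intensity features, and Griffin-Lim reconstruction plays the role of the frequency-domain conversion that produces the converted waveform; the FFT parameters are assumptions.

```python
import numpy as np
import librosa

def sketch_standard_voice(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Illustrative stand-in for the voice conversion model: derive a
    spectrogram from the question-answer voice and reconstruct a
    time-domain waveform from it. A trained model would instead predict
    a corrected spectrogram for blurred speech."""
    # Short-time Fourier transform: the spectral feature map ("spectrogram").
    stft = librosa.stft(wav, n_fft=1024, hop_length=256)
    magnitude = np.abs(stft)
    # Griffin-Lim iteratively estimates phase and inverts the spectrogram,
    # i.e. the "reconstruct the spectrogram into a converted waveform" step.
    converted = librosa.griffinlim(magnitude, n_iter=32, hop_length=256)
    return converted
```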
As a preferred technical solution of the invention, generating the call evaluation value of the call recording comprises the following steps:
Setting emotion labels comprising pleasure labels, calm labels and irritation labels; establishing an emotion evaluation word bank based on the emotion labels; respectively acquiring the semantic similarity between each phrase in the call paragraph and the emotion evaluation word bank; extracting the emotion words in the call paragraph based on the semantic similarity; and acquiring the emotion label corresponding to each emotion word and setting it as the emotion type.
Building a coordinate system based on the emotion types, accumulating the counts of identical emotion types, and drawing a score histogram in the coordinate system; obtaining the accumulated value of each emotion label in the score histogram; calculating the ratio of the sum of all accumulated values to the accumulated value corresponding to the irritation label; and setting the ratio as the question-answer score.
Dividing the response-time distribution into an accumulated exchange duration and an accumulated silence duration based on the waveform intensity of the call recording; calculating the ratio of the total duration of the response-time distribution to the accumulated silence duration; and setting the ratio as the response score.
Calculating the reply score F by the first formula: $F=\sum_{i=1}^{I}\frac{n_i}{N_i}$, where $i$ is a positive integer from 1 to $I$, $I$ is the total number of reply texts in the call recording, $n_i$ is the number of identical characters shared between the $i$-th reply text and the corresponding $i$-th abstract text, and $N_i$ is the total number of characters in the $i$-th abstract text; and accumulating the question-answer score, the response score and the reply score to generate the call evaluation value.
Specifically, in this embodiment, the emotion labels are label information for the various emotion types: pleasure labels correspond to words with lively positive emotion, such as "thank you" and "great"; calm labels correspond to words with mild emotion, such as "may I ask", "okay", "hello" and "goodbye"; and irritation labels represent words with agitated emotion, such as "hurry up" and "annoying", as well as profanity. The emotion evaluation word bank is a database of characters with the same or similar meaning as the various emotion labels. The semantic similarity can be computed as the cosine similarity between word vectors, where word vectors map phrases into a continuous vector space so that semantically similar phrases lie closer together; emotion words with higher semantic similarity in the call paragraph can therefore be obtained by setting a threshold. By finding the phrase in the emotion evaluation word bank with the highest semantic similarity to an emotion word, the emotion label of that phrase can be set as the emotion type of the emotion word. For example, if the semantic similarity between the emotion word "thanks" and the character "thank you" under the pleasure label is 0.9, the emotion type of that emotion word is the pleasure label.
The coordinate system is a histogram coordinate system whose horizontal axis is the emotion label and whose vertical axis is the number of emotion words of that emotion type, so that a score histogram can be drawn; the score histogram clearly depicts the frequency distribution of all emotion words in the call paragraph. The accumulated value corresponding to the irritation label is the number of emotion words in the call paragraph whose emotion type is the irritation label. For example, if the total number of emotion words in a call paragraph is 10 and 2 of them belong to the irritation label, the question-answer score is 10/2 = 5.
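A minimal Python sketch of this scoring step follows, under assumed inputs: `embed` is any word-vector lookup function, the seed lexicon and the 0.8 similarity threshold are placeholders, and the label names merely mirror the three emotion labels above.

```python
import numpy as np

# Assumed seed lexicon mirroring the pleasure / calm / irritation labels.
EMOTION_LEXICON = {
    "pleasure": ["thank you", "great"],
    "calm": ["may I ask", "hello", "goodbye"],
    "irritation": ["hurry up", "annoying"],
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def question_answer_score(words, embed, threshold: float = 0.8) -> float:
    """Tag each word by its nearest lexicon seed (cosine similarity),
    then return (total emotion words) / (irritation-label words)."""
    counts = {label: 0 for label in EMOTION_LEXICON}
    for w in words:
        best_label, best_sim = None, 0.0
        for label, seeds in EMOTION_LEXICON.items():
            for seed in seeds:
                sim = cosine(embed(w), embed(seed))
                if sim > best_sim:
                    best_label, best_sim = label, sim
        if best_label and best_sim >= threshold:
            counts[best_label] += 1
    total = sum(counts.values())
    # Example from the text: 10 emotion words, 2 irritated -> 10/2 = 5.
    return total / counts["irritation"] if counts["irritation"] else float(total)
```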
The accumulated exchange duration is the duration of voice in the call recording where the waveform intensity and amplitude vary strongly, and the accumulated silence duration is the duration where they vary little; the response-time distribution thus captures the responder's reaction time and the questioner's question-asking time in the call recording. For example, for a response-time distribution with a total duration of 10 and an accumulated silence duration of 1, the response score is 10/1 = 10.
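A corresponding sketch for the response score, with an assumed short-time-energy silence detector in place of the waveform-intensity split (the frame size and energy threshold are illustrative):

```python
import numpy as np

def response_score(wav: np.ndarray, sr: int, frame_ms: int = 30,
                   energy_threshold: float = 1e-4) -> float:
    """Split the call recording into exchange vs. silence frames by
    short-time energy, then return total duration / silence duration
    (e.g. total 10, silence 1 -> score 10)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wav) // frame_len
    energy = np.array([np.mean(wav[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    silent_frames = int(np.sum(energy < energy_threshold))
    if silent_frames == 0:
        return float("inf")  # no silence detected at all
    return n_frames / silent_frames
```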
The first formula sums the coincidence degree between each reply text in the call recording and its reference. For example, if the call recording contains 2 reply texts, where the coincidence degree of reply text L1 with its abstract text is 8/10 = 0.8 and that of reply text L2 with its abstract text is 2/5 = 0.4, the reply score is 0.8 + 0.4 = 1.2. The call evaluation value is then 5 + 10 + 1.2 = 16.2. The higher the call evaluation value, the more effective the call voice.
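The first formula itself reduces to a few lines; the character-overlap count below is one simple reading of "number of identical characters" and reproduces the worked example (0.8 + 0.4 = 1.2):

```python
def reply_score(reply_texts, abstract_texts) -> float:
    """F = sum over i of n_i / N_i, where n_i counts characters of the
    i-th reply text that also occur in the i-th abstract text and N_i
    is the character count of that abstract text."""
    score = 0.0
    for reply, abstract in zip(reply_texts, abstract_texts):
        if not abstract:
            continue
        n_i = sum(1 for ch in reply if ch in abstract)  # simplified overlap
        score += n_i / len(abstract)
    return score
```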
As a preferred technical solution of the invention, acquiring the question-understanding character corresponding to the question-answer text comprises the following steps:
Converting the question-answer text into a standard-form text; constructing a knowledge dictionary; comparing the standard-form text with the semantic texts stored in the knowledge dictionary; identifying the entity identifiers and relation identifiers in the standard-form text; setting each relation identifier as a root node and each entity identifier as a child node; connecting the child node corresponding to an entity identifier to the root node corresponding to its associated relation identifier; generating a tree structure based on the root nodes and child nodes; and repeating these steps until all root nodes and child nodes have been incorporated into the tree structure.
And sequentially combining the identifiers corresponding to all nodes in the tree structure based on the arrangement order of the relation identifiers in the standard-form text to generate the question-understanding character, wherein the nodes comprise the root nodes and child nodes, and the identifiers comprise the entity identifiers and relation identifiers.
Specifically, in this embodiment, the standard-form text is a character string in a preset form generated by adjusting or supplementing the word order of the question-answer text; for example, with the preset form "subject + object + predicate", the question-answer text "how is application software A used" is converted into the standard-form text "what is the method by which you use application software A" (the specific conversion method is described later). The standard-form text therefore makes the question-answer text more complete. The knowledge dictionary is a dictionary set describing the part-of-speech features of various phrases and storing their semantic texts, where a semantic text is the word-sense interpretation of a phrase; for example, the part-of-speech feature of "method" is a noun, and its semantic text is "a way or means of solving a problem". Entity identifiers and relation identifiers are elements of the semantic text: an entity identifier denotes a specific concept or entity and is unique, while a relation identifier describes the relation or attribute between two entity identifiers. For example, in the standard-form text "what is the method by which you use application software A", the entity identifiers are "you", "application software A" and "method", and the relation identifiers are "use" and "what is". The tree structure is composed of root nodes and child nodes and is constructed from the entity and relation identifiers: each entity identifier is a node, and the relation identifiers describe and define how the nodes are connected. Repeating this step incorporates all entity and relation identifiers in the standard-form text into the tree structure as child nodes or root nodes.
As shown in fig. 2, the tree structure is composed of several root nodes and child nodes in order. The bottom-layer child nodes are connected to their root nodes in turn and combined in order with the child and root nodes of the other layers to generate the question-understanding character; for example, the arrangement order combines into "you" -> "use" -> "application software A" -> "what is" -> "method", so the question-understanding character corresponding to the question-answer text "how is application software A used" is "what is the method by which you use application software A".
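A minimal sketch of the tree construction, assuming the entity and relation identifiers and their associations have already been recognized against the knowledge dictionary (the dictionary-shaped input format here is an assumption, not the patent's data model):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    identifier: str
    children: List["Node"] = field(default_factory=list)

def build_tree(relation_links: Dict[str, List[str]]) -> List[Node]:
    """Each relation identifier becomes a root node; the entity
    identifiers associated with it become its child nodes."""
    return [Node(relation, [Node(e) for e in entities])
            for relation, entities in relation_links.items()]

# Example from the description ("what is the method by which you use
# application software A"): two relations with their associated entities.
roots = build_tree({
    "use": ["you", "application software A"],
    "what is": ["method"],
})
# Combining the node identifiers in the arrangement order of the relation
# identifiers in the standard-form text then yields the question-understanding
# character: "you" -> "use" -> "application software A" -> "what is" -> "method".
```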
As a preferred technical solution of the invention, converting the question-answer text into the standard-form text comprises the following steps:
Setting a sentence standard word structure; acquiring the phrases contained in the question-answer text and their corresponding word types; determining, based on the sentence standard word structure, the missing word types; completing the characters corresponding to the missing word types based on the knowledge dictionary and a semantic matching method, and defining them as completion phrases; ordering and combining the phrases and the completion phrases based on the sentence standard word structure to generate a completion sentence; and setting the completion sentence as the standard-form text.
Specifically, in this embodiment, the sentence standard word structure is a set of rules for constructing and organizing language elements (words, phrases, sentences) that determines how the question-answer text is converted into semantic text and how complex semantic relations are expressed in it. For example, the syntactic structure in the sentence standard word structure may be set to "subject + object + predicate", corresponding to the preset form. A word type is the grammatical role of a phrase, including but not limited to subject, predicate and object; the phrases are the words split out of the question-answer text; and a missing word type is a word type absent when the present word types are compared against the sentence standard word structure. The semantic matching method matches the components of the question-answer text against the data in the knowledge dictionary to determine the missing word types and to generate from the knowledge dictionary the phrases that complete the question-answer text, i.e. the completion phrases; it can be implemented by a machine learning algorithm, for example a BERT model (Bidirectional Encoder Representations from Transformers). The completion sentence is the text whose word types match the sentence standard word structure and which is semantically most reasonable with respect to the question-answer text; it is therefore set as the standard-form text.
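A sketch of the completion step, assuming part-of-speech tagging has already produced (word type, phrase) pairs and using a toy knowledge dictionary in place of BERT-based semantic matching; the structure, dictionary entries and example are all illustrative:

```python
# Assumed standard word structure from the example ("subject + object + predicate").
STANDARD_STRUCTURE = ["subject", "object", "predicate"]

# Toy knowledge dictionary: a default completion per missing word type.
KNOWLEDGE_DICTIONARY = {
    "subject": "you",
    "predicate": "what is the method of using",
}

def to_standard_form(tagged_phrases):
    """tagged_phrases: list of (word_type, phrase) pairs from the
    question-answer text. Missing word types are completed from the
    dictionary; a real system would pick completions by semantic
    matching, e.g. with a BERT model."""
    present = dict(tagged_phrases)
    parts = [present.get(t, KNOWLEDGE_DICTIONARY.get(t, ""))
             for t in STANDARD_STRUCTURE]
    return " ".join(p for p in parts if p)

# "how is application software A used": only the object is present.
print(to_standard_form([("object", "application software A")]))
```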
As a preferred technical solution of the invention, retrieving the reference texts corresponding to the question-understanding character in the response database comprises:
The response database comprises question-answer information and response information; respectively acquiring the character similarity between each identifier in the question-understanding character and the question-answer information; extracting the question-answer information whose character similarity is greater than or equal to a first preset value and setting it as a key question phrase; acquiring the response information corresponding to the key question phrase in the response database; and combining all such response information and setting it as the reference text.
Specifically, in this embodiment, question-answer information is the label information of a question topic, response information is the answer text corresponding to that question-answer information, and the response database is built from both. The character similarity measures how similar two character strings are; the character similarity P can be calculated by a second formula: $P=\frac{m}{M}$, where $m$ is the number of identical characters shared by the identifier and the question-answer information, and $M$ is the total number of characters of the identifier. For example, for the identifier "application software A" and the question-answer information "application software", the similarity over the original characters is 4/5 = 0.8. The first preset value is a preset threshold: question-answer information at or above it is set as a key question phrase, indicating a candidate reference question whose corresponding response information is a candidate reference answer. Combining all such response information yields a complete reference text containing all response information possibly relevant to the question-answer text; because it may also contain overly broad or insufficiently condensed response information, the reference text needs to be fused into an abstract text.
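The second formula and the retrieval step fit in a short sketch; the 0.8 preset value and the (question-answer information, response information) pair schema are assumptions:

```python
def char_similarity(identifier: str, qa_info: str) -> float:
    """Second formula: P = m / M, where m is the number of characters of
    the identifier that also occur in the question-answer information and
    M is the identifier's total character count."""
    if not identifier:
        return 0.0
    m = sum(1 for ch in identifier if ch in qa_info)
    return m / len(identifier)

def retrieve_reference_text(identifiers, database, first_preset=0.8) -> str:
    """database: iterable of (qa_info, response_info) pairs. Question-answer
    information matching any identifier at or above the preset value becomes
    a key question phrase, and its response information is collected."""
    responses = [resp for qa, resp in database
                 if any(char_similarity(idf, qa) >= first_preset
                        for idf in identifiers)]
    return " ".join(responses)  # the combined reference text
```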
As a preferred technical solution of the invention, fusing all the reference texts to generate the abstract text comprises the following steps:
Setting the question-understanding character as a query text; sequentially combining all the reference texts based on the arrangement order of the identifiers in the question-understanding character and setting the result as a source text; and respectively splitting the query text and the source text into a plurality of word sequences.
Setting a binary cross entropy function as the objective function and constructing an abstract generation model, wherein the abstract generation model estimates the importance of each word sequence of the source text based on the word sequences of the query text and outputs a predicted value for each word sequence in the source text based on its importance; setting the word sequences whose predicted values are greater than or equal to a second preset value as abstract sequences; setting an abstract character length; and combining all the abstract sequences, subject to the abstract character length, to generate the abstract text.
Specifically, in this embodiment, several sentence-form source texts are first generated by sequentially combining the corresponding reference texts in the arrangement order of the identifiers in the question-understanding character. A word sequence represents the positional order in which a phrase or its characters occur.
The binary cross entropy function is a loss function commonly used for classification problems, particularly binary classification. When the abstract generation model is constructed, binary cross entropy can be used as the objective function to train the model, measuring and minimizing the difference between the generated abstract and the reference abstract. The abstract generation model is a neural network built around this objective: it outputs a one-dimensional word-selection score for each word sequence in the source text, which is set as the importance of that word sequence, and it predicts the probability that each word sequence of the source text is included in the abstract, which is set as the predicted value of that word sequence. Word sequences with predicted values greater than or equal to the second preset value are extracted as abstract sequences, removing unnecessary word sequences. The abstract character length is the preset total number of characters of the abstract text; setting it keeps the abstract text concise.
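A toy PyTorch sketch of an extractive selector trained with the binary cross entropy objective follows; the network shape, the pooled 64-dimensional vectors, and the random training data are all illustrative stand-ins, not the patent's model:

```python
import torch
import torch.nn as nn

class SummarySelector(nn.Module):
    """Scores each source word sequence against the query; trained with
    binary cross entropy on 'in summary / not in summary' labels."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def forward(self, query_vec: torch.Tensor, seq_vecs: torch.Tensor):
        # Pair every source sequence vector with the query representation.
        q = query_vec.expand(seq_vecs.size(0), -1)
        logits = self.scorer(torch.cat([q, seq_vecs], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)  # predicted values in [0, 1]

model = SummarySelector(dim=64)
loss_fn = nn.BCELoss()                     # the binary cross entropy objective
query = torch.randn(1, 64)                 # pooled query-text representation
seqs = torch.randn(10, 64)                 # 10 source word-sequence vectors
labels = torch.randint(0, 2, (10,)).float()
loss = loss_fn(model(query, seqs), labels)
# At inference: keep sequences with predicted value >= the second preset
# value, then trim the combination to the configured abstract character length.
```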
As a preferred technical solution of the invention, whether a question-answer text belongs to the first question-answer turn is judged based on the time sequence; if not, the semantic relatedness between the question-answer text and the preceding question-answer text is calculated, and if the semantic relatedness is greater than or equal to a third preset value, the question-answer text is fused with the abstract text corresponding to the preceding question-answer text to generate a new question-answer text.
Specifically, in this embodiment, after the call starts, the order of the question-answer voices and response voices is recorded, and the questioner's first question-answer text is set as the first turn. The semantic relatedness between the current question-answer text and the preceding one is judged by a character-similarity calculation; if the semantic relatedness of two adjacent question-answer texts is greater than or equal to the third preset value, the two are highly related, so the abstract text corresponding to the preceding question-answer text can be fused with the current question-answer text to generate a new, completed question-answer text. For example, if the question-answer text is "and application software B?" and the preceding question-answer text is "how is application software A used", fusing the abstract text of the preceding turn can rewrite "and application software B?" into the new question-answer text "how is application software B used". This links related questions across a voice call and improves call effectiveness.
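A sketch of this multi-turn fusion rule; the semantic relatedness function, the third preset value, and the naive string fusion are assumptions:

```python
def fuse_question(qa_text: str, prev_qa_text, prev_abstract: str,
                  relatedness, third_preset: float = 0.6,
                  is_first: bool = False) -> str:
    """If this is not the first turn and the new question is strongly
    related to the preceding one, fuse the preceding turn's abstract
    text into it, e.g. 'and application software B?' ->
    'how is application software B used'."""
    if is_first or not prev_qa_text:
        return qa_text
    if relatedness(qa_text, prev_qa_text) >= third_preset:
        return f"{prev_abstract} {qa_text}"  # naive fusion for illustration
    return qa_text
```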
As shown in fig. 3, the present invention further provides an intelligent evaluation system for call voice, where the system is configured to implement the above-mentioned intelligent evaluation method for call voice, and the system mainly includes:
A voice acquisition module, which acquires the question-answer voice of the questioner after the call starts, recognizes the voice quality of the question-answer voice, converts the question-answer voice into standard voice based on the voice quality, and extracts the text characters of the standard voice and sets them as a question-answer text.
A statistics module, which has the responder output response voice based on the question-answer text, acquires the reply text corresponding to the response voice, and stores the call records between the questioner and the responder so as to generate a call paragraph and a call recording.
An analysis module, which respectively extracts the emotion words in the call paragraph and the response-time distribution in the call recording, generates a question-answer score based on the emotion words, and generates a response score based on the response-time distribution.
An evaluation output module, which acquires the question-understanding character corresponding to the question-answer text, constructs a response database, retrieves the reference texts corresponding to the question-understanding character in the response database, fuses all the reference texts to generate an abstract text, compares the coincidence degree between the reply text and the abstract text, generates a reply score based on the coincidence degree, and summarizes the question-answer score, the response score and the reply score to generate the call evaluation value of the call recording.
The invention also provides a computer storage medium storing program instructions, wherein, when the program instructions run, they control the device on which the computer storage medium is located to execute the above intelligent evaluation method for call voice.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the various embodiments may comprise multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of computer programs, which may be stored on a non-transitory computer readable storage medium, and which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (SYNCHLINK) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the foregoing embodiments may be arbitrarily combined, and for brevity, all of the possible combinations of the technical features of the foregoing embodiments are not described, however, they should be considered as the scope of the disclosure as long as there is no contradiction between the combinations of the technical features.
The foregoing examples illustrate only a few embodiments of the invention, and while they are described in some detail, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the invention, all of which fall within the protection scope of the invention. Accordingly, the protection scope of the invention shall be determined by the appended claims.
The foregoing description of the preferred embodiments is not intended to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the invention shall be included within its protection scope.

Claims (9)

1. An intelligent evaluation method for call voice is characterized by comprising the following steps:
S1, acquiring the question-answer voice of the questioner after the call starts, recognizing the voice quality of the question-answer voice, converting the question-answer voice into a standard voice based on the voice quality, extracting the text characters of the standard voice, and setting the text characters as a question-answer text;
Wherein the converting the question-answer voice into the standard voice based on the voice quality comprises the following steps:
Collecting a voice sample set, wherein the voice sample set comprises a plurality of normal voices and fuzzy voices; acquiring the voice waveform set corresponding to the voice sample set; extracting the voice characteristics of all voice waveforms in the voice waveform set; comparing, based on the voice characteristics, the differences between the normal voice waveforms and the fuzzy voice waveforms in the voice waveform set; generating, based on the differences, a voice training model for recognizing fuzzy voice waveforms; and inputting the voice waveform set into the voice training model and pre-training the voice training model by a self-supervised learning method to generate a voice conversion model;
Inputting the question-answer voice into the voice conversion model, predicting, by the voice conversion model, the pitch and intensity corresponding to the question-answer voice, generating a spectrogram based on the pitch and the intensity, reconstructing the spectrogram to generate a converted waveform, generating a converted voice from the converted waveform based on a frequency-domain conversion method, and setting the converted voice as the standard voice;
S2, the respondent outputs a response voice based on the question-answer text; acquiring the response text corresponding to the response voice, and storing the call records between the questioner and the respondent so as to generate call paragraphs and a call record;
S3, respectively extracting the emotion words in the call paragraphs and the response time distribution in the call record, generating a question-answer score based on the emotion words, and generating an answer score based on the response time distribution;
S4, acquiring the question understanding character corresponding to the question-answer text, constructing a response database, retrieving the reference texts corresponding to the question understanding character in the response database, fusing all the reference texts to generate an abstract text, comparing the coincidence degree between the response text and the abstract text, and generating a response score based on the coincidence degree;
S5, summarizing the question-answer score, the answer score and the response score to generate a call evaluation value of the call record.
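By way of illustration only (not part of the claimed subject matter): the claim does not name the frequency-domain conversion method, so the sketch below uses Griffin-Lim phase recovery via librosa as one standard stand-in for the "converted waveform → converted voice" step; the hop length and normalization are assumptions:

```python
import numpy as np
import librosa

def spectrogram_to_standard_voice(magnitude: np.ndarray,
                                  hop_length: int = 256) -> np.ndarray:
    """Invert a predicted magnitude spectrogram to a time-domain waveform
    via Griffin-Lim iterative phase recovery, then peak-normalize the
    result to serve as the 'standard voice'."""
    waveform = librosa.griffinlim(magnitude, n_iter=32, hop_length=hop_length)
    return waveform / max(np.abs(waveform).max(), 1e-9)
```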
2. The method of claim 1, wherein generating the call evaluation value for the call record comprises:
Setting emotion labels, wherein the emotion labels comprise a pleasure label, a calm label and a gloomy label; establishing an emotion evaluation word bank based on the emotion labels; respectively acquiring the semantic similarity between any phrase in the call paragraph and the emotion evaluation word bank; extracting the emotion words in the call paragraph based on the semantic similarity; and acquiring the emotion labels corresponding to the emotion words and setting them as emotion types;
Building a coordinate system based on the emotion types, accumulating the counts of identical emotion types, drawing a score histogram in the coordinate system, obtaining the accumulated value of each emotion label in the score histogram, calculating the ratio of the sum of all accumulated values to the accumulated value corresponding to the gloomy label, and setting the ratio as the question-answer score;
Dividing the response time distribution into an accumulated exchange duration and an accumulated silence duration based on the waveform intensity of the call recording, calculating the ratio of the total duration of the response time distribution to the accumulated silence duration, and setting the ratio as the answer score;
Calculating the response score F based on a first formula F = (1/I) · Σ_{i=1}^{I} (n_i / N_i), wherein i is a positive integer from 1 to I, I is the total number of response texts in the call record, n_i is the number of identical characters shared between the i-th response text and the corresponding i-th abstract text, and N_i is the total number of characters in the i-th abstract text; and accumulating the question-answer score, the answer score and the response score to generate the call evaluation value.
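An illustrative sketch of the first formula as reconstructed above (the 1/I averaging factor is an assumption drawn from the variable definitions, since the formula image is absent from the source); pairs holds one (response text, abstract text) tuple per question-answer round:

```python
from collections import Counter

def response_score(pairs: list[tuple[str, str]]) -> float:
    """F = (1/I) * sum_i (n_i / N_i): mean character overlap between the
    i-th response text and its corresponding i-th abstract text."""
    I = len(pairs)
    total = 0.0
    for response_text, abstract_text in pairs:
        n_i = sum((Counter(response_text) & Counter(abstract_text)).values())
        N_i = max(len(abstract_text), 1)
        total += n_i / N_i
    return total / max(I, 1)
```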
3. The method of claim 1, wherein the acquiring the question understanding character corresponding to the question-answer text comprises the following steps:
Converting the question-answer text into a standard-form text, constructing a knowledge dictionary, comparing the standard-form text with the semantic texts stored in the knowledge dictionary, identifying the thing identifiers and the relation identifiers in the standard-form text, setting each relation identifier as a root node and each thing identifier as a child node, connecting the child node corresponding to a thing identifier to the root node corresponding to the associated relation identifier, generating a tree structure based on the root nodes and the child nodes, and repeating the above steps until all root nodes and all child nodes are incorporated into the tree structure;
Sequentially combining the identifiers corresponding to all nodes in the tree structure based on the arrangement order of the relation identifiers in the standard-form text to generate the question understanding character, wherein the nodes comprise the root nodes and the child nodes, and the identifiers comprise the thing identifiers and the relation identifiers.
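An illustrative sketch of this claim, assuming the knowledge-dictionary comparison has already labeled the identifiers as ordered (kind, text) pairs; the dictionary lookup itself is out of scope here:

```python
def question_understanding_character(tokens: list[tuple[str, str]]) -> str:
    """Each relation identifier becomes a root node; each following thing
    identifier attaches as a child of the most recent root. Identifiers
    are then combined in relation order. Thing identifiers appearing
    before any relation are skipped in this simplification."""
    trees: list[dict] = []
    for kind, text in tokens:
        if kind == "relation":
            trees.append({"root": text, "children": []})
        elif trees:  # thing identifier
            trees[-1]["children"].append(text)
    return "".join(t["root"] + "".join(t["children"]) for t in trees)
```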
4. The method according to claim 3, wherein the converting the question-answer text into the standard-form text comprises the following steps:
Setting a standard sentence word structure; acquiring the phrases contained in the question-answer text and their corresponding word types; obtaining, based on the standard sentence word structure, the word types missing from those word types; completing the characters corresponding to the missing word types based on the knowledge dictionary and a semantic matching method and defining those characters as complementary phrases; ordering and combining the phrases and the complementary phrases based on the standard sentence word structure to generate a completed sentence; and setting the completed sentence as the standard-form text.
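A minimal sketch of the completion step, reducing "semantic matching" to a direct dictionary lookup; the three-slot standard structure is an assumed example, not the structure fixed by the claim:

```python
SENTENCE_SLOTS = ("subject", "predicate", "object")  # assumed standard structure

def to_standard_form(phrases: dict[str, str],
                     knowledge_dictionary: dict[str, str]) -> str:
    """Fill any slot missing from the question-answer text with a
    complementary phrase drawn from the knowledge dictionary, then
    order the slots per the standard structure."""
    completed = {
        slot: phrases.get(slot) or knowledge_dictionary.get(slot, "")
        for slot in SENTENCE_SLOTS
    }
    return " ".join(completed[s] for s in SENTENCE_SLOTS if completed[s])
```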
5. The method according to claim 3, wherein the retrieving the reference text corresponding to the question understanding character in the response database comprises:
The response database comprises question-answer information and response information; respectively acquiring the character similarity between any identifier in the question understanding character and the question-answer information, extracting the question-answer information whose character similarity is greater than or equal to a first preset value and setting it as a key question phrase, acquiring the response information corresponding to the key question phrase in the response database, and combining all the response information and setting it as the reference text.
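A sketch of the retrieval step using difflib's sequence similarity as the character-similarity measure (the claim does not fix one); the qa_db shape and the 0.6 default for the first preset value are illustrative:

```python
import difflib

def retrieve_reference_text(identifiers: list[str],
                            qa_db: list[tuple[str, str]],
                            first_preset: float = 0.6) -> str:
    """Keep every (question info, response info) entry whose question
    information is similar enough to any identifier, then merge the
    matching response information into the reference text."""
    references = []
    for question_info, response_info in qa_db:
        similarity = max(
            (difflib.SequenceMatcher(None, ident, question_info).ratio()
             for ident in identifiers),
            default=0.0,
        )
        if similarity >= first_preset:
            references.append(response_info)
    return " ".join(references)
```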
6. The method of claim 5, wherein the fusing all the reference texts to generate the abstract text comprises the following steps:
Setting the question understanding character as a query text; sequentially combining all the reference texts based on the arrangement order of the identifiers in the question understanding character and setting the combined reference text as a source text; and splitting the query text and the source text into a plurality of word sequences respectively;
Setting a binary cross-entropy function as the objective function and constructing an abstract generation model, wherein the abstract generation model estimates the importance of each word sequence of the source text based on the word sequences of the query text, outputs a predicted value for each word sequence in the source text based on the importance, sets the word sequences whose predicted values are greater than or equal to a second preset value as abstract sequences, sets an abstract character length, and combines all the abstract sequences based on the abstract character length to generate the abstract text.
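A sketch of the inference side of this claim; predict stands in for a scorer assumed to have been trained with the binary cross-entropy objective shown in bce, and the 0.5 threshold (the "second preset value") and 200-character cap are illustrative:

```python
import math
from typing import Callable, List

def bce(label: float, prob: float, eps: float = 1e-9) -> float:
    """Binary cross-entropy objective for training the importance scorer."""
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))

def generate_abstract(query_seqs: List[str], source_seqs: List[str],
                      predict: Callable[[List[str], str], float],
                      second_preset: float = 0.5,
                      abstract_chars: int = 200) -> str:
    """Keep source word sequences whose predicted importance meets the
    threshold, greedily respecting the abstract character length."""
    kept, used = [], 0
    for seq in source_seqs:
        if (predict(query_seqs, seq) >= second_preset
                and used + len(seq) <= abstract_chars):
            kept.append(seq)
            used += len(seq)
    return "".join(kept)
```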
7. The method according to claim 1, wherein whether the question-answer text belongs to the first question-answer is judged based on the time sequence; if not, the semantic relevance between the question-answer text and the previous question-answer text is calculated, and if the semantic relevance is greater than or equal to a third preset value, the question-answer text and the abstract text corresponding to the previous question-answer text are fused to generate a new question-answer text.
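A sketch of the follow-up-question handling in this claim; relevance stands in for any semantic-relevance measure, and 0.7 for the "third preset value":

```python
from typing import Callable, Optional

def contextualize_question(qa_text: str,
                           prev_qa_text: Optional[str],
                           prev_abstract: str,
                           relevance: Callable[[str, str], float],
                           third_preset: float = 0.7) -> str:
    """If this is not the first question-answer and it is semantically
    related to the previous one, fuse the previous abstract text into
    the new question-answer text."""
    if prev_qa_text is None:  # first question-answer of the call
        return qa_text
    if relevance(qa_text, prev_qa_text) >= third_preset:
        return prev_abstract + " " + qa_text
    return qa_text
```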
8. An intelligent evaluation system for call voice for implementing the method of any one of claims 1-7, characterized in that the system comprises the following modules:
The voice acquisition module is used for acquiring the question-answer voice of the questioner after the call starts, recognizing the voice quality of the question-answer voice, converting the question-answer voice into a standard voice based on the voice quality, extracting the text characters of the standard voice and setting the text characters as a question-answer text; collecting a voice sample set, wherein the voice sample set comprises a plurality of normal voices and fuzzy voices; acquiring the voice waveform set corresponding to the voice sample set; extracting the voice characteristics of all voice waveforms in the voice waveform set; comparing, based on the voice characteristics, the differences between the normal voice waveforms and the fuzzy voice waveforms in the voice waveform set; generating, based on the differences, a voice training model for recognizing fuzzy voice waveforms; inputting the voice waveform set into the voice training model and pre-training the voice training model by a self-supervised learning method to generate a voice conversion model; and inputting the question-answer voice into the voice conversion model, predicting, by the voice conversion model, the pitch and intensity corresponding to the question-answer voice, generating a spectrogram based on the pitch and the intensity, reconstructing the spectrogram to generate a converted waveform, generating a converted voice from the converted waveform based on a frequency-domain conversion method, and setting the converted voice as the standard voice;
The statistics module is used for acquiring the response voice output by the respondent based on the question-answer text, acquiring the response text corresponding to the response voice, and storing the call records between the questioner and the respondent so as to generate the call paragraphs and the call record;
The analysis module is used for respectively extracting the emotion words in the call paragraphs and the response time distribution in the call record, generating the question-answer score based on the emotion words and generating the answer score based on the response time distribution;
The evaluation output module is used for acquiring the question understanding character corresponding to the question-answer text, constructing a response database, retrieving the reference texts corresponding to the question understanding character in the response database, fusing all the reference texts to generate the abstract text, comparing the coincidence degree between the response text and the abstract text, generating the response score based on the coincidence degree, and summarizing the question-answer score, the answer score and the response score to generate the call evaluation value of the call record.
9. A computer storage medium having program instructions stored thereon, wherein, when the program instructions run, the device on which the computer storage medium is located is controlled to perform the method of any one of claims 1-7.
CN202411101107.5A 2024-08-12 2024-08-12 Intelligent evaluation method, system and medium for call voice Active CN118982978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411101107.5A CN118982978B (en) 2024-08-12 2024-08-12 Intelligent evaluation method, system and medium for call voice

Publications (2)

Publication Number Publication Date
CN118982978A (en) 2024-11-19
CN118982978B (en) 2025-04-29

Family

ID=93452048


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant