
CN108962284B - Voice recording method and device


Info

Publication number: CN108962284B
Application number: CN201810725856.3A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN108962284A
Prior art keywords: recording, text, initial, signal-to-noise ratio
Legal status: Active (granted)
Inventors: 李栋梁, 江键, 江源, 王智国, 胡国平, 胡郁, 刘庆峰
Current assignee: iFlytek Co Ltd
Original assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN201810725856.3A
Publication of CN108962284A
Application granted
Publication of CN108962284B

Classifications

    • Section G (Physics); G10 (Musical instruments; acoustics); G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/26: Speech to text systems
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Document Processing Apparatus (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a voice recording method and device. The method records the user's read-aloud speech while the user reads a target text, obtaining an initial recording; it then detects the recording environment and/or the recording quality of the initial recording and judges whether the detection result is qualified. If the detection result is qualified, the initial recording is kept as a target recording; if it is unqualified, the initial recording is discarded. In this way, after the target text read aloud by the user has been recorded, the recording environment and/or recording quality are checked, recordings that pass the check are retained as target recordings, recordings that fail are discarded, and the retained target recordings can then be used to build a voice library, which improves the quality of the recording data in the voice library.

Description

Voice recording method and device
Technical Field
The present application relates to the field of speech signal processing technologies, and in particular, to a method and an apparatus for recording speech.
Background
With the development of science and technology, the demand for personalized voice customization in fields such as toys, home products and medical care keeps growing. For example, a child may want the toy to tell stories, at any time, in the voice of a parent who works away from home or is on a business trip; empty-nest elderly people may want to hear their children's voices at home; and a cancer patient may wish to leave his or her own voice behind to comfort family members. These application requirements can be met by personalized speech synthesis techniques.
Meeting such personalized voice applications requires building a personalized voice library. In existing personalized speech synthesis systems, the user records autonomously according to a text to be recorded provided by the system, and the user's recording data is then used directly to construct the voice library. However, building the voice library directly from the user's recordings can result in poor-quality recording data in the library.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a voice recording method and device that can improve the quality of recorded data.
The embodiment of the application provides a voice recording method, which comprises the following steps:
recording the reading voice in the process of reading the target text by the user to obtain an initial recording;
detecting the recording environment and/or the recording quality of the initial recording;
judging whether the detection result of the recording environment and/or the recording quality is qualified or not;
if so, taking the initial recording as a target recording, and keeping the target recording;
and if not, discarding the initial recording.
Optionally, after discarding the initial recording, the method further includes:
outputting a prompt for re-recording the target text;
and after the prompt is output, if the user is detected to read the target text again, continuing to execute the step of recording the read voice.
Optionally, the detecting the recording environment of the initial recording includes:
segmenting the initial recording into speech segments and non-speech segments;
calculating the signal-to-noise ratio of the voice segment;
correspondingly, the judging whether the detection result of the recording environment is qualified includes:
judging whether the signal-to-noise ratio of the voice segment is greater than a preset first signal-to-noise ratio threshold value or not;
if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold reaches a first preset proportion, determining that the detection result of the recording environment is qualified;
and if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion, determining that the detection result of the recording environment is unqualified.
Optionally, after the number of the signal-to-noise ratios greater than the first signal-to-noise ratio threshold reaches a first preset ratio, the method further includes:
if the initial recording is not the first recording of the current recording session, obtaining the average of the signal-to-noise ratios of at least one recording recorded before the initial recording as the mean value of the signal-to-noise ratio;
judging whether the absolute value of the difference between the signal-to-noise ratio of the voice segment and the mean value of the signal-to-noise ratio is larger than a preset second signal-to-noise ratio threshold value or not;
if the number of the signal-to-noise ratios which are larger than the second signal-to-noise ratio threshold reaches a second preset proportion, executing the step of determining that the detection result of the recording environment is unqualified;
and if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold value does not reach a second preset proportion, executing the step of determining that the detection result of the recording environment is qualified.
Optionally, the detecting the recording quality of the initial recording includes:
carrying out voice recognition on the initial recording to obtain a recognition text;
determining a text correctness rate of the recognized text, wherein the text correctness rate is a ratio of matched text to the target text, and the matched text is text content matched with the target text in the recognized text;
correspondingly, the judging whether the detection result of the recording quality is qualified includes:
judging whether the text accuracy is greater than a preset accuracy threshold value or not;
if so, determining that the detection result of the recording quality is qualified;
and if not, determining that the detection result of the recording quality is unqualified.
Optionally, before detecting the recording environment and/or the recording quality of the initial recording, the method further includes:
and performing energy normalization on the initial recording to enable energy variation between the initial recording and other recorded recordings to tend to be smooth.
Optionally, the energy warping the initial recording includes:
determining the amplitude value of each sampling point in the initial recording, and sorting the amplitude values in descending order;
taking at least two of the top-ranked amplitude values, and calculating the average of the at least two amplitude values;
if the average value is greater than or equal to a preset upper limit value of the amplitude value, obtaining an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value;
if the average value is smaller than a preset lower limit value of the amplitude value, obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value;
and performing energy warping on the initial recording by using the energy warping coefficient.
Optionally, if the target text is a to-be-recorded text in a pre-constructed recording text set, the recording text set is constructed in the following manner:
splitting the collected original text corpus into unit texts to form a first text set;
selecting a preset number of unit texts from the first text set to form a second text set, wherein the second text set is equal to or similar to the first text set in terms of text component proportion;
and taking each unit text in the second text set as a text to be recorded to form a recording text set.
Optionally, the text to be recorded is a text subjected to a character replacement operation or not subjected to the character replacement operation, and the character replacement operation is an operation of replacing a rare word with a common word.
The embodiment of the present application further provides a voice recording apparatus, including:
the initial recording acquisition unit is used for recording the reading voice in the process of reading the target text by the user to obtain an initial recording;
the recording environment detection unit is used for detecting the recording environment of the initial recording; and/or, a recording quality detection unit for detecting the recording quality of the initial recording;
the initial recording judging unit is used for judging whether the detection result of the recording environment and/or the recording quality is qualified or not;
a target recording obtaining unit, configured to, if a detection result of the recording environment and/or the recording quality of the initial recording is qualified, take the initial recording as a target recording, and retain the target recording;
and the initial recording discarding unit is used for discarding the initial recording if the detection result of the recording environment and/or the recording quality of the initial recording is unqualified.
Optionally, the apparatus further comprises:
a re-recording prompt output unit for outputting a prompt for re-recording the target text;
and the recording step executing unit is used for triggering the initial recording obtaining unit to record the reading voice if the situation that the target text is read again by the user is detected after the prompt is output.
Optionally, the recording environment detecting unit includes:
a voice segment dividing subunit, configured to divide the initial recording into voice segments and non-voice segments;
the signal-to-noise ratio calculating subunit is used for calculating the signal-to-noise ratio of the voice segment;
correspondingly, the initial recording judgment unit comprises:
a first signal-to-noise ratio judging subunit, configured to judge whether a signal-to-noise ratio of the voice segment is greater than a preset first signal-to-noise ratio threshold;
the first qualification determining subunit is configured to determine that the detection result of the recording environment is qualified if the number of signal-to-noise ratios greater than the first signal-to-noise ratio threshold reaches a first preset ratio;
and the first disqualification determining subunit is used for determining that the detection result of the recording environment is disqualified if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion.
Optionally, the initial recording determining unit further includes:
a signal-to-noise ratio average value obtaining subunit, configured to obtain an average value of signal-to-noise ratios of at least one recorded sound recording before the initial sound recording, as a signal-to-noise ratio average value, if the initial sound recording is not a first sound recording of the current sound recording;
a second signal-to-noise ratio judging subunit, configured to judge whether an absolute value of a difference between the signal-to-noise ratio of the voice segment and the average value of the signal-to-noise ratio is greater than a preset second signal-to-noise ratio threshold;
the second disqualification determining subunit is used for executing the step of determining that the detection result of the recording environment is unqualified if the number of the signal-to-noise ratios which are larger than the second signal-to-noise ratio threshold reaches a second preset proportion;
and the second qualification determining subunit is used for executing the step of determining that the detection result of the recording environment is qualified if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold does not reach the second preset proportion.
Optionally, the recording quality detecting unit includes:
the recognition text acquisition subunit is used for carrying out voice recognition on the initial recording to obtain a recognition text;
a text correctness determining subunit, configured to determine a text correctness of the recognized text, where the text correctness is a ratio of a matching text to the target text, and the matching text is a text content in the recognized text that matches the target text;
correspondingly, the initial recording judgment unit comprises:
the text correct rate judging subunit is used for judging whether the text correct rate is greater than a preset correct rate threshold value;
a third qualification determining subunit, configured to determine that the detection result of the recording quality is qualified if the text correctness is greater than a preset correctness threshold;
and the fourth unqualified determination subunit is used for determining that the detection result of the recording quality is unqualified if the text accuracy is not greater than a preset accuracy threshold.
Optionally, the apparatus further comprises:
and the energy warping unit is used for performing energy warping on the initial recording so that the energy change between the initial recording and other recorded recordings tends to be stable.
Optionally, the energy warping unit includes:
the amplitude value determining subunit is used for determining the amplitude value of each sampling point in the initial sound recording and sequencing the amplitude values from large to small;
the average value operator unit is used for acquiring at least two amplitude values which are sequenced in the front and calculating the average value of the at least two amplitude values;
a first coefficient determining subunit, configured to obtain an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value if the average value is greater than or equal to the preset upper limit value of the amplitude value;
the second coefficient determining subunit is used for obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value if the average value is smaller than the preset lower limit value of the amplitude value;
and the energy warping subunit is used for performing energy warping on the initial recording by utilizing the energy warping coefficient.
Optionally, if the target text is a to-be-recorded text in a pre-constructed recording text set, the apparatus further includes:
the first text set forming unit is used for splitting the collected original text corpora into unit texts to form a first text set;
a second text set forming unit, configured to select a preset number of unit texts from the first text set to form a second text set, where a text component ratio of the second text set is equal to or approximate to that of the first text set;
and the recording text set forming unit is used for taking each unit text in the second text set as a text to be recorded to form a recording text set.
Optionally, the text to be recorded is a text subjected to a character replacement operation or not subjected to the character replacement operation, and the character replacement operation is an operation of replacing a rare word with a common word.
The embodiment of the present application further provides a voice recording apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any implementation manner of the voice recording method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is enabled to execute any implementation manner of the voice recording method.
The embodiment of the present application further provides a computer program product, which when running on a terminal device, enables the terminal device to execute any implementation manner of the voice recording method.
According to the voice recording method and device of the present application, the read-aloud speech is recorded while the user reads the target text to obtain an initial recording; the recording environment and/or the recording quality of the initial recording are then detected, and whether the detection result is qualified is judged. If it is qualified, the initial recording is taken as the target recording and retained; if it is unqualified, the initial recording is discarded. In this way, after the user's reading of the target text has been recorded, the recording environment and/or recording quality are checked, recordings that pass the check are retained as target recordings, recordings that fail are discarded, and the retained target recordings can be used to form the voice library, improving the quality of the recording data in the library.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a voice recording method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of constructing a recording text set according to an embodiment of the present application;
FIG. 3 is a schematic flowchart illustrating a recording environment for detecting an initial recording according to an embodiment of the present application;
FIG. 4 is a schematic flowchart illustrating a recording quality detection process for an initial recording according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a voice recording apparatus according to an embodiment of the present application;
fig. 6 is a schematic hardware structure diagram of a voice recording apparatus according to an embodiment of the present application.
Detailed Description
In some voice recording methods, a number of recording texts are selected from massive texts in various fields for the user to choose from and record autonomously, and the user's recording data is then used directly to build a voice library, so that the user's personalized voice can later be synthesized from the recording data in that library.
To address the above drawbacks, an embodiment of the present application provides a voice recording method. A target text is provided for the user to read aloud, the read-aloud speech is recorded while the user reads the target text to obtain an initial recording, and the recording environment and/or the recording quality of the initial recording are then detected. If the detection result of the recording environment and/or the recording quality is judged to be qualified, the initial recording may be kept as the target recording; if it is judged to be unqualified, the initial recording may be discarded.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a flow chart of a voice recording method provided in this embodiment is schematically illustrated, where the method includes the following steps:
s101: and recording the reading voice to obtain initial recording in the process of reading the target text by the user.
In this embodiment, any recording text that is voice-recorded using this embodiment is defined as a target text. The embodiment does not limit the language of the target text; for example, the target text may be a Chinese recording text or an English recording text. Nor does the embodiment limit the length of the target text; for example, the target text may be a single sentence or a paragraph of text.
In some voice recording methods, the recording texts offered to the user are selected in advance from massive texts in various fields and may contain rare words or highly specialized terms. For ordinary users, such words are often hard to pronounce and difficult to read, which makes recording difficult.
Therefore, to make voice recording easier for ordinary users and to reduce the occurrence of rare words or highly specialized terms, this embodiment selects a large number of recording texts from novel texts and/or story texts to form a recording text set, instead of selecting them from massive texts in all kinds of fields. That is, taking advantage of the readability of the language and the diversity of story scenes in novels and stories, a plurality of recording texts can be selected in advance from a large amount of novel and/or story text to form the recording text set. The embodiment does not, however, limit the source of the recording texts: besides novels and stories, texts may also be selected from other comparatively plain and easy-to-understand material, such as film and television dialogue or news scripts. The user can then choose any recording text from the recording text set as the target text and record it using this embodiment. The specific process of selecting recording texts to form the recording text set is described in the second embodiment.
It should be noted that, before the voice recording starts, either the user selects a target text from the recording text set, or the recording system automatically selects one and displays it to the user on the screen. The user's read-aloud speech is then recorded while the user reads the target text, yielding an initial recording. For example, while the user reads the target text "good morning" aloud, the read-aloud speech is recorded to obtain an initial recording whose content is "good morning". After the user's initial recording is obtained, it can be checked by the subsequent steps of this embodiment to determine whether it is qualified.
S102: the recording environment and/or recording quality of the initial recording is detected.
In this embodiment, after the initial recording of the user is acquired in step S101, in order to ensure the quality of the recorded voice, the acquired initial recording needs to be detected to determine whether it is a qualified recording.
Specifically, the detection of the initial recording consists of detecting its recording environment and/or its recording quality, so that initial recordings that do not meet the requirements, for example those made in a poor environment or with poor recording quality, can be filtered out, further ensuring the quality of the recorded speech.
The recording environment refers to the environment the user is in while the read-aloud speech is being recorded in step S101. For example, the user may record in a relatively quiet environment, such as at home, or in a relatively noisy one, such as at the roadside, and this difference strongly affects the quality of the recorded speech: speech recorded in a noisy environment has poor quality because of the high noise level, whereas speech recorded in a quiet environment has higher quality because there is no noise interference.
The recording quality mainly reflects whether the recorded speech is complete and consistent with the target text. For example, if the target text read aloud by the user is "tomorrow's temperature may be 30 degrees Celsius" but the obtained recording content is "tomorrow's temperature is 30 degrees Celsius", the recording is incomplete and does not correspond to the whole target sentence; the text was not read as written, which affects the quality of the recorded speech.
It should be noted that, for a specific process of detecting the recording environment and/or the recording quality of the initial recording, reference may be made to the following description of the fourth embodiment.
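As an illustrative aside, the recording-quality idea above (comparing the recognized content of the recording against the target text) can be expressed as a small "text correctness rate" calculation. The sketch below is only an assumption about how such a rate might be computed; the difflib-based alignment and the function name are not taken from the patent.

```python
# Illustrative sketch: a "text correctness rate" computed as the ratio of characters
# in the recognized text that match the target text to the length of the target text.
# The difflib-based alignment is an assumption, not a method specified by the patent.
from difflib import SequenceMatcher

def text_correctness_rate(recognized: str, target: str) -> float:
    matcher = SequenceMatcher(None, recognized, target)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(target) if target else 0.0

# Example from the paragraph above: an incomplete reading of the target sentence.
target = "tomorrow's temperature may be 30 degrees celsius"
recognized = "tomorrow's temperature is 30 degrees celsius"
print(round(text_correctness_rate(recognized, target), 2))  # prints a value around 0.9
```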
S103: and judging whether the detection result of the recording environment and/or the recording quality is qualified.
In this embodiment, after the recording environment and/or the recording quality of the initial recording have been detected in step S102, a detection result is obtained, and it can then be judged from this result whether the recording environment and/or the recording quality are qualified. For example, if the detection result shows that the recording environment of the initial recording is poor, that the noise is too high, or that the recording quality is too low, the detection result can be judged to be unqualified. It should be understood that the concrete conditions for judging whether a detection result is qualified can be set according to the actual situation, and this embodiment places no limitation on them.
Through step S103, if the detection result of the recording environment and/or the recording quality of the initial recording is judged to be qualified, step S104 is executed; if it is judged to be unqualified, step S105 is executed.
S104: and if so, taking the initial recording as the target recording and keeping the target recording.
In this embodiment, it may be determined whether the detection result of the recording environment and/or the recording quality of the initial recording of the user is qualified through step S103, and if it is determined that the detection result of the recording environment and/or the recording quality of the initial recording is qualified, the initial recording may be used as the target recording, and the target recording may be reserved.
The retained target recordings can then be used to build a personalized voice library, so that all of the user's recorded target recordings are contained in the library. Because the target recordings have a good recording environment and high recording quality, they can support more personalized applications, that is, the user's personalized voice can be synthesized in different scenarios such as story reading and conversational interaction. For example, a parent who works away from home or is on a business trip can record target recordings in advance to form a personalized voice library, and the recordings in that library can then be used to synthesize, inside a toy, a spoken story in the parent's voice, so that the child can hear at any time a story told in the voice the parent recorded earlier. Similarly, empty-nest elderly people can often hear their children's voices at home in the same way, a cancer patient can leave his or her own voice behind to comfort family members, and so on.
Further, in an optional implementation, after one target recording of the user has been successfully obtained in step S104, the user may continue to record the next target recording, that is, the processes of steps S101 to S104 are executed repeatedly, so that a voice library is built up once enough target recordings have been collected.
S105: if not, discarding the initial recording.
In this embodiment, step S103 judges whether the detection result of the recording environment and/or the recording quality of the initial recording is qualified. If the detection result is judged to be unqualified, the quality of the initial recording is poor and it does not meet the conditions for serving as a target recording in the voice library, so the initial recording may be discarded.
Further, an optional implementation manner is that, after discarding the initial recording with the unqualified detection result in step S105, in order to ensure the integrity of the voice library and improve the quality of the recorded voice, the embodiment may further include the following steps:
step A: and outputting a prompt for re-recording the target text.
In this implementation, to improve the quality of the recorded speech, enrich the voice library, and improve the completeness of the recording coverage in the library, a prompt for re-recording the target text may be output to the user after the initial recording with an unqualified detection result has been discarded in step S105. Here the target text is the text corresponding to the initial recording discarded in step S105. The re-recording prompt may be shown to the user as text or announced by voice; the concrete prompting manner can be set according to the actual situation and is not limited by this embodiment.
And B: after outputting the prompt for re-recording the target text, if it is detected that the user reads the target text again, continue to execute step S101.
In this implementation, after the prompt for re-recording the target text has been output to the user in step A, if it is detected that the user reads the target text again, the read-aloud speech of this new reading is recorded and detected using steps S101 to S103, and whether the re-recorded speech is qualified is judged from the detection result. If it is qualified, step S104 is executed and the re-recorded speech is kept as a target recording for the voice library; if the detection result is still unqualified, step S105 is executed, the re-recorded speech is discarded, and steps A and B are executed again. This is repeated until a target recording for the target text is obtained. Of course, a threshold may be set on the number of re-recordings of the same target text, for example 3 times: if the number of re-recordings reaches 3 and the initial recording made the third time is still unqualified, re-recording of that target text is abandoned.
In summary, in the voice recording method provided by this embodiment, the read-aloud speech is recorded while the user reads the target text to obtain an initial recording; the recording environment and/or the recording quality of the initial recording are then detected, and whether the detection result is qualified is judged. If it is qualified, the initial recording is kept as a target recording; if it is unqualified, the initial recording is discarded. In this way, after the user's reading of the target text has been recorded, the recording environment and/or recording quality are checked, recordings that pass the check are retained as target recordings, recordings that fail are discarded, and the retained target recordings can then be used to build a voice library, improving the quality of the recording data in the library.
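Purely as an illustration of the flow in Fig. 1 together with the optional re-recording steps A and B, the sketch below strings the steps together in Python. The recorder, detectors and prompt are hypothetical placeholders for the components described above, and the retry limit of 3 is the example value mentioned in this embodiment.

```python
# Sketch of steps S101-S105 plus the optional re-record prompt (steps A/B).
# record_reading(), detect_environment(), detect_quality() and prompt_rerecord()
# are hypothetical placeholders, not functions defined by the patent.
MAX_ATTEMPTS = 3  # example re-recording limit mentioned above

def acquire_target_recording(target_text, record_reading, detect_environment,
                             detect_quality, prompt_rerecord):
    for _ in range(MAX_ATTEMPTS):
        initial = record_reading(target_text)               # S101: record read-aloud speech
        env_ok = detect_environment(initial)                 # S102: recording environment
        quality_ok = detect_quality(initial, target_text)    # S102: recording quality
        if env_ok and quality_ok:                            # S103: judge detection result
            return initial                                   # S104: keep as target recording
        prompt_rerecord(target_text)                         # S105 + step A: discard and prompt
    return None                                              # give up after the retry limit
```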
Second embodiment
It should be noted that, to make voice recording easier for ordinary users, lower the recording difficulty, and reduce the number of recording texts containing rare words or highly specialized terms, this embodiment builds a recording text set in advance, before the voice library is constructed, from which the user selects target texts to read aloud and record. The set contains a number of texts to be recorded that are selected from a large amount of comparatively plain and easy-to-understand text (such as novel texts and/or story texts). It should be understood that the target text offered to the user may be any text to be recorded in this recording text set.
Next, the present embodiment will describe a construction process of the recording text set through the following steps S201 to S203.
Referring to fig. 2, a schematic flow chart of constructing a recording text set provided in this embodiment is shown, where the flow includes the following steps:
s201: and splitting the collected original text corpus into unit texts to form a first text set.
In this embodiment, the collected original text corpus is first split into unit texts, which form a first text set. The original text corpus is the text material from which recording texts are extracted, and a unit text is a whole-sentence text obtained by segmenting the original corpus. Specifically, the corpus can be segmented at punctuation marks that indicate the end of a sentence, such as full stops, exclamation marks and question marks. For example, these sentence-ending punctuation marks can be defined as special punctuation, and the text between every two adjacent special punctuation marks in the original corpus can be taken as a unit text; alternatively, a text span between any two adjacent punctuation marks that reaches a preset length (for example, more than 10 words) can also be taken as a unit text.
In one implementation, the original text corpus of this embodiment may include novel texts and/or story texts. These belong to the same broad category, often involve dialogue between characters, and provide a large amount of text data of many types, such as science fiction, adventure and mystery. Novel and story texts generally do not contain rare words or highly specialized content, which makes them convenient for ordinary users to read aloud.
Further, in order to ensure the validity and normalization of the unit text in the first text set and facilitate reading by the user, the unit text in the first text set needs to satisfy the following three points:
first, the unit text in the first text set may not contain special characters, for example, may not contain special characters such as japanese and greek letters, so that it may be ensured that each text to be recorded in the subsequent second text set does not contain special characters.
Secondly, the unit text in the first text set can not contain sensitive words, and the unit text is ensured to be in accordance with the legal specification, so that each text to be recorded in the subsequent second text set can be ensured not to contain the sensitive words.
Thirdly, in order to facilitate reading aloud by the user and improve user experience, the number of words of the unit text in the first text set cannot exceed a preset word number threshold, so that the phenomenon that the unit text is too long is avoided, and reading aloud by the user is not facilitated, for example, the word number threshold is set to be 500 words, so that the length of each unit text does not exceed 500 words; further, a word count threshold may also be set for each sentence in a unit text (i.e., a text sentence between two adjacent punctuations), for example, the word count threshold is set to 50 words, so that the length of each sentence cannot exceed 50 words. Therefore, when the user reads each unit text, reading obstacles can not occur due to overlong unit texts and/or overlong sentences, and user experience is improved.
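The following is a minimal sketch of step S201 combined with the three constraints above. It assumes plain text input; the regular expressions, the 500-word and 50-word limits, and the character ranges used to detect "special characters" are illustrative choices, not requirements taken from the patent.

```python
# Sketch of step S201: split the original corpus into unit texts at
# sentence-ending punctuation, then filter by the example constraints above.
import re

SENTENCE_END = r"(?<=[。！？.!?])"   # "special punctuation" marking the end of a sentence
MAX_UNIT_LEN = 500                   # example length limit for a unit text
MAX_CLAUSE_LEN = 50                  # example length limit for a single clause
SPECIAL_CHARS = re.compile(r"[\u3040-\u30ff\u0370-\u03ff]")  # e.g. Japanese kana, Greek letters

def split_into_unit_texts(corpus: str, sensitive_words=()) -> list[str]:
    units = []
    for unit in re.split(SENTENCE_END, corpus):
        unit = unit.strip()
        if not unit or len(unit) > MAX_UNIT_LEN:
            continue
        if SPECIAL_CHARS.search(unit):
            continue
        if any(word in unit for word in sensitive_words):
            continue
        clauses = re.split(r"[，,；;：:。！？.!?]", unit)
        if any(len(clause) > MAX_CLAUSE_LEN for clause in clauses):
            continue
        units.append(unit)
    return units
```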
S202: and selecting a preset number of unit texts from the first text set to form a second text set, wherein the second text set is equal to or similar to the first text set in text component proportion.
In this embodiment, after the first text set has been formed in step S201, a preset number of unit texts can be selected from it to form a second text set whose text component proportions are equal or similar to those of the first text set. Selecting a preset number of unit texts simply means selecting a specified number of them; for example, if the first text set contains 100 unit texts and the preset number is 10, then 10 unit texts are selected from those 100 to form the second text set.
It should be noted that the text component proportions of the second text set being equal or similar to those of the first text set means that each text component occurring in the unit texts occupies an equal or approximately equal proportion in the second text set as it does in the first text set. For example, suppose the first text set contains 10,000 different words, each accounting for a different proportion of the set; a common word such as "hello" may account for a relatively high proportion. After the proportion of each word in the first text set has been calculated, the words can be sorted in descending order of proportion and the top-ranked, high-proportion words identified. If, say, the common word "hello" accounts for 1% of the first text set, then when unit texts are selected from the first text set to form the second text set, "hello" should also account for 1%, or approximately 1% (for example 1.1%), of the second text set. In this way the texts in the second text set achieve high coverage of the texts in the first text set.
It should also be noted that when the preset number of unit texts is selected from the first text set to form the second text set, this embodiment selects the unit texts automatically. In an optional implementation, the automatic selection method is a phoneme coverage rate statistical method: when selecting unit texts from the first text set, it considers features such as initials, finals, syllables, written word forms, prosodic boundaries, sentence patterns and sentence lengths, and selects a preset number of unit texts whose text component proportions are equal or similar to those of the first text set. The automatically selected second text set therefore covers the first text set well, and hence also covers the collected original text corpus well.
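As a rough stand-in for the phoneme coverage rate statistical method described above (which also weighs initials, finals, syllables, prosodic boundaries and sentence patterns), the sketch below greedily selects unit texts so that the character distribution of the selected subset stays close to that of the first text set. It is a deliberate simplification for illustration only.

```python
# Simplified sketch of step S202: greedily pick unit texts whose addition keeps the
# selected subset's character distribution close to that of the first text set.
# The real method described above considers far more features; this covers characters only.
from collections import Counter

def select_second_text_set(first_set: list[str], preset_number: int) -> list[str]:
    target = Counter("".join(first_set))          # component proportions of the first set
    total = sum(target.values())
    selected, selected_counts = [], Counter()

    def distance(counts: Counter) -> float:
        n = sum(counts.values()) or 1
        return sum(abs(counts[ch] / n - target[ch] / total) for ch in target)

    remaining = list(first_set)
    for _ in range(min(preset_number, len(remaining))):
        best = min(remaining, key=lambda t: distance(selected_counts + Counter(t)))
        selected.append(best)
        selected_counts += Counter(best)
        remaining.remove(best)
    return selected
```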
S203: and taking each unit text in the second text set as a text to be recorded to form a recording text set.
In this embodiment, after the second text set has been formed in step S202, each unit text in it can be used as a text to be recorded, that is, a recording text that the user can select and read aloud. To ensure the validity and normalization of the texts to be recorded, it should be ensured that they contain no special characters and/or sensitive words; in other words, special characters and sensitive words in the texts to be recorded need to be removed by manual or automatic screening.
In addition, in one implementation, a text to be recorded may be a text that has undergone a character replacement operation or one that has not. The character replacement operation replaces rare words with common words; for example, a rare character "Di" in a text to be recorded can be replaced with the common character "di", producing a text to be recorded that has undergone the character replacement operation. Correspondingly, a text that has not undergone the character replacement operation is a unit text in the second set that contains no rare words and therefore needs no replacement.
Furthermore, the texts to be recorded in the second text set, containing no special characters and/or sensitive words and with or without the character replacement operation applied, can be used to form the recording text set, from which the user selects target texts for personalized voice recording (a small sketch of the replacement operation is given below).
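A minimal sketch of the character replacement operation: a lookup table from rare words or characters to common equivalents. The particular mappings below are invented examples, not taken from the patent.

```python
# Sketch of the character replacement operation: replace rare words/characters with
# common equivalents via a lookup table. The mappings here are illustrative only.
RARE_TO_COMMON = {"囹圄": "监狱", "觊觎": "贪图"}  # hypothetical rare-to-common replacements

def replace_rare_words(text: str) -> str:
    for rare, common in RARE_TO_COMMON.items():
        text = text.replace(rare, common)
    return text
```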
Thus, in this embodiment, a number of texts to be recorded are selected in advance from a large amount of novel and story text to build the recording text set, and the user then selects target texts from this set to read aloud and record. Because none of the texts to be recorded contains special characters, sensitive words or rare words, they are easy to understand and to read aloud, which lowers the recording difficulty and improves the quality of the recording data to some extent.
Third embodiment
It should be understood that, after the initial recording has been obtained in step S101 of the first embodiment, it must undergo quality detection to ensure the quality of the recorded speech and to decide whether it is qualified. However, because different initial recordings may be made with different recording devices, the recording data may come in various formats and its quality may vary. Therefore, before the recording environment and/or the recording quality of the initial recording are detected, the initial recording may be preprocessed; the subsequent detection of the recording environment and recording quality is then carried out, ensuring that the recording data finally obtained is high-quality data that meets the system requirements.
An optional implementation manner of preprocessing the initial recording is to perform format normalization on the initial recording, so that the format of the initial recording is a preset audio format.
In this implementation, to facilitate the subsequent detection of the recording environment and recording quality and to ensure that a high-quality signal meeting the system requirements is obtained, the initial recordings may be converted into a uniform format in advance; for example, all obtained initial recordings may be converted into wav audio files with a sampling rate of 16 kHz and a sampling precision of 2 bytes (16 bits). One possible way of doing this is sketched below.
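A minimal sketch of such a format conversion, assuming the ffmpeg command-line tool is available; converting to mono is an additional assumption not stated in the text.

```python
# Sketch of format normalization: convert an arbitrary input recording to a
# 16 kHz, 16-bit PCM wav file. Assumes ffmpeg is installed; mono output is an assumption.
import subprocess

def normalize_format(input_path: str, output_path: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path,
         "-ar", "16000",       # 16 kHz sampling rate
         "-ac", "1",           # mono (assumption)
         "-c:a", "pcm_s16le",  # 16-bit (2-byte) PCM samples
         output_path],
        check=True,
    )
```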
In addition, another alternative implementation of preprocessing the initial recording is to perform energy warping on the initial recording to make the energy variation between the initial recording and other recorded recordings tend to be smooth.
In this implementation, to prevent large fluctuations of recording energy within and between recorded sentences from causing energy jumps in the speech synthesized by the system, and to keep the synthesized speech stable, energy normalization may be performed on the initial recording in advance so that the energy variation between the initial recording and the other recorded initial recordings tends to be smooth.
Next, the present embodiment will describe a specific process of performing energy normalization on the initial recording in this implementation manner through the following steps C to G.
And C: and determining the amplitude value of each sampling point in the initial sound recording, and sequencing the amplitude values from large to small.
In this embodiment, to perform energy normalization on the initial recording data, the amplitude value of each sampling point in the initial recording is first determined, and the amplitude values are then sorted in descending order for use in the subsequent steps.
Step D: at least two amplitude values ordered in the front are acquired, and the average of the at least two amplitude values is calculated.
In this embodiment, after the amplitude values have been sorted in descending order in step C, the amplitude values of at least two of the top-ranked sampling points can be taken and their average calculated; this average is denoted data_max. Specifically, the number of sampling points used for the average is calculated as
n = N * r
where N is the total number of sampling points in the initial recording, r is the picking ratio (typically between 5% and 10%), and n is the number of sampling points over which the average is computed.
Further, the average of the amplitude values of these n sampling points is calculated as
data_max = (1/n) * sum_{i=1..n} sort(abs(data_i))
where data_i is the sample value of the ith sampling point in the initial recording data, abs denotes the amplitude (absolute value), and sort denotes sorting the amplitude values in descending order, so that the sum runs over the n largest amplitude values.
Step E: and if the average value is greater than or equal to the preset upper limit value of the amplitude value, obtaining an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value.
Step F: and if the average value is smaller than the preset lower limit value of the amplitude value, obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value.
In this embodiment, after the average value data_max of the top-ranked amplitude values of the initial recording has been obtained in step D, the energy warping coefficient can be calculated by comparing data_max with the preset upper limit value and lower limit value of the amplitude. The specific calculation formula is:
Rate = high / data_max, if data_max >= high
Rate = 1, if low <= data_max < high
Rate = low / data_max, if data_max < low
where low and high respectively denote the preset lower limit value and upper limit value of the amplitude, and Rate denotes the energy warping coefficient.
According to the above formula, if the average value data_max is greater than or equal to the preset upper limit value high, an energy warping coefficient high/data_max smaller than 1 is obtained; if the average value data_max is smaller than the preset lower limit value low, an energy warping coefficient low/data_max greater than 1 is obtained.
Step G: and carrying out energy normalization on the initial recording by utilizing the energy normalization coefficient.
In this embodiment, after the energy warping coefficient Rate is obtained according to the step E or the step F, the energy warping coefficient Rate may be further used to perform energy warping on the initial recording, and a specific calculation formula is as follows:
data_norm = data_i * Rate
where data_norm denotes the energy-warped amplitude value of the ith sampling point data_i in the initial recording data.
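Steps C to G above can be put together as in the following sketch. numpy is used for convenience, and the default picking ratio of 5% is an example value within the 5%-10% range mentioned above; low and high are the preset amplitude limits.

```python
# Sketch of energy warping (steps C-G): scale the recording so that the mean of its
# largest absolute amplitudes falls between the preset lower and upper limits.
import numpy as np

def energy_warp(samples: np.ndarray, low: float, high: float, pick_ratio: float = 0.05):
    """samples: 1-D array of sample values; low/high: preset amplitude limits."""
    n = max(1, int(np.ceil(len(samples) * pick_ratio)))      # number of points averaged
    top = np.sort(np.abs(samples))[::-1][:n]                  # steps C-D: top-n amplitudes
    data_max = float(top.mean())
    if data_max >= high:                                      # step E
        rate = high / data_max
    elif data_max < low:                                      # step F
        rate = low / data_max
    else:
        rate = 1.0                                            # no warping needed
    return samples * rate                                     # step G: data_norm = data_i * Rate
```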
In summary, normalizing the format of the initial recordings in advance converts every initial recording into a uniform format, which facilitates the detection of the recording environment and recording quality; and performing energy warping on the initial recordings in advance prevents excessive energy fluctuation between them, so that when the initial recordings are kept as target recordings to form the voice library, the energy of speech synthesized from that library remains stable.
Fourth embodiment
It should be understood that, after the initial recording obtained in step S101 of the first embodiment has been preprocessed, for example by the format normalization and energy normalization of the third embodiment, step S102 of the first embodiment is executed to detect the initial recording and thereby determine whether it is qualified, so as to ensure the quality of the recorded speech.
Next, the present embodiment will describe a specific implementation manner of detecting the recording environment of the initial recording in step S102 in the first embodiment through steps S301 to S302 described below.
Referring to fig. 3, a schematic flow chart of detecting a recording environment of an initial recording according to the present embodiment is shown, where the flow includes the following steps:
s301: the initial audio recording is divided into individual speech segments and individual non-speech segments.
In this embodiment, to ensure the quality and stability of the recording data, the recording environment of the initial recording is detected. First, the initial recording must be divided into speech segments and non-speech segments using a suitable segmentation method; for example, an endpoint detection technique that analyses the short-time energy and the short-time zero-crossing rate of the initial recording can be used to divide it into speech segments and non-speech segments and to mark the start and end positions of each segment. A simplified sketch of such a segmentation is given below.
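The sketch below is a simplified endpoint detection based only on short-time energy; the frame length, hop size and threshold factor are illustrative assumptions, and a fuller implementation would also use the short-time zero-crossing rate mentioned above.

```python
# Simplified endpoint detection: mark frames as speech when their short-time energy
# exceeds a threshold derived from the quietest frames, then merge runs of frames
# into speech / non-speech segments. Frame size, hop and threshold factor are assumptions.
import numpy as np

def segment_speech(samples: np.ndarray, frame_len: int = 400, hop: int = 160,
                   threshold_factor: float = 3.0):
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    energies = np.array([float(np.mean(f.astype(np.float64) ** 2)) for f in frames])
    noise_floor = np.mean(np.sort(energies)[:max(1, len(energies) // 10)])
    is_speech = energies > threshold_factor * noise_floor

    segments = []          # list of (start_sample, end_sample, "speech" | "non-speech")
    start = 0
    for i in range(1, len(is_speech) + 1):
        if i == len(is_speech) or is_speech[i] != is_speech[start]:
            label = "speech" if is_speech[start] else "non-speech"
            segments.append((start * hop, (i - 1) * hop + frame_len, label))
            start = i
    return segments
```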
S302: and calculating the signal-to-noise ratio of the voice segment.
In this embodiment, after the initial recording is divided into voice segments and non-voice segments in step S301, the signal-to-noise ratio of each voice segment can be calculated.
Specifically, each voice segment may be selected according to the start and stop positions marked in step S301, and the signal-to-noise ratio of each voice segment may be calculated. The signal-to-noise ratio is a parameter describing the proportion of voice components to noise components in the initial recording, and it reflects, to some extent, the quality of the recording environment of the initial recording.
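One common way to realize such a per-segment signal-to-noise ratio is to treat the neighbouring non-voice segments as a noise-only estimate and compare powers. The sketch below follows that idea; the noise-floor subtraction, the dB scale, and the eps guard are assumptions, since the patent only states that the SNR describes the proportion of voice to noise components.

import numpy as np
from typing import List

def segment_power(seg: np.ndarray) -> float:
    """Average power (mean squared amplitude) of one segment."""
    return float(np.mean(seg.astype(np.float64) ** 2))

def voice_segment_snr(voice_seg: np.ndarray, noise_segs: List[np.ndarray], eps: float = 1e-12) -> float:
    """SNR of one voice segment in dB, estimating the noise power from non-voice segments."""
    noise_power = float(np.mean([segment_power(n) for n in noise_segs])) if noise_segs else eps
    # Subtract the noise floor from the voice-segment power to approximate the clean-speech power.
    signal_power = max(segment_power(voice_seg) - noise_power, eps)
    return 10.0 * np.log10(signal_power / max(noise_power, eps))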
After calculating the signal-to-noise ratio of the initial recording voice segment, further, the specific implementation process of "determining whether the detection result of the recording environment is qualified" in step S103 of the first embodiment includes:
step a: and judging whether the signal-to-noise ratio of the voice fragment of the initial recording is greater than a preset first signal-to-noise ratio threshold value.
In this implementation manner, after the signal-to-noise ratio of each voice segment of the initial recording is calculated in the step S302, whether the inspection result of the recording environment of the initial recording is qualified can be determined by determining whether the signal-to-noise ratio of each voice segment is greater than a first signal-to-noise ratio threshold preset by the recording system.
It should be noted that, if the calculated signal-to-noise ratios of the voice segments of the initial recording are greater than the first signal-to-noise ratio threshold preset by the system (specifically, if the signal-to-noise ratios of all or most of the voice segments exceed the threshold), the recording environment of the initial recording meets the system requirement and step b may be executed; otherwise, the recording environment does not meet the system requirement and step c may be executed.
Step b: and if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold reaches a first preset proportion, determining that the detection result of the recording environment of the initial recording is qualified.
Step c: and if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion, determining that the detection result of the recording environment of the initial recording is unqualified.
In general, the first preset proportion may be a value greater than or equal to 50%, and it refers to the proportion of the number of signal-to-noise ratios greater than the first signal-to-noise ratio threshold to the total number of signal-to-noise ratios (i.e., the number of voice segments of the initial recording).
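Steps a to c thus reduce to a simple proportion test over the per-segment signal-to-noise ratios, sketched below. The function name and the 50% default are illustrative, and treating an empty segment list as unqualified is an added assumption.

from typing import List

def recording_environment_qualified(segment_snrs: List[float],
                                    first_snr_threshold: float,
                                    first_proportion: float = 0.5) -> bool:
    """Steps a to c: the environment check passes when enough voice segments exceed the first SNR threshold."""
    if not segment_snrs:
        return False
    above = sum(1 for snr in segment_snrs if snr > first_snr_threshold)
    return above / len(segment_snrs) >= first_proportion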
It should be noted that, if it is determined through the step c that the detection result of the recording environment of the initial recording is not qualified, an optional implementation manner is that, in order to ensure the integrity of the voice library and improve the quality of the recorded voice, the target text corresponding to the initial recording may be re-recorded with the steps a to B in the first embodiment.
Further, in step b, if the number of the signal-to-noise ratios greater than the first signal-to-noise ratio threshold reaches a first preset ratio, before determining that the detection result of the recording environment is qualified, the embodiment may further include the following steps:
step d: and if the initial recording is not the first recording of the current recording, acquiring the average value of the signal-to-noise ratio of at least one recorded recording before the initial recording as the average value of the signal-to-noise ratio.
In this embodiment, if the obtained initial recording is the first recording of the current recording, the recording quality of the initial recording may be continuously detected, and the specific detection method may be referred to in the following related description; if the obtained initial recording is not the first recording of the current recording, the average value of the signal-to-noise ratio of at least one recorded recording before the initial recording needs to be obtained as the average value of the signal-to-noise ratio.
Specifically, if the obtained initial recording is not the first recording of the current recording, the average of the signal-to-noise ratios of n (n ≧ 2) recorded recordings before the initial recording in the speech sound library needs to be calculated as the average of the signal-to-noise ratios, and the specific calculation formula is as follows:
SNR_mean = (SNR_1 + SNR_2 + … + SNR_n) / n
wherein SNR_mean represents the mean signal-to-noise ratio of the n recorded recordings preceding the initial recording in the voice library, and SNR_m represents the signal-to-noise ratio of the mth recording among those n recorded recordings.
Then, the absolute value of the difference between the signal-to-noise ratio of a voice segment of the initial recording and the mean signal-to-noise ratio SNR_mean of the n recorded recordings preceding the initial recording in the voice library is calculated, and the specific calculation formula is as follows:
ΔSNR = abs(SNR_cur - SNR_mean)
wherein ΔSNR represents the absolute value of the difference between SNR_cur, the signal-to-noise ratio of a voice segment of the current initial recording, and SNR_mean, the mean signal-to-noise ratio of the n recorded recordings preceding the initial recording in the voice library. ΔSNR reflects the change in the recording environment of the initial recording; a larger value indicates a larger difference between the current and previous recording environments.
Step e: and judging whether the absolute value of the difference between the signal-to-noise ratio of the voice segment and the mean value of the signal-to-noise ratio is larger than a preset second signal-to-noise ratio threshold value or not.
In this embodiment, after the Δ SNR corresponding to each voice segment is calculated in step d, it can be determined whether the Δ SNR corresponding to each voice segment is greater than a second SNR threshold preset by the system.
Step f: and if the number of the signal-to-noise ratios greater than the second signal-to-noise ratio threshold reaches a second preset proportion, determining that the detection result of the recording environment of the initial recording is unqualified.
In general, the second preset proportion may be a value greater than or equal to 50%, and it refers to the proportion of the number of ΔSNR values greater than the second signal-to-noise ratio threshold to the total number of ΔSNR values (i.e., the number of voice segments of the initial recording).
In this embodiment, if at least the second preset proportion of the ΔSNR values are greater than the preset second signal-to-noise ratio threshold, it indicates that the recording environment of the initial recording has changed and differs greatly from the previous recording environment, and the detection result of the recording environment of the initial recording is determined to be unqualified. At this time, in order to ensure the integrity of the voice library and improve the quality of the recorded voice, one implementation manner is to re-record the target text corresponding to the initial recording by using the steps a to B in the first embodiment.
Step g: and if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold value does not reach a second preset proportion, determining that the detection result of the recording environment of the initial recording is qualified.
In this embodiment, if fewer than the second preset proportion of the ΔSNR values are greater than the preset second signal-to-noise ratio threshold, it indicates that although the recording environment of the initial recording may have changed, the change is within the range acceptable to the system; the recording quality of the initial recording can then be further detected, and the specific detection method may be referred to in the following description.
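Taken together, steps d to g amount to the check sketched below. It is an illustrative reading of the description: the per-segment treatment of ΔSNR, the 50% default, and the early return for a first recording are assumptions rather than details stated verbatim in the patent.

from typing import List

def environment_change_unqualified(current_segment_snrs: List[float],
                                   previous_recording_snrs: List[float],
                                   second_snr_threshold: float,
                                   second_proportion: float = 0.5) -> bool:
    """Steps d to g: unqualified when too many segments deviate strongly from the running SNR mean."""
    if not previous_recording_snrs or not current_segment_snrs:
        return False                                                        # nothing to compare against
    snr_mean = sum(previous_recording_snrs) / len(previous_recording_snrs)  # SNR_mean
    deltas = [abs(snr - snr_mean) for snr in current_segment_snrs]          # ΔSNR per voice segment
    exceeded = sum(1 for d in deltas if d > second_snr_threshold)
    return exceeded / len(current_segment_snrs) >= second_proportion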
Therefore, the detection of the recording environment of the initial voice is helpful for judging whether the recording environment and the possible changes thereof meet the recording requirements of the voice library, so that the recording data quality and the recording stability are ensured.
It should be noted that the recordings in the voice library are recorded by the user. Although the system provides the recording text, the user may still read content that deviates from the text or fail to record a complete sentence, which seriously damages the coverage integrity of the voice library and leads to unsatisfactory pronunciation of some recordings during personalized speech synthesis. Therefore, after an initial recording of the user is obtained, a quality check of the initial recording is required.
Next, the present embodiment will describe a specific implementation manner of detecting the recording quality of the initial recording in step S102 of the first embodiment through steps S401 to S402 described below.
Referring to fig. 4, a schematic diagram of a process for detecting the recording quality of an initial recording according to this embodiment is shown, where the process includes the following steps:
S401: And carrying out voice recognition on the initial recording to obtain a recognition text.
In this embodiment, in order to perform quality detection on an initial sound recording, a speech recognition algorithm is first used to perform speech recognition on the initial sound recording, so as to obtain a recognition text corresponding to the initial speech.
S402: and determining the text correctness of the recognized text, wherein the text correctness is the ratio of the matched text to the target text, and the matched text is the text content matched with the target text in the recognized text.
In this embodiment, after the recognition text corresponding to the initial recording is obtained in step S401, the recognition text may be compared with the target text that the user selected from the recording text set and read aloud to form the initial recording. Based on this comparison, the text correctness of the recognition text is calculated, where the text correctness is the ratio of the matching text to the target text, and the matching text is the text content in the recognition text that matches the target text.
By way of example: assuming that the target text selected by the user from the recording text set contains 20 words and the recognition text corresponding to the initial recording obtained in step S401 contains 25 words, of which 17 words are consistent with the words contained in the target text, then the matching text is 17 words; accordingly, the ratio of the matching text to the target text is 85% (i.e., 17/20 × 100%), that is, the text correctness of the recognition text is 85%.
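The patent defines the text correctness only as the ratio of the matching text to the target text and does not prescribe how the matching text is found; one plausible choice is an alignment such as a longest common subsequence between the recognition result and the target text. The sketch below uses that assumption, with character-level tokens (word-level tokens work the same way).

def text_correctness(recognized: str, target: str) -> float:
    """Ratio of matched text to target text; matching here is a character-level longest common subsequence."""
    rec, tgt = list(recognized), list(target)
    # Dynamic-programming LCS: dp[i][j] = length of the longest common subsequence of rec[:i] and tgt[:j].
    dp = [[0] * (len(tgt) + 1) for _ in range(len(rec) + 1)]
    for i, r in enumerate(rec, 1):
        for j, t in enumerate(tgt, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == t else max(dp[i - 1][j], dp[i][j - 1])
    matched = dp[len(rec)][len(tgt)]          # size of the matching text
    return matched / len(tgt) if tgt else 0.0

# Mirroring the example above: 17 matched words out of a 20-word target text gives 0.85.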
After determining the text accuracy of the identification text corresponding to the initial recording, further, the specific implementation process of "determining whether the detection result of the recording quality is qualified" in step S103 in the first embodiment includes:
judging whether the text accuracy is greater than a preset accuracy threshold; if so, determining that the detection result of the recording quality is qualified; and if not, determining that the detection result of the recording quality is unqualified.
In this embodiment, after the text accuracy of the identification text corresponding to the initial recording is determined in step S402, it may be determined whether the result of checking the recording quality of the initial recording is qualified by determining whether the text accuracy is greater than an accuracy threshold preset by the recording system. If the text accuracy is judged to be larger than the accuracy threshold preset by the recording system, the detection result of the recording quality of the initial recording can be determined to be qualified, and the initial recording can be further added into a voice library as a target recording; correspondingly, if the text accuracy is not greater than the accuracy threshold preset by the recording system, it may be determined that the detection result of the recording quality of the initial recording is not qualified, and at this time, in order to ensure the integrity of the voice library and improve the quality of the recorded voice, an implementation manner is that the target text corresponding to the initial recording may be re-recorded with steps a to B in the first embodiment.
The accuracy threshold preset by the system may be an average recognition accuracy of the recognition system in the novel text field, or may be set according to experience and actual conditions, which is not limited in this embodiment.
It should be further noted that, when detecting the initial recording, the present application may first detect the recording environment of the initial recording, and then detect the recording quality of the initial recording, or may first detect the recording quality of the initial recording, and then detect the recording environment of the initial recording, or only perform one of the checks according to actual needs, and this embodiment does not limit the order of detection of the two.
In conclusion, detecting the recording environment and/or the recording quality of the initial recording guarantees the quality and stability of the recording data. Meanwhile, only initial recordings that pass the detection are retained as target recordings to form the voice library, which improves the quality of the recording data in the voice library.
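Putting the environment check and the quality check together, the per-recording decision described above can be summarized by the sketch below. The callable parameters and the list-backed voice library are assumptions introduced purely for illustration; they stand in for the detection routines sketched earlier.

from typing import Callable

def handle_initial_recording(initial_recording,
                             target_text: str,
                             env_check: Callable[[object], bool],
                             quality_check: Callable[[object, str], bool],
                             voice_library: list) -> bool:
    """Run both detections, then keep the recording as a target recording or discard it."""
    if env_check(initial_recording) and quality_check(initial_recording, target_text):
        voice_library.append((target_text, initial_recording))   # keep as target recording
        return True
    return False   # unqualified: discard; the caller may prompt the user to re-record target_text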
Fifth embodiment
In this embodiment, a voice recording apparatus will be described, and please refer to the above method embodiment for related contents. Referring to fig. 5, a schematic diagram of a voice recording apparatus provided in this embodiment is shown, where the apparatus 500 includes:
an initial recording obtaining unit 501, configured to record the reading voice in the process of reading the target text by the user to obtain an initial recording;
a recording environment detection unit 502, configured to detect a recording environment of the initial recording; and/or, a recording quality detection unit 503, configured to detect the recording quality of the initial recording;
an initial recording determining unit 504 configured to determine whether the detection result of the recording environment and/or the recording quality is qualified;
a target recording obtaining unit 505, configured to, if a detection result of the recording environment and/or the recording quality of the initial recording is qualified, take the initial recording as a target recording, and keep the target recording;
an initial recording discarding unit 506, configured to discard the initial recording if the detection result of the recording environment and/or the recording quality of the initial recording is not qualified.
In an implementation manner of this embodiment, the apparatus 500 further includes:
a re-recording prompt output unit for outputting a prompt for re-recording the target text;
and the recording step executing unit is used for triggering the initial recording obtaining unit to record the reading voice if the situation that the target text is read again by the user is detected after the prompt is output.
In an implementation manner of this embodiment, the recording environment detecting unit 502 includes:
a voice segment dividing subunit, configured to divide the initial recording into voice segments and non-voice segments;
the signal-to-noise ratio calculating subunit is used for calculating the signal-to-noise ratio of the voice segment;
accordingly, the initial recording determining unit 504 includes:
a first signal-to-noise ratio judging subunit, configured to judge whether a signal-to-noise ratio of the voice segment is greater than a preset first signal-to-noise ratio threshold;
the first qualification determining subunit is configured to determine that the detection result of the recording environment is qualified if the number of signal-to-noise ratios greater than the first signal-to-noise ratio threshold reaches a first preset ratio;
and the first disqualification determining subunit is used for determining that the detection result of the recording environment is disqualified if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion.
In an implementation manner of this embodiment, the initial recording determining unit 504 further includes:
a signal-to-noise ratio average value obtaining subunit, configured to obtain an average value of signal-to-noise ratios of at least one recorded sound recording before the initial sound recording, as a signal-to-noise ratio average value, if the initial sound recording is not a first sound recording of the current sound recording;
a second signal-to-noise ratio judging subunit, configured to judge whether an absolute value of a difference between the signal-to-noise ratio of the voice segment and the average value of the signal-to-noise ratio is greater than a preset second signal-to-noise ratio threshold;
the second qualification subunit is used for executing the step of determining that the detection result of the recording environment is unqualified if the number of the signal-to-noise ratios which are larger than the second signal-to-noise ratio threshold reaches a second preset proportion;
and the second unqualified determination subunit is used for executing the step of determining that the detection result of the recording environment is qualified if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold value does not reach a second preset proportion.
In an implementation manner of this embodiment, the recording quality detecting unit 503 includes:
the recognition text acquisition subunit is used for carrying out voice recognition on the initial recording to obtain a recognition text;
a text correctness determining subunit, configured to determine a text correctness of the recognized text, where the text correctness is a ratio of a matching text to the target text, and the matching text is a text content in the recognized text that matches the target text;
accordingly, the initial recording determining unit 504 includes:
the text correct rate judging subunit is used for judging whether the text correct rate is greater than a preset correct rate threshold value;
a third qualification determining subunit, configured to determine that the detection result of the recording quality is qualified if the text correctness is greater than a preset correctness threshold;
and the fourth unqualified determination subunit is used for determining that the detection result of the recording quality is unqualified if the text accuracy is not greater than a preset accuracy threshold.
In an implementation manner of this embodiment, the apparatus 500 further includes:
and the energy warping unit is used for performing energy warping on the initial recording so that the energy change between the initial recording and other recorded recordings tends to be stable.
In one implementation manner of this embodiment, the energy normalization unit includes:
the amplitude value determining subunit is used for determining the amplitude value of each sampling point in the initial sound recording and sequencing the amplitude values from large to small;
the average value operator unit is used for acquiring at least two amplitude values which are sequenced in the front and calculating the average value of the at least two amplitude values;
a first coefficient determining subunit, configured to obtain an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value if the average value is greater than or equal to the preset upper limit value of the amplitude value;
the second coefficient determining subunit is used for obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value if the average value is smaller than the preset lower limit value of the amplitude value;
and the energy warping subunit is used for performing energy warping on the initial recording by utilizing the energy warping coefficient.
In an implementation manner of this embodiment, if the target text is a to-be-recorded text in a pre-constructed recording text set, the apparatus 500 further includes:
the first text set forming unit is used for splitting the collected original text corpora into unit texts to form a first text set;
a second text set forming unit, configured to select a preset number of unit texts from the first text set to form a second text set, where a text component ratio of the second text set is equal to or approximate to that of the first text set;
and the recording text set forming unit is used for taking each unit text in the second text set as a text to be recorded to form a recording text set.
In one implementation manner of this embodiment, the text to be recorded is a text that is subjected to a character replacement operation or is not subjected to the character replacement operation, and the character replacement operation is an operation of replacing uncommon words with common words.
Sixth embodiment
In this embodiment, another voice recording apparatus will be described, and for related contents, please refer to the above method embodiment.
Referring to fig. 6, a schematic diagram of a hardware structure of a voice recording apparatus provided in this embodiment, the voice recording apparatus 600 includes a memory 601 and a receiver 602, and a processor 603 connected to the memory 601 and the receiver 602 respectively, where the memory 601 is configured to store a set of program instructions, and the processor 603 is configured to call the program instructions stored in the memory 601 to perform the following operations:
recording the reading voice in the process of reading the target text by the user to obtain an initial recording;
detecting the recording environment and/or the recording quality of the initial recording;
judging whether the detection result of the recording environment and/or the recording quality is qualified or not;
if so, taking the initial recording as a target recording, and keeping the target recording;
and if not, discarding the initial recording.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
outputting a prompt for re-recording the target text;
and after the prompt is output, if the user is detected to read the target text again, continuing to execute the step of recording the read voice.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
segmenting the initial recording into speech segments and non-speech segments;
calculating the signal-to-noise ratio of the voice segment;
judging whether the signal-to-noise ratio of the voice segment is greater than a preset first signal-to-noise ratio threshold value or not;
if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold reaches a first preset proportion, determining that the detection result of the recording environment is qualified;
and if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion, determining that the detection result of the recording environment is unqualified.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
if the initial recording is not the first recording of the current recording, acquiring the average value of the signal-to-noise ratio of at least one recorded recording before the initial recording as the average value of the signal-to-noise ratio;
judging whether the absolute value of the difference between the signal-to-noise ratio of the voice segment and the mean value of the signal-to-noise ratio is larger than a preset second signal-to-noise ratio threshold value or not;
if the number of the signal-to-noise ratios which are larger than the second signal-to-noise ratio threshold reaches a second preset proportion, executing the step of determining that the detection result of the recording environment is unqualified;
and if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold value does not reach a second preset proportion, executing the step of determining that the detection result of the recording environment is qualified.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
carrying out voice recognition on the initial recording to obtain a recognition text;
determining a text correctness rate of the recognized text, wherein the text correctness rate is a ratio of matched text to the target text, and the matched text is text content matched with the target text in the recognized text;
judging whether the text accuracy is greater than a preset accuracy threshold value or not;
if so, determining that the detection result of the recording quality is qualified;
and if not, determining that the detection result of the recording quality is unqualified.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
and performing energy normalization on the initial recording to enable energy variation between the initial recording and other recorded recordings to tend to be smooth.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
determining the amplitude value of each sampling point in the initial sound recording, and sequencing the amplitude values from large to small;
acquiring at least two amplitude values sequenced in the front, and calculating the average value of the at least two amplitude values;
if the average value is greater than or equal to a preset upper limit value of the amplitude value, obtaining an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value;
if the average value is smaller than a preset lower limit value of the amplitude value, obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value;
and performing energy warping on the initial recording by using the energy warping coefficient.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
splitting the collected original text corpus into unit texts to form a first text set;
selecting a preset number of unit texts from the first text set to form a second text set, wherein the second text set is equal to or similar to the first text set in terms of text component proportion;
and taking each unit text in the second text set as a text to be recorded to form a recording text set.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
the text to be recorded is a text subjected to character replacement operation or not subjected to the character replacement operation, and the character replacement operation is an operation of replacing rare words with common words.
In some embodiments, the processor 603 may be a Central Processing Unit (CPU), the memory 601 may be a Random Access Memory (RAM) type internal memory, and the receiver 602 may include a common physical interface, which may be an Ethernet interface or an Asynchronous Transfer Mode (ATM) interface. The processor 603, receiver 602, and memory 601 may be integrated into one or more separate circuits or hardware, such as an Application Specific Integrated Circuit (ASIC).
Further, this embodiment also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation manner of the voice recording method.
Still further, this embodiment further provides a computer program product, and when the computer program product runs on a terminal device, the terminal device is enabled to execute any implementation manner of the voice recording method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (18)

1. A method for voice recording, comprising:
recording the reading voice in the process of reading the target text by the user to obtain an initial recording;
detecting the recording environment and the recording quality of the initial recording; detecting the recording environment of the initial recording, including detecting the signal-to-noise ratio of the voice fragment of the initial recording, and detecting the difference value between the signal-to-noise ratio of the voice fragment of the initial recording and the mean value of the signal-to-noise ratios of the recorded recordings, wherein the difference value is used for reflecting the change condition of the recording environment;
judging whether the detection results of the recording environment and the recording quality are qualified or not;
if so, taking the initial recording as a target recording, and keeping the target recording;
and if not, discarding the initial recording.
2. The method of claim 1, wherein after discarding the initial audio recording, further comprising:
outputting a prompt for re-recording the target text;
and after the prompt is output, if the user is detected to read the target text again, continuing to execute the step of recording the read voice.
3. The method of claim 1, wherein the detecting the recording environment of the initial recording comprises:
segmenting the initial recording into speech segments and non-speech segments;
calculating the signal-to-noise ratio of the voice segment;
correspondingly, the judging whether the detection result of the recording environment is qualified includes:
judging whether the signal-to-noise ratio of the voice segment is greater than a preset first signal-to-noise ratio threshold value or not;
if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold reaches a first preset proportion, determining that the detection result of the recording environment is qualified;
and if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion, determining that the detection result of the recording environment is unqualified.
4. The method of claim 3, wherein if the number of signal-to-noise ratios greater than the first signal-to-noise ratio threshold reaches a first preset proportion, further comprising:
if the initial recording is not the first recording of the current recording, acquiring the average value of the signal-to-noise ratio of at least one recorded recording before the initial recording as the average value of the signal-to-noise ratio;
judging whether the absolute value of the difference between the signal-to-noise ratio of the voice segment and the mean value of the signal-to-noise ratio is larger than a preset second signal-to-noise ratio threshold value or not;
if the number of the signal-to-noise ratios which are larger than the second signal-to-noise ratio threshold reaches a second preset proportion, executing the step of determining that the detection result of the recording environment is unqualified;
and if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold value does not reach a second preset proportion, executing the step of determining that the detection result of the recording environment is qualified.
5. The method of claim 1, wherein said detecting the recording quality of the initial audio recording comprises:
carrying out voice recognition on the initial recording to obtain a recognition text;
determining a text correctness rate of the recognized text, wherein the text correctness rate is a ratio of matched text to the target text, and the matched text is text content matched with the target text in the recognized text;
correspondingly, the judging whether the detection result of the recording quality is qualified includes:
judging whether the text accuracy is greater than a preset accuracy threshold value or not;
if so, determining that the detection result of the recording quality is qualified;
and if not, determining that the detection result of the recording quality is unqualified.
6. The method of any of claims 1 to 5, wherein prior to detecting the recording environment and the recording quality of the initial recording, further comprising:
and performing energy normalization on the initial recording to enable energy variation between the initial recording and other recorded recordings to tend to be smooth.
7. The method of claim 6, wherein the energy warping the initial audio recording comprises:
determining the amplitude value of each sampling point in the initial sound recording, and sequencing the amplitude values from large to small;
acquiring at least two amplitude values sequenced in the front, and calculating the average value of the at least two amplitude values;
if the average value is greater than or equal to a preset upper limit value of the amplitude value, obtaining an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value;
if the average value is smaller than a preset lower limit value of the amplitude value, obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value;
and performing energy warping on the initial recording by using the energy warping coefficient.
8. The method according to any one of claims 1 to 5, wherein the target text is a text to be recorded in a pre-constructed recorded text set, and the recorded text set is constructed as follows:
splitting the collected original text corpus into unit texts to form a first text set;
selecting a preset number of unit texts from the first text set to form a second text set, wherein the second text set is equal to or similar to the first text set in terms of text component proportion;
and taking each unit text in the second text set as a text to be recorded to form a recording text set.
9. The method according to claim 8, wherein the text to be recorded is text with or without a character replacement operation, and the character replacement operation is an operation of replacing uncommon words with common words.
10. A voice recording apparatus, comprising:
the initial recording acquisition unit is used for recording the reading voice in the process of reading the target text by the user to obtain an initial recording;
the recording environment detection unit is used for detecting the recording environment of the initial recording; and a recording quality detection unit for detecting the recording quality of the initial recording; detecting the recording environment of the initial recording, including detecting the signal-to-noise ratio of the voice fragment of the initial recording, and detecting the difference value between the signal-to-noise ratio of the voice fragment of the initial recording and the mean value of the signal-to-noise ratios of the recorded recordings, wherein the difference value is used for reflecting the change condition of the recording environment;
the initial recording judging unit is used for judging whether the detection results of the recording environment and the recording quality are qualified or not;
a target recording obtaining unit, configured to, if the detection results of the recording environment and the recording quality of the initial recording are qualified, take the initial recording as a target recording, and retain the target recording;
and the initial recording discarding unit is used for discarding the initial recording if the detection results of the recording environment and the recording quality of the initial recording are unqualified.
11. The apparatus of claim 10, wherein the recording environment detection unit comprises:
a voice segment dividing subunit, configured to divide the initial recording into voice segments and non-voice segments;
the signal-to-noise ratio calculating subunit is used for calculating the signal-to-noise ratio of the voice segment;
correspondingly, the initial recording judgment unit comprises:
a first signal-to-noise ratio judging subunit, configured to judge whether a signal-to-noise ratio of the voice segment is greater than a preset first signal-to-noise ratio threshold;
the first qualification determining subunit is configured to determine that the detection result of the recording environment is qualified if the number of signal-to-noise ratios greater than the first signal-to-noise ratio threshold reaches a first preset ratio;
and the first disqualification determining subunit is used for determining that the detection result of the recording environment is disqualified if the number of the signal-to-noise ratios larger than the first signal-to-noise ratio threshold value does not reach a first preset proportion.
12. The apparatus of claim 11, wherein the initial recording determining unit further comprises:
a signal-to-noise ratio average value obtaining subunit, configured to obtain an average value of signal-to-noise ratios of at least one recorded sound recording before the initial sound recording, as a signal-to-noise ratio average value, if the initial sound recording is not a first sound recording of the current sound recording;
a second signal-to-noise ratio judging subunit, configured to judge whether an absolute value of a difference between the signal-to-noise ratio of the voice segment and the average value of the signal-to-noise ratio is greater than a preset second signal-to-noise ratio threshold;
the second qualification subunit is used for executing the step of determining that the detection result of the recording environment is unqualified if the number of the signal-to-noise ratios which are larger than the second signal-to-noise ratio threshold reaches a second preset proportion;
and the second unqualified determination subunit is used for executing the step of determining that the detection result of the recording environment is qualified if the number of the signal-to-noise ratios larger than the second signal-to-noise ratio threshold value does not reach a second preset proportion.
13. The apparatus of claim 10, wherein the recording quality detection unit comprises:
the recognition text acquisition subunit is used for carrying out voice recognition on the initial recording to obtain a recognition text;
a text correctness determining subunit, configured to determine a text correctness of the recognized text, where the text correctness is a ratio of a matching text to the target text, and the matching text is a text content in the recognized text that matches the target text;
correspondingly, the initial recording judgment unit comprises:
the text correct rate judging subunit is used for judging whether the text correct rate is greater than a preset correct rate threshold value;
a third qualification determining subunit, configured to determine that the detection result of the recording quality is qualified if the text correctness is greater than a preset correctness threshold;
and the fourth unqualified determination subunit is used for determining that the detection result of the recording quality is unqualified if the text accuracy is not greater than a preset accuracy threshold.
14. The apparatus of any one of claims 10 to 13, further comprising:
and the energy warping unit is used for performing energy warping on the initial recording so that the energy change between the initial recording and other recorded recordings tends to be stable.
15. The apparatus of claim 14, wherein the energy normalization unit comprises:
the amplitude value determining subunit is used for determining the amplitude value of each sampling point in the initial sound recording and sequencing the amplitude values from large to small;
the average value operator unit is used for acquiring at least two amplitude values which are sequenced in the front and calculating the average value of the at least two amplitude values;
a first coefficient determining subunit, configured to obtain an energy warping coefficient smaller than 1 according to the average value and the upper limit value of the amplitude value if the average value is greater than or equal to the preset upper limit value of the amplitude value;
the second coefficient determining subunit is used for obtaining an energy warping coefficient larger than 1 according to the average value and the lower limit value of the amplitude value if the average value is smaller than the preset lower limit value of the amplitude value;
and the energy warping subunit is used for performing energy warping on the initial recording by utilizing the energy warping coefficient.
16. The apparatus according to any one of claims 10 to 13, wherein the target text is a text to be recorded in a pre-constructed recorded text set, and the apparatus further comprises:
the first text set forming unit is used for splitting the collected original text corpora into unit texts to form a first text set;
a second text set forming unit, configured to select a preset number of unit texts from the first text set to form a second text set, where a text component ratio of the second text set is equal to or approximate to that of the first text set;
and the recording text set forming unit is used for taking each unit text in the second text set as a text to be recorded to form a recording text set.
17. A voice recording apparatus, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-9.
18. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-9.
CN201810725856.3A 2018-07-04 2018-07-04 Voice recording method and device Active CN108962284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810725856.3A CN108962284B (en) 2018-07-04 2018-07-04 Voice recording method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810725856.3A CN108962284B (en) 2018-07-04 2018-07-04 Voice recording method and device

Publications (2)

Publication Number Publication Date
CN108962284A CN108962284A (en) 2018-12-07
CN108962284B true CN108962284B (en) 2021-06-08

Family

ID=64485487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810725856.3A Active CN108962284B (en) 2018-07-04 2018-07-04 Voice recording method and device

Country Status (1)

Country Link
CN (1) CN108962284B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493869A (en) * 2018-12-25 2019-03-19 苏州思必驰信息科技有限公司 The acquisition method and system of audio data
CN110751940B (en) * 2019-09-16 2021-06-11 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet
CN110473525B (en) * 2019-09-16 2022-04-05 百度在线网络技术(北京)有限公司 Method and device for acquiring voice training sample
CN112559798B (en) * 2019-09-26 2022-05-17 北京新唐思创教育科技有限公司 Method and device for detecting quality of audio content
CN110728133B (en) * 2019-12-19 2020-05-05 北京海天瑞声科技股份有限公司 Individual corpus acquisition method and individual corpus acquisition device
CN110728994B (en) * 2019-12-19 2020-05-05 北京海天瑞声科技股份有限公司 Voice acquisition method and device of voice library, electronic equipment and storage medium
CN111191005A (en) * 2019-12-27 2020-05-22 恒大智慧科技有限公司 Community query method and system, community server and computer readable storage medium
CN111554307A (en) * 2020-05-20 2020-08-18 浩云科技股份有限公司 Voiceprint acquisition registration method and device
CN111933152B (en) * 2020-10-12 2021-01-08 北京捷通华声科技股份有限公司 Method and device for detecting validity of registered audio and electronic equipment
CN112669880B (en) * 2020-12-16 2023-05-02 北京读我网络技术有限公司 Method and system for adaptively detecting voice ending
CN113241057B (en) * 2021-04-26 2024-06-18 标贝(青岛)科技有限公司 Interactive method, device, system and medium for training speech synthesis model
CN113889096A (en) * 2021-09-16 2022-01-04 北京捷通华声科技股份有限公司 Method and device for analyzing sound library training data
CN114580356A (en) * 2022-01-26 2022-06-03 大连即时智能科技有限公司 Text editing method and system
CN114743567A (en) * 2022-04-12 2022-07-12 维沃移动通信有限公司 Audio data processing method, device and electronic equipment
CN115376524B (en) * 2022-07-15 2023-08-04 荣耀终端有限公司 Voice awakening method, electronic equipment and chip system
CN116527813B (en) * 2023-06-26 2023-08-29 深圳市易赛通信技术有限公司 Recording method of recording watch and recording watch

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458943A (en) * 2008-12-31 2009-06-17 北京中星微电子有限公司 Sound recording control method and sound recording device
CN102811386A (en) * 2011-06-01 2012-12-05 中兴通讯股份有限公司 Recording device, media server, recording method and system
CN103198828A (en) * 2013-04-03 2013-07-10 中金数据系统有限公司 Method and system of construction of voice corpus
CN105096934A (en) * 2015-06-30 2015-11-25 百度在线网络技术(北京)有限公司 Method for constructing speech feature library as well as speech synthesis method, device and equipment
US9390719B1 (en) * 2012-10-09 2016-07-12 Google Inc. Interest points density control for audio matching
CN106653029A (en) * 2016-12-02 2017-05-10 广东小天才科技有限公司 Audio batch segmentation method and device
US20170178661A1 (en) * 2015-12-22 2017-06-22 Intel Corporation Automatic self-utterance removal from multimedia files
CN107221319A (en) * 2017-05-16 2017-09-29 厦门盈趣科技股份有限公司 A kind of speech recognition test system and method
CN108172230A (en) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method, terminal installation and storage medium based on Application on Voiceprint Recognition model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100629509B1 (en) * 2005-05-16 2006-09-28 삼성전자주식회사 Apparatus and method for measuring signal-to-noise ratio of signals read from optical disc
JP5041293B2 (en) * 2008-05-30 2012-10-03 富士電機株式会社 Magnetic recording medium evaluation apparatus and evaluation method thereof
CN101458944B (en) * 2008-12-31 2013-01-09 无锡中星微电子有限公司 A recording control method and recording device
JP4812881B2 (en) * 2010-01-20 2011-11-09 日立コンシューマエレクトロニクス株式会社 Recording condition adjusting method and optical disc apparatus
CN103684668B (en) * 2012-09-19 2017-04-26 中兴通讯股份有限公司 Method and device for determining CQI (Channel Quality Indicator) value and LTE (Long Term Evolution) terminal
CN105261375B (en) * 2014-07-18 2018-08-31 中兴通讯股份有限公司 Activate the method and device of sound detection
CN106328169B (en) * 2015-06-26 2018-12-11 中兴通讯股份有限公司 A kind of acquisition methods, activation sound detection method and the device of activation sound amendment frame number
CN105513614B (en) * 2015-12-03 2019-05-03 广东顺德中山大学卡内基梅隆大学国际联合研究院 A sound region detection method based on noise power spectrum Gamma distribution statistical model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458943A (en) * 2008-12-31 2009-06-17 北京中星微电子有限公司 Sound recording control method and sound recording device
CN102811386A (en) * 2011-06-01 2012-12-05 中兴通讯股份有限公司 Recording device, media server, recording method and system
US9390719B1 (en) * 2012-10-09 2016-07-12 Google Inc. Interest points density control for audio matching
CN103198828A (en) * 2013-04-03 2013-07-10 中金数据系统有限公司 Method and system of construction of voice corpus
CN105096934A (en) * 2015-06-30 2015-11-25 百度在线网络技术(北京)有限公司 Method for constructing speech feature library as well as speech synthesis method, device and equipment
US20170178661A1 (en) * 2015-12-22 2017-06-22 Intel Corporation Automatic self-utterance removal from multimedia files
CN106653029A (en) * 2016-12-02 2017-05-10 广东小天才科技有限公司 Audio batch segmentation method and device
CN107221319A (en) * 2017-05-16 2017-09-29 厦门盈趣科技股份有限公司 A kind of speech recognition test system and method
CN108172230A (en) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method, terminal installation and storage medium based on Application on Voiceprint Recognition model

Also Published As

Publication number Publication date
CN108962284A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108962284B (en) Voice recording method and device
Shriberg et al. Prosody-based automatic segmentation of speech into sentences and topics
US9466289B2 (en) Keyword detection with international phonetic alphabet by foreground model and background model
CN108986830B (en) Audio corpus screening method and device
CN100371926C (en) Method, apparatus, and program for dialogue, and storage medium including a program stored therein
US7949530B2 (en) Conversation controller
CN107958673B (en) Spoken language scoring method and device
CN112967711B (en) A method, system and storage medium for evaluating spoken language pronunciation in small languages
JP5824829B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
CN109300468B (en) Voice labeling method and device
CN110782875B (en) Voice rhythm processing method and device based on artificial intelligence
JP5506738B2 (en) Angry emotion estimation device, anger emotion estimation method and program thereof
CN110634479B (en) Voice interaction system, processing method thereof, and program thereof
CN110556105B (en) Voice interaction system, processing method thereof, and program thereof
JP2015212732A (en) Sound metaphor recognition device and program
CN113689882B (en) Pronunciation evaluation method, pronunciation evaluation device, electronic equipment and readable storage medium
Hansen et al. Speaker height estimation from speech: Fusing spectral regression and statistical acoustic models
CN111785299B (en) Voice evaluation method, device, equipment and computer storage medium
CN114528812A (en) Voice recognition method, system, computing device and storage medium
CN112992183B (en) Singing smell scoring method and device
CN104900226A (en) Information processing method and device
JP2020008730A (en) Emotion estimation system and program
JP5007401B2 (en) Pronunciation rating device and program
US20140074478A1 (en) System and method for digitally replicating speech
Schuller et al. Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant