
CN109275009B - Method and device for controlling synchronization of audio and text - Google Patents

Method and device for controlling synchronization of audio and text

Info

Publication number
CN109275009B
Authority
CN
China
Prior art keywords
text
synchronized
target
audio
edited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811151871.8A
Other languages
Chinese (zh)
Other versions
CN109275009A (en)
Inventor
李全
孔常青
王玮
苏文畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Tingjian Technology Co ltd
Original Assignee
Anhui Tingjian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Tingjian Technology Co ltd filed Critical Anhui Tingjian Technology Co ltd
Priority to CN201811151871.8A priority Critical patent/CN109275009B/en
Publication of CN109275009A publication Critical patent/CN109275009A/en
Application granted granted Critical
Publication of CN109275009B publication Critical patent/CN109275009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a method and a device for controlling synchronization of audio and text. The method comprises the following steps: acquiring a target audio and a recognized text converted from the target audio; receiving an editing operation on the recognized text to obtain an edited text of the recognized text; determining a target text according to the recognized text and the edited text; and controlling the target text and the target audio to be synchronized. The embodiment thus resynchronizes the target text with the target audio while the recognized text is being edited.

Description

Method and device for controlling synchronization of audio and text
Technical Field
The invention relates to the technical field of speech processing, and in particular to a method and a device for controlling synchronization of audio and text.
Background
With the popularization of internet applications, there are more and more scenes in which the text corresponding to audio needs to be edited in real time, for example when editing video subtitles or preparing meeting minutes with strict requirements on time-axis accuracy. The audio and the text recognized from it are initially aligned, but once the recognized text is edited, the edited text is no longer aligned with the original audio. This is inconvenient in scenes where the audio and its corresponding text need to be played back and displayed together.
Disclosure of Invention
An embodiment of the invention provides a method and a device for controlling synchronization of audio and text, aiming to solve the prior-art problem that, after the recognized text of an audio has been edited, the edited text can no longer be realigned with the audio.
In view of the foregoing problems, in a first aspect, an embodiment of the present invention provides a method for controlling synchronization of audio and text, where the method includes:
acquiring a target audio and a recognized text converted from the target audio;
receiving an editing operation on the recognized text to obtain an edited text of the recognized text;
determining a target text according to the recognized text and the edited text; and
controlling the target text and the target audio to be synchronized.
In a second aspect, an embodiment of the present invention provides an apparatus for controlling synchronization of audio and text, where the apparatus includes:
a first obtaining module, configured to acquire a target audio and a recognized text converted from the target audio;
a second obtaining module, configured to receive an editing operation on the recognized text to obtain an edited text of the recognized text;
a determining module, configured to determine a target text according to the recognized text and the edited text; and
a control module, configured to control the target text and the target audio to be synchronized.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for controlling audio and text synchronization when executing the computer program.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for controlling audio and text synchronization.
According to the method and the device for controlling synchronization of audio and text, when an editing operation on the recognized text converted from the target audio is received, the edited text of the recognized text is obtained, a target text for synchronization with the target audio is determined according to the recognized text and the edited text, and the target text and the target audio are then controlled to be synchronized. This avoids the situation in which an editing error by the user leaves the edited text differing from the audio so that the audio can no longer be synchronized accurately: the target text obtained from the recognized text and the edited text can be resynchronized with the target audio directly while the recognized text is being edited, realizing real-time synchronization of the target text and the target audio during editing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating the steps of a method for controlling audio and text synchronization in an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps in determining a target text based on a recognized text and an edited text in an embodiment of the present invention;
FIG. 3 shows a flow chart of the steps of the method for controlling audio and text synchronization that follow step 202 in FIG. 2;
FIG. 4 is a block diagram of an apparatus for controlling audio and text synchronization according to an embodiment of the present invention;
fig. 5 shows a block diagram of an electronic device in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, which is a flowchart illustrating steps of a method for controlling audio and text synchronization according to an embodiment of the present invention, the method includes the following steps:
step 101: and acquiring the target audio and the recognition text converted by the target audio.
In this step, specifically, the present embodiment may convert the target audio into the recognition text through the speech-to-text system, and obtain the recognition text.
Of course, the recognized text is synchronized with the target audio.
Specifically, the target audio may be a recorded voice or voice data generated in real time, where the speech generated when the user speaks, the audio played by the terminal in real time, and the audio without deterministic content, such as the language output by the intelligent robot in real time, may all be determined as the voice data generated in real time.
The speech-to-text system converts input audio into recognized text carrying time information. The target audio and the recognized text converted from it can therefore be regarded as an array of timed segments: each element describes one section of audio and comprises an audio number, the start and stop times of that audio, and the recognized text synchronized with it. For example, one sentence of audio may be described as follows:
{ audio number: 1;
audio start-stop time (unit: ms): [0, 1740];
recognized text: "Teacher Wang knows"
}
In this way, by acquiring the target audio and the recognized text converted from it as described above, the synchronization between the target audio and the recognized text is guaranteed.
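For illustration only, the following minimal sketch (in Python, which the patent itself does not prescribe) shows one way such timed segments could be represented; the SpeechSegment name and its fields are assumptions introduced here, mirroring the audio number, start-stop time, and recognized text described above.

```python
from dataclasses import dataclass

@dataclass
class SpeechSegment:
    """One section of target audio and the recognized text synchronized with it."""
    audio_id: int   # audio number
    start_ms: int   # segment start time, in milliseconds
    end_ms: int     # segment end time, in milliseconds
    text: str       # recognized text synchronized with this section of audio

# The example segment described above.
segment = SpeechSegment(audio_id=1, start_ms=0, end_ms=1740, text="Teacher Wang knows")
```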
Step 102: receiving an editing operation on the recognized text to obtain the edited text of the recognized text.
In this step, after the recognized text converted from the target audio has been obtained, the user terminal may perform an editing operation on it, and the present embodiment then obtains the edited text of the recognized text.
It should be noted that the edited text may be produced in real time while the target audio is being converted into the recognized text, or it may be obtained whenever the recognized text is edited at any time after the conversion.
The recognized text and the edited text are exemplified below.
For example, the recognized text is: "Today bar the weather is nice o tomorrow go climb Dashu Mountain." (the words "bar" and "o" are later removed by the user when editing).
The edited text is: "Today the weather is nice, tomorrow go climb Dashu Mountain."
Because the edited text is obtained by editing the recognized text, it both meets the user's requirements and remains close to the target audio.
Step 103: determining the target text according to the recognized text and the edited text.
In this step, after the edited text of the recognized text is obtained, the target text to be resynchronized with the target audio may be determined from the recognized text of the target audio and the edited text obtained by editing it.
Because the edited text is derived from the recognized text, and the recognized text is synchronized with the target audio, determining the target text from both the recognized text and the edited text avoids the situation in which a user's editing error leaves the edited text differing from the audio so that the audio can no longer be synchronized accurately, and it ensures the accuracy of the resulting target text relative to the target audio.
Step 104: controlling the target text to be synchronized with the target audio.
In this step, once the target text for resynchronization with the target audio has been determined, the target text and the target audio can be controlled to be synchronized directly. In this way, the target text obtained from the recognized text and its edited text is resynchronized with the target audio during the editing of the recognized text, the inaccurate synchronization caused by user editing errors is avoided, and subsequent operations on the target audio become more convenient.
Of course, it should be noted that, before controlling the target text and the target audio to be synchronized, the present embodiment may also send the target text to the user terminal so that the user terminal can confirm it; the step of controlling the target text to be synchronized with the target audio is then executed when an instruction is received from the user terminal confirming that the target text is correct.
Sending the target text to the user terminal for final confirmation by the user avoids the situation in which the target text contains errors or does not meet the user's requirements, and it improves the user's satisfaction with the data obtained after the target text and the target audio are synchronized.
In addition, after the target text has been sent to the user terminal, if an instruction is received from the user terminal indicating that the target text is erroneous, the edited text may be controlled to be synchronized with the target audio instead. In that case the target text does not meet the user's needs, so the edited text produced by the user is synchronized with the target audio directly, which preserves the user's own edits and thus the user's satisfaction with the synchronized result.
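As a hedged sketch of this confirmation flow (not part of the patent text), the decision logic might look as follows; `terminal.confirm` and the `synchronize` callable are hypothetical interfaces standing in for the user terminal and the synchronization step.

```python
def confirm_and_synchronize(target_text: str, edited_text: str,
                            target_audio, terminal, synchronize) -> None:
    """Ask the user terminal to confirm the target text, then synchronize
    the confirmed target text, or fall back to the edited text on rejection."""
    if terminal.confirm(target_text):            # user confirms the target text is correct
        synchronize(target_text, target_audio)
    else:                                        # user marks the target text as erroneous
        synchronize(edited_text, target_audio)   # use the user's own edited text instead
```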
In summary, in this embodiment, when the edited text of the recognized text converted from the target audio is obtained, the target text is determined from the recognized text and the edited text and is then controlled to be synchronized with the target audio. The target text obtained from the recognized text and its edited text can therefore be resynchronized with the target audio directly while the recognized text is being edited, which avoids inaccurate synchronization caused by user editing errors, realizes synchronization with the target audio during text editing, and facilitates subsequent operations on the target audio.
The process of determining the target text from the recognized text and the edited text in step 103 is described in more detail below.
As shown in fig. 2, determining the target text according to the recognized text and the edited text may include the following steps:
Step 201: acquiring the modified words of the edited text relative to the recognized text.
In this step, an alignment result of the edited text and the recognized text, both without punctuation, may be obtained first; the modified words of the edited text relative to the recognized text are then obtained from the alignment result.
Specifically, to obtain this alignment result, the edited text and the recognized text (with punctuation) may first be aligned using an edit-distance method with a backtracking algorithm, after which the punctuation in both texts is filtered out, yielding the alignment of the edited text and the recognized text without punctuation.
Of course, the alignment between the edited text and the recognized text may also be obtained by dynamic programming; the alignment method is not specifically limited here.
The above process is exemplified below.
For example, assume that the recognized text is "Today bar the weather is nice o tomorrow go climb Dashu Mountain." and the edited text obtained by editing it is "Today the weather is nice, tomorrow go climb Dashu Mountain.". The alignment result of the two texts without punctuation is then:
"Today bar the weather is nice o tomorrow go climb Dashu Mountain";
"Today the weather is nice tomorrow go climb Dashu Mountain";
From this alignment result, the modified words of the edited text relative to the recognized text are "bar" and "o".
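A minimal sketch of this step is given below. It uses Python's difflib.SequenceMatcher as a stand-in for the edit-distance alignment with backtracking described above (the patent does not prescribe a particular algorithm), and it tokenizes on whitespace for the English rendering of the example, whereas the patent's Chinese example works character by character.

```python
import difflib
import string

PUNCTUATION = "，。！？；：、" + string.punctuation  # Chinese and ASCII punctuation

def strip_punctuation(text: str) -> str:
    """Remove punctuation before alignment, as described in step 201."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)

def modified_words(recognized: str, edited: str) -> list:
    """Return the words of the recognized text that were removed or changed in the edited text."""
    rec = strip_punctuation(recognized).split()
    edt = strip_punctuation(edited).split()
    matcher = difflib.SequenceMatcher(None, rec, edt)
    changed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changed.extend(rec[i1:i2])
    return changed

# For the example above this returns ["bar", "o"].
words = modified_words(
    "Today bar the weather is nice o tomorrow go climb Dashu Mountain.",
    "Today the weather is nice, tomorrow go climb Dashu Mountain.")
```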
Step 202: obtaining a plurality of first texts to be synchronized according to the modified words.
In this step, none of the first texts to be synchronized contains punctuation marks.
Specifically, a user may make editing mistakes without noticing them in time while editing the recognized text. For example, when inserting a word into the recognized text the user may carelessly insert an extra word, or when deleting a character the user may carelessly delete the preceding or following character as well, so that the edited text itself contains errors. The present embodiment therefore obtains a plurality of first texts to be synchronized from the modified words, so that a target text free of such errors can be selected from them.
When obtaining the plurality of first texts to be synchronized from the modified words, any one of the modified words, or any combination of them, may be modified in the recognized text without punctuation marks to obtain a modified first text; the recognized text without punctuation marks and the modified first texts are then together determined as the first texts to be synchronized. The number of first texts to be synchronized is therefore 2^N, where N is the number of modified words.
It should be noted that when any one or a combination of the modified words is modified in the recognized text without punctuation marks, the position of each modification in the recognized text is the position of the corresponding modified word in the recognized text.
The example from step 201 is continued below.
In step 201 the modified words "bar" and "o" were obtained. Modifying any one or a combination of these words in the punctuation-free recognized text "Today bar the weather is nice o tomorrow go climb Dashu Mountain" yields the following modified first texts:
"Today the weather is nice o tomorrow go climb Dashu Mountain"
"Today bar the weather is nice tomorrow go climb Dashu Mountain"
"Today the weather is nice tomorrow go climb Dashu Mountain"
The recognized text without punctuation marks and the three modified first texts above are then determined as the first texts to be synchronized.
In this way, obtaining a plurality of first texts to be synchronized from the modified words allows the target text to be selected from several candidates. This increases the selectivity of the target text and avoids the data errors and inaccurate audio synchronization that would result from directly taking an erroneously edited text as the target text.
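The candidate-generation step can be sketched as follows (an illustrative Python rendering, not the patent's own code): every subset of the modified words is removed from the punctuation-free recognized text, so that 2^N candidates are produced, the empty subset yielding the recognized text itself. Removing whole words is an assumption made for the English example; the patent's example operates on Chinese characters.

```python
from itertools import combinations

def candidate_texts(recognized_no_punct: str, modified: list) -> list:
    """Return the 2**N first texts to be synchronized obtained by removing
    any subset of the N modified words from the recognized text."""
    tokens = recognized_no_punct.split()
    candidates = []
    for k in range(len(modified) + 1):
        for subset in combinations(modified, k):
            remaining = list(tokens)
            for word in subset:
                if word in remaining:
                    remaining.remove(word)       # drop this modified word
            candidates.append(" ".join(remaining))
    return candidates

# For the example this yields the recognized text plus the three modified first texts listed above.
```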
Step 203: determining a target text from the plurality of first texts to be synchronized.
In this step, the target text may be determined from the plurality of first texts to be synchronized using a pre-trained acoustic model and a pre-trained language model, which ensures both the efficiency of obtaining the target text and its correctness.
Specifically, each first text to be synchronized may be aligned with the target audio through a preset acoustic model to obtain an acoustic score for that text; a language score for each first text to be synchronized is obtained through a preset language model; a composite score for each first text to be synchronized is then computed from its acoustic score and language score, and the target text is determined from the plurality of first texts to be synchronized according to these composite scores.
It should be noted that, when computing the composite score of each first text to be synchronized, its acoustic score and language score may be weighted and summed, the weighted sum being taken as the composite score, which ensures the reliability of the result; the weights may be preset according to application requirements and experience. Alternatively, the plain sum of the acoustic score and the language score may be taken directly as the composite score, which keeps the computation simple. The specific manner of obtaining the composite score is not limited here.
In addition, when determining the target text from the composite scores, the first text to be synchronized with the highest composite score may be selected as the target text. Determining the target text on the basis of the composite scores ensures that it is the candidate with the highest overall accuracy among the first texts to be synchronized.
In this way, the embodiment obtains a plurality of first texts to be synchronized from the modified words of the edited text relative to the recognized text and uses a pre-trained acoustic model and language model to determine the target text from them, which ensures the accuracy of the determined target text and of the subsequent audio synchronization and avoids the inaccurate synchronization that would result from directly synchronizing a text containing editing errors.
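A hedged sketch of the scoring step follows. `acoustic_model.align_score` and `language_model.score` are hypothetical interfaces standing in for the pre-trained acoustic and language models the patent presupposes, and the 0.5/0.5 weights are purely illustrative; as noted above, the weights would be preset according to application requirements and experience, or the plain sum could be used instead.

```python
def select_target_text(candidates, target_audio, acoustic_model, language_model,
                       acoustic_weight=0.5, language_weight=0.5):
    """Return the candidate with the highest composite (acoustic + language) score."""
    best_text, best_score = None, float("-inf")
    for text in candidates:
        acoustic = acoustic_model.align_score(target_audio, text)   # forced-alignment score
        linguistic = language_model.score(text)                     # language-model score
        composite = acoustic_weight * acoustic + language_weight * linguistic
        if composite > best_score:
            best_text, best_score = text, composite
    return best_text
```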
Furthermore, the user may also edit punctuation marks while editing the recognized text; this case is described below.
As shown in fig. 3, after step 202 in fig. 2, the present embodiment further includes the following steps:
step 301: and acquiring punctuation marks in the identification text, and adding the punctuation marks in the identification text to each first text to be synchronized to obtain a plurality of texts before punctuation editing.
The description is continued with the recognized text illustrated in step 201 and the first text to be synchronized illustrated in step 202.
The identification text is' the day bar goes to climb the Sichuan mountain in the daytime if the weather is not good. "punctuation in the recognition text may be obtained as" yes ". "at this time, punctuation marks in the recognized text are added to each first text to be synchronized in step 202 without punctuation marks, so as to obtain a plurality of punctuation edits, which are as follows:
"go to climb DASHU mountain in daytime and tomorrow. "
"go to climb Dashu mountain in daytime but not tomorrow today. "
"today go to climb Dashu mountain in the best day. "
"it is good at day weather to climb Dashu mountain. "
Step 302: determining the modified punctuation of the edited text relative to each text before punctuation editing, according to the plurality of texts before punctuation editing and the edited text.
In this step, the edited text may be aligned with each text before punctuation editing, and the modified punctuation of the edited text relative to each such text is determined from the alignment result.
The modification of punctuation may include addition, deletion, change, and the like.
In addition, because the punctuation contained in each text before punctuation editing is exactly the punctuation of the recognized text, the edited text can simply be aligned with the recognized text; the punctuation edited by the user relative to the recognized text, determined from that alignment, is then the modified punctuation of the edited text relative to every text before punctuation editing.
The previous example is continued.
The edited text is "Today the weather is nice, tomorrow go climb Dashu Mountain." and the recognized text is "Today bar the weather is nice o tomorrow go climb Dashu Mountain.". The modified punctuation of the edited text relative to each text before punctuation editing is therefore the comma "," inserted after "nice".
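A minimal sketch of this determination is shown below. Because every text before punctuation editing carries exactly the punctuation of the recognized text, the sketch aligns the edited text against the recognized text (here with difflib, which the patent does not prescribe) and records each punctuation mark the user inserted together with the recognized-text fragment it follows. Only insertions are sketched; deletions and changes of punctuation would be collected analogously.

```python
import difflib

PUNCTUATION_MARKS = set("，。！？；：、,.!?;:")

def added_punctuation(recognized: str, edited: str) -> list:
    """Return (anchor, mark) pairs: `mark` was inserted by the user right after
    the recognized-text fragment ending with `anchor`."""
    matcher = difflib.SequenceMatcher(None, recognized, edited)
    added = []
    for tag, i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            for ch in edited[j1:j2]:
                if ch in PUNCTUATION_MARKS:
                    anchor = recognized[:i1].rstrip()[-4:]   # last few characters before the insertion point
                    added.append((anchor, ch))
    return added

# For the example this returns [("nice", ",")]: the user inserted a comma after "nice".
```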
Step 303: performing punctuation editing on each text before punctuation editing according to the modified punctuation of the edited text relative to it, to obtain a plurality of texts after punctuation editing.
The previous example is continued in this step.
The texts before punctuation editing are:
"Today bar the weather is nice o tomorrow go climb Dashu Mountain."
"Today the weather is nice o tomorrow go climb Dashu Mountain."
"Today bar the weather is nice tomorrow go climb Dashu Mountain."
"Today the weather is nice tomorrow go climb Dashu Mountain."
The modified punctuation of the edited text relative to each text before punctuation editing is the comma "," inserted after "nice". Performing punctuation editing on each text before punctuation editing therefore yields the following texts after punctuation editing:
"Today bar the weather is nice, o tomorrow go climb Dashu Mountain."
"Today the weather is nice, o tomorrow go climb Dashu Mountain."
"Today bar the weather is nice, tomorrow go climb Dashu Mountain."
"Today the weather is nice, tomorrow go climb Dashu Mountain."
Step 304: determining the plurality of texts before punctuation editing and the plurality of texts after punctuation editing as the second texts to be synchronized.
In this step, determining both the texts before punctuation editing and the texts after punctuation editing as the second texts to be synchronized means that the second texts to be synchronized reflect both the modified words and the modified punctuation. This increases the selectivity when the target text is chosen from the second texts to be synchronized, and avoids the data errors that would result from directly taking an erroneously edited text as the target text when the edited text contains word or punctuation editing errors.
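Continuing the sketch, steps 303 and 304 can be rendered as follows (an illustrative sketch, not the patent's code): each punctuation mark added by the user is inserted after its anchor fragment in every text before punctuation editing, and the texts before and after punctuation editing are pooled into the second texts to be synchronized.

```python
def apply_added_punctuation(text: str, added: list) -> str:
    """Insert each added mark immediately after its anchor fragment, if present."""
    result = text
    for anchor, mark in added:
        idx = result.find(anchor)
        if idx != -1:
            pos = idx + len(anchor)
            result = result[:pos] + mark + result[pos:]
    return result

def second_texts_to_synchronize(pre_edit_texts: list, added: list) -> list:
    """Pool the texts before punctuation editing with their punctuation-edited versions."""
    post_edit_texts = [apply_added_punctuation(t, added) for t in pre_edit_texts]
    return pre_edit_texts + post_edit_texts
```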
Step 305: determining a target text from the plurality of second texts to be synchronized.
In this step, the target text may be determined from the plurality of second texts to be synchronized using a pre-trained acoustic model and language model, ensuring both the efficiency of obtaining the target text and its correctness.
Specifically, each text before punctuation editing among the second texts to be synchronized is aligned with the target audio through a preset acoustic model to obtain its acoustic score; a language score for each second text to be synchronized is obtained through a preset language model; a composite score for each second text to be synchronized is then computed from its acoustic score and language score, and the target text is determined from the plurality of second texts to be synchronized according to these composite scores. The acoustic score of each text after punctuation editing among the second texts to be synchronized is the same as the acoustic score of its corresponding text before punctuation editing.
It should be noted that, as before, the acoustic score and the language score of each second text to be synchronized may be weighted and summed to give its composite score, which ensures the reliability of the result, with the weights preset according to application requirements and experience; alternatively, the plain sum of the two scores may be taken directly as the composite score, which keeps the computation simple. The specific manner of obtaining the composite score is not limited here.
In addition, when determining the target text from the composite scores, the second text to be synchronized with the highest composite score may be selected as the target text. Determining the target text on the basis of the composite scores ensures that it is the candidate with the highest overall accuracy among the second texts to be synchronized.
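The scoring of the second texts can reuse the sketch given for step 203, with one addition reflecting the detail above: a text after punctuation editing shares the acoustic score of its punctuation-free form, so acoustic alignment needs to run only once per such form. The model interfaces and weights remain the same hypothetical stand-ins as before.

```python
PUNCT = set("，。！？；：、,.!?;:")

def select_second_target_text(second_texts, target_audio, acoustic_model,
                              language_model, w_acoustic=0.5, w_language=0.5):
    """Return the second text to be synchronized with the highest composite score,
    caching acoustic scores by punctuation-free form."""
    acoustic_cache = {}
    best_text, best_score = None, float("-inf")
    for text in second_texts:
        key = "".join(ch for ch in text if ch not in PUNCT)   # punctuation-free form
        if key not in acoustic_cache:                         # align against the audio only once per form
            acoustic_cache[key] = acoustic_model.align_score(target_audio, key)
        composite = w_acoustic * acoustic_cache[key] + w_language * language_model.score(text)
        if composite > best_score:
            best_text, best_score = text, composite
    return best_text
```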
In this way, the embodiment obtains a plurality of first texts to be synchronized from the modified words of the edited text relative to the recognized text, adds the punctuation marks of the recognized text to each of them to obtain a plurality of texts before punctuation editing, determines the modified punctuation of the edited text relative to each text before punctuation editing, performs punctuation editing accordingly to obtain a plurality of texts after punctuation editing, and finally determines the texts before and after punctuation editing together as the second texts to be synchronized from which the target text is selected. Because the target text is obtained on the basis of both the modified words and the modified punctuation, its accuracy is guaranteed, and the data errors and inaccurate audio synchronization that would result from directly taking an erroneously edited text as the target text when the edited text contains word or punctuation errors are avoided.
In summary, when an editing operation on the recognized text converted from the target audio is received, the edited text of the recognized text is obtained, the target text for synchronization with the target audio is determined from the recognized text and the edited text, and the target text and the target audio are then controlled to be synchronized. This avoids the situation in which a user's editing error leaves the edited text differing from the audio so that the audio can no longer be synchronized accurately, allows the target text obtained from the recognized text and the edited text to be resynchronized with the target audio directly during editing, and thus realizes real-time synchronization of the target text and the target audio while the recognized text is being edited.
In addition, as shown in fig. 4, an embodiment of the present invention further provides an apparatus for controlling synchronization of audio and text, where the apparatus includes:
a first obtaining module 401, configured to acquire a target audio and a recognized text converted from the target audio;
a second obtaining module 402, configured to receive an editing operation on the recognized text to obtain an edited text of the recognized text;
a determining module 403, configured to determine a target text according to the recognized text and the edited text;
a control module 404, configured to control the target text and the target audio to be synchronized.
In the device provided by this embodiment, the first obtaining module acquires the target audio and the recognized text converted from it, the second obtaining module receives the editing operation on the recognized text and obtains the edited text, the determining module then determines the target text from the recognized text and the edited text, and finally the control module controls the target text and the target audio to be synchronized. This avoids the problem that an editing error by the user leaves the edited text differing from the audio so that the audio cannot be synchronized accurately.
Optionally, the determining module 403 includes:
a first obtaining unit, configured to obtain the modified words of the edited text relative to the recognized text;
a second obtaining unit, configured to obtain a plurality of first texts to be synchronized according to the modified words, where none of the first texts to be synchronized contains punctuation marks;
a first determining unit, configured to determine the target text from the plurality of first texts to be synchronized.
Optionally, the first obtaining unit is configured to obtain an alignment result of the edited text and the recognized text without punctuation marks, and to obtain the modified words of the edited text relative to the recognized text according to the alignment result.
Optionally, the second obtaining unit includes:
a first obtaining subunit, configured to modify, according to the modified words, any one or a combination of the modified words in the recognized text without punctuation marks to obtain a modified first text;
a first determining subunit, configured to determine the recognized text without punctuation marks and the modified first text as the first texts to be synchronized.
Optionally, the apparatus further includes:
a third obtaining unit, configured to obtain the punctuation marks in the recognized text and add them to each first text to be synchronized to obtain a plurality of texts before punctuation editing;
a second determining unit, configured to determine, according to the plurality of texts before punctuation editing and the edited text, the modified punctuation of the edited text relative to each text before punctuation editing;
a fourth obtaining unit, configured to perform punctuation editing on each text before punctuation editing according to the modified punctuation of the edited text relative to it, to obtain a plurality of texts after punctuation editing;
a third determining unit, configured to determine the plurality of texts before punctuation editing and the plurality of texts after punctuation editing as the second texts to be synchronized;
a fourth determining unit, configured to determine the target text from the plurality of second texts to be synchronized.
Optionally, the apparatus further comprises:
the sending module is used for sending the target text to a user terminal so that the user terminal confirms the target text;
and the processing module is used for triggering the control module to control the target text and the target audio to be synchronized when receiving an instruction that the user terminal confirms that the target text is the correct text.
Optionally, the apparatus further comprises:
and the synchronization module is used for controlling the edited text to be synchronized with the target audio when an instruction that the target text is confirmed to be an error text by the user terminal is received.
When receiving an editing operation on the recognized text converted from the target audio, the device provided by this embodiment obtains the edited text of the recognized text, determines the target text for synchronization with the target audio from the recognized text and the edited text, and then controls the target text and the target audio to be synchronized, so that the target text obtained from the recognized text and the edited text can be resynchronized with the target audio directly during editing, realizing real-time synchronization of the target text and the target audio while the recognized text is being edited.
It should be noted that, in the embodiment of the present invention, the related functional modules may be implemented by a hardware processor, and the same technical effect can be achieved; details are not repeated here.
In yet another embodiment of the present invention, an electronic device is provided. As shown in fig. 5, it includes a memory 501, a processor 502, and a computer program stored on the memory 501 and executable on the processor 502. The memory 501 and the processor 502 communicate with each other through a bus 503. The processor 502 is configured to call program instructions in the memory 501 to perform the following method: acquiring a target audio and a recognized text converted from the target audio; receiving an editing operation on the recognized text to obtain an edited text of the recognized text; determining a target text according to the recognized text and the edited text; and controlling the target text and the target audio to be synchronized.
The electronic device provided by the embodiment of the invention can execute the specific steps in the method for controlling the synchronization of the audio and the text, and can achieve the same technical effect, and the specific description is not provided herein.
Further, the program instructions in the memory 501 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In a further embodiment of the invention, a non-transitory computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the following method: acquiring a target audio and a recognized text converted from the target audio; receiving an editing operation on the recognized text to obtain an edited text of the recognized text; determining a target text according to the recognized text and the edited text; and controlling the target text and the target audio to be synchronized.
The non-transitory computer-readable storage medium provided in the embodiments of the present invention can perform specific steps in a method for controlling audio and text synchronization, and can achieve the same technical effects, which are not described in detail herein.
In yet another embodiment of the present invention, a computer program product is provided. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, perform the following method: acquiring a target audio and a recognized text converted from the target audio; receiving an editing operation on the recognized text to obtain an edited text of the recognized text; determining a target text according to the recognized text and the edited text; and controlling the target text and the target audio to be synchronized.
The computer program product provided by the embodiment of the invention can execute the specific steps in the method for controlling the synchronization of the audio and the text, and can achieve the same technical effect, and the specific description is not provided herein.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for controlling synchronization of audio and text, the method comprising:
acquiring a target audio and a recognized text converted from the target audio;
receiving an editing operation on the recognized text to obtain an edited text of the recognized text;
determining a target text according to the recognized text and the edited text; and
controlling the target text and the target audio to be synchronized;
wherein the determining a target text according to the recognized text and the edited text comprises:
acquiring the modified words of the edited text relative to the recognized text;
obtaining a plurality of first texts to be synchronized according to the modified words, wherein each first text to be synchronized in the plurality of first texts to be synchronized does not contain punctuation marks; and
determining the target text from the plurality of first texts to be synchronized;
wherein the determining the target text from the plurality of first texts to be synchronized comprises:
aligning each first text to be synchronized in the plurality of first texts to be synchronized with the target audio through a preset acoustic model to obtain an acoustic score of each first text to be synchronized; obtaining a language score of each first text to be synchronized through a preset language model; obtaining a composite score of each first text to be synchronized according to the acoustic score and the language score of each first text to be synchronized; and determining the target text from the plurality of first texts to be synchronized according to the composite score of each first text to be synchronized.
2. The method of claim 1, wherein acquiring the modified words of the edited text relative to the recognized text comprises:
acquiring an alignment result of the edited text and the recognized text without punctuation marks; and
obtaining the modified words of the edited text relative to the recognized text according to the alignment result.
3. The method of claim 1, wherein obtaining a plurality of first texts to be synchronized according to the modified words comprises:
modifying, according to the modified words, any one or a combination of the modified words in the recognized text without punctuation marks to obtain a modified first text; and
determining the recognized text without punctuation marks and the modified first text as the first texts to be synchronized.
4. The method of claim 1, wherein prior to controlling the target text to synchronize with the target audio, the method further comprises:
sending the target text to a user terminal so that the user terminal confirms the target text;
and when an instruction that the target text is confirmed to be a correct text by the user terminal is received, executing the step of controlling the target text and the target audio to be synchronous.
5. The method of claim 4, wherein after sending the target text to the user terminal, the method further comprises:
and when an instruction that the target text is confirmed to be an error text by the user terminal is received, controlling the edited text to be synchronous with the target audio.
6. An apparatus for controlling synchronization of audio and text, the apparatus comprising:
a first obtaining module, configured to acquire a target audio and a recognized text converted from the target audio;
a second obtaining module, configured to receive an editing operation on the recognized text to obtain an edited text of the recognized text;
a determining module, configured to determine a target text according to the recognized text and the edited text; and
a control module, configured to control the target text and the target audio to be synchronized;
wherein the determining module comprises:
a first obtaining unit, configured to obtain the modified words of the edited text relative to the recognized text;
a second obtaining unit, configured to obtain a plurality of first texts to be synchronized according to the modified words, wherein each of the plurality of first texts to be synchronized does not contain punctuation marks; and
a first determining unit, configured to determine the target text from the plurality of first texts to be synchronized;
wherein the first determining unit is specifically configured to:
align each first text to be synchronized in the plurality of first texts to be synchronized with the target audio through a preset acoustic model to obtain an acoustic score of each first text to be synchronized; obtain a language score of each first text to be synchronized through a preset language model; obtain a composite score of each first text to be synchronized according to the acoustic score and the language score of each first text to be synchronized; and determine the target text from the plurality of first texts to be synchronized according to the composite score of each first text to be synchronized.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of controlling audio and text synchronization according to any of claims 1 to 5 when executing the computer program.
8. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of controlling audio and text synchronization according to any one of claims 1 to 5.
CN201811151871.8A 2018-09-29 2018-09-29 Method and device for controlling synchronization of audio and text Active CN109275009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811151871.8A CN109275009B (en) 2018-09-29 2018-09-29 Method and device for controlling synchronization of audio and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811151871.8A CN109275009B (en) 2018-09-29 2018-09-29 Method and device for controlling synchronization of audio and text

Publications (2)

Publication Number Publication Date
CN109275009A CN109275009A (en) 2019-01-25
CN109275009B true CN109275009B (en) 2021-10-19

Family

ID=65195171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811151871.8A Active CN109275009B (en) 2018-09-29 2018-09-29 Method and device for controlling synchronization of audio and text

Country Status (1)

Country Link
CN (1) CN109275009B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129935B (en) * 2021-06-16 2021-08-31 北京新唐思创教育科技有限公司 Method, device, storage medium and electronic device for acquiring audio dotted data
CN115050393B (en) * 2022-06-23 2024-07-12 安徽听见科技有限公司 Method, apparatus, device and storage medium for obtaining audio of listening
CN120236587B (en) * 2022-08-09 2025-08-26 北京稀宇极智科技有限公司 Text acquisition method and device, storage medium and computer equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961895B1 (en) * 2000-08-10 2005-11-01 Recording For The Blind & Dyslexic, Incorporated Method and apparatus for synchronization of text and audio data
CN101253549A (en) * 2005-08-26 2008-08-27 皇家飞利浦电子股份有限公司 System and method for synchronizing sound and manually transcribed text
CN101382937A (en) * 2008-07-01 2009-03-11 深圳先进技术研究院 Speech recognition-based multimedia resource processing method and its online teaching system
CN103366741A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input error correction method and system
CN106328146A (en) * 2016-08-22 2017-01-11 广东小天才科技有限公司 Video subtitle generating method and device
CN108259971A (en) * 2018-01-31 2018-07-06 百度在线网络技术(北京)有限公司 Subtitle adding method, device, server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8826354B2 (en) * 2010-12-01 2014-09-02 At&T Intellectual Property I, L.P. Method and system for testing closed caption content of video assets


Also Published As

Publication number Publication date
CN109275009A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN107622054B (en) Text data error correction method and device
KR20220054587A (en) Speech Recognition Methods and Related Products
CN109275009B (en) Method and device for controlling synchronization of audio and text
JP5787780B2 (en) Transcription support system and transcription support method
US12099815B2 (en) Providing subtitle for video content in spoken language
US20160189103A1 (en) Apparatus and method for automatically creating and recording minutes of meeting
TW201624467A (en) Meeting minutes device and method thereof for automatically creating meeting minutes
CN110750996B (en) Method and device for generating multimedia information and readable storage medium
JP7107229B2 (en) Information processing device, information processing method, and program
TW201624470A (en) Conference recording device and method for automatically generating conference record
JP2024509710A (en) Data processing methods, devices, equipment, and computer programs
JP7107228B2 (en) Information processing device, information processing method, and program
CN112242132B (en) Data labeling method, device and system in voice synthesis
CN110265026B (en) Conference shorthand system and conference shorthand method
CN112466286A (en) Data processing method and device, terminal equipment
CN118038852A (en) Corpus acquisition method and device, electronic equipment, storage medium and program product
CN113744718A (en) Voice text output method and device, storage medium and electronic device
CN110264998B (en) Audio positioning method for conference shorthand system
CN110265027B (en) Audio transmission method for conference shorthand system
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
US20230023691A1 (en) Free-form text processing for speech and language education
CN110263313B (en) Man-machine collaborative editing method for conference shorthand
CN108831473B (en) Audio processing method and device
CN109949828B (en) Character checking method and device
CN112651854B (en) Voice scheduling method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant