CN115937628A - Model training data sample acquisition method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN115937628A
- Application number: CN202211530092.5A
- Authority: CN (China)
- Prior art keywords: video, description, time interval, text, frame images
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Classification: Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
The embodiment of the application provides a method and a device for obtaining model training data samples, an electronic device, and a storage medium, and relates to the technical field of intelligent analysis. The method comprises: acquiring a source video and a description video corresponding to the source video, wherein the description video comprises images and text describing the source video; obtaining a matching result of the text and a source video segment based on video segments in which the description video and the source video are matched with each other and the text of the description video; and generating a model training data sample corresponding to the matching result. By this method, the time consumed for obtaining model training data samples is shortened.
Description
Technical Field
The present application relates to the field of intelligent analysis technologies, and in particular, to a method and an apparatus for obtaining model training data samples, an electronic device, and a storage medium.
Background
Search, recommendation, authoring, and similar applications in the video field generally require trained multi-modal models, and training such models generally requires a large amount of matched text-video data. In the prior art, labeling this data consumes a large amount of manpower, so acquiring training data takes a long time.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for obtaining model training data samples, an electronic device, and a storage medium, so as to shorten the time consumed for obtaining model training data samples. The specific technical scheme is as follows:
in a first aspect of this application, a method for obtaining model training data samples is provided, including:
acquiring a source video and a description video corresponding to the source video, wherein the description video comprises images and text describing the source video;
obtaining a matching result of the text and a source video segment based on video segments in which the description video and the source video are matched with each other and the text of the description video;
and generating a model training data sample corresponding to the matching result.
Optionally, the obtaining a matching result of the text and the source video segment based on the video segment in which the description video and the source video are matched with each other and the text of the description video includes:
determining matching segments of the source video and the description video based on similarity of frame images in the source video and frame images in the description video, wherein the matching segments comprise a first segment and a second segment which are matched with each other, and the first segment is a video segment in the description video; the second segment is a video segment in the source video;
performing text recognition on the frame image in the first segment to obtain a text corresponding to the first segment;
and matching the text with the second segment to obtain a matching result.
Optionally, the obtaining a matching result of the text and the source video segment based on the video segment in which the description video and the source video are matched with each other and the text of the description video includes:
determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video;
determining a time interval of the description video and the source video which are matched with each other based on the similarity of the frame images in the source video and the frame images in the description video, and obtaining a corresponding relation between the time interval of the description video and the time interval of the source video;
determining the corresponding relation between the source video time interval and the text in the description type video according to the corresponding relation between the description type video time interval and the source video time interval and the corresponding relation between the description type video time interval and the text in the description type video;
and acquiring a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
Optionally, the obtaining a matching result of the text and the source video segment based on the video segment in which the description video and the source video are matched with each other and the text of the description video includes:
determining a time interval of the description video and the source video which are matched with each other based on the similarity of the frame images in the source video and the frame images in the description video, and obtaining a corresponding relation between the time interval of the description video and the time interval of the source video;
determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video;
determining the corresponding relation between the source video time interval and the text in the description type video according to the corresponding relation between the description type video time interval and the source video time interval and the corresponding relation between the description type video time interval and the text in the description type video;
and acquiring a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
Optionally, the generating a model training data sample corresponding to the matching result includes:
generating model training data corresponding to one frame image in the source video clip and model training data corresponding to the text matched with that frame image, to form a model training data sample; or
and combining the multi-frame images in the source video clip based on one frame of image in the source video clip to generate model training data corresponding to the multi-frame images in the source video clip and model training data corresponding to the text matched with the multi-frame images to form a model training data sample.
Optionally, the determining, based on the similarity between the frame image in the source video and the frame image in the description video, a time interval in which the description video and the source video are matched with each other, and obtaining a correspondence between the time interval of the description video and the time interval of the source video includes:
for each frame image in a plurality of frame images in the description type video, retrieving a similar frame image with the similarity of the frame image not less than a preset similarity threshold from the source video;
if the similar frame images corresponding to the adjacent frame images in the description video meet preset similar conditions, merging the time of the adjacent frame images to obtain a first description video time interval, and merging the time of the similar frame images to obtain a source video time interval, wherein the adjacent frame images are temporally adjacent frame images in the description video;
and establishing a corresponding relation between the first description type video time interval and the source video time interval.
Optionally, the determining, based on the text of the frame image in the description video, a correspondence between a time interval of the description video and the text in the description video includes:
performing text recognition on each frame image in a plurality of frame images in the description video by using a text recognition algorithm to obtain a text recognition result corresponding to the frame image;
if the text recognition results corresponding to the adjacent frame images in the description video meet preset matching conditions, unifying the text recognition results corresponding to the adjacent frame images into a target text;
merging the time of the adjacent frame images to obtain a second description video time interval;
and establishing a corresponding relation between the second description type video time interval and the target text.
Optionally, the determining the correspondence between the source video time interval and the text in the description type video according to the correspondence between the description type video time interval and the source video time interval and the correspondence between the description type video time interval and the text in the description type video includes:
for each time interval in at least one time interval included in the second description type video time interval, calculating the coincidence degree of the time interval and each time interval included in the first description type video time interval, wherein the coincidence degree is the ratio of the intersection time interval length to the union time interval length;
and if the contact ratio is not less than a preset contact ratio threshold value, establishing the corresponding relation between the source video time interval and the text according to the corresponding relation between the first description type video time interval and the source video time interval and the corresponding relation between the second description type video time interval and the text.
Optionally, the step of satisfying the preset similarity condition by the similar frame images corresponding to the adjacent frame images in the description type video includes: the similar frame images corresponding to the adjacent frame images in the description video have the same frame image, or the time difference of the similar frame images corresponding to the adjacent frame images in the description video is smaller than a preset time difference value.
Optionally, the step of enabling the text recognition result corresponding to the adjacent frame image in the description video to satisfy the preset matching condition includes: the text recognition results corresponding to the adjacent frame images in the description video are the same, or the matching degree between the text recognition results corresponding to the adjacent frame images in the description video is not greater than the preset matching degree.
Optionally, unifying the text recognition results corresponding to the adjacent frame images into the target text includes:
if the character strings in the text recognition results corresponding to the adjacent frame images are different, counting the occurrence frequency of the different character strings, and taking the text recognition result corresponding to the character string with the largest occurrence frequency as the text corresponding to the adjacent frame images.
In a second aspect of this application, there is also provided a model training data sample acquiring apparatus, including:
the first acquisition module is used for acquiring a source video and a description video corresponding to the source video, wherein the description video comprises images and text describing the source video;
the matching module is used for obtaining a matching result of the text and the source video clip based on the video clip of the description video and the source video which are matched with each other and the text of the description video;
and the generating module is used for generating a model training data sample corresponding to the matching result.
Optionally, the matching module includes:
the first determining sub-module is used for determining matching segments of the source video and the description video based on the similarity of the frame images in the source video and the frame images in the description video, wherein the matching segments comprise a first segment and a second segment which are matched with each other, and the first segment is a video segment in the description video; the second segment is a video segment in the source video;
the first obtaining submodule is used for carrying out text recognition on the frame image in the first segment to obtain a text corresponding to the first segment;
and the second obtaining sub-module is used for matching the text with the second segment to obtain a matching result.
Optionally, the matching module includes:
the second determining submodule is used for determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video;
a third determining submodule, configured to determine, based on similarity between frame images in the source video and frame images in the description video, a time interval in which the description video and the source video are matched with each other, and obtain a correspondence between the description video time interval and the source video time interval;
a fourth determining submodule, configured to determine a correspondence between the source video time interval and the text in the description type video according to a correspondence between the description type video time interval and the source video time interval and a correspondence between the description type video time interval and the text in the description type video;
and the third obtaining submodule is used for obtaining a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
Optionally, the matching module includes:
a fifth determining submodule, configured to determine, based on similarity between frame images in the source video and frame images in the description video, a time interval in which the description video and the source video are matched with each other, and obtain a correspondence between the time interval of the description video and the time interval of the source video;
a sixth determining submodule, configured to determine, based on the text of the frame image in the description video, a correspondence between a description video time interval and the text in the description video;
a seventh determining sub-module, configured to determine a correspondence between the source video time interval and the text in the description type video according to a correspondence between the description type video time interval and the source video time interval and a correspondence between the description type video time interval and the text in the description type video;
and the fourth obtaining submodule is used for obtaining a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
Optionally, the generating module includes:
the first generation submodule is used for generating model training data corresponding to one frame image in the source video clip and model training data corresponding to the text matched with that frame image, to form a model training data sample; or
and the second generation submodule is used for combining the multi-frame images in the source video clip based on one frame of image in the source video clip, generating model training data corresponding to the multi-frame images in the source video clip and model training data corresponding to the text matched with the multi-frame images, and forming a model training data sample.
Optionally, the fifth determining sub-module is specifically configured to: for each frame image in a plurality of frame images in the description type video, retrieving a similar frame image with the similarity of the frame image not less than a preset similarity threshold from the source video; if the similar frame images corresponding to the adjacent frame images in the description video meet preset similar conditions, merging the time of the adjacent frame images to obtain a first description video time interval, and merging the time of the similar frame images to obtain a source video time interval, wherein the adjacent frame images are temporally adjacent frame images in the description video; and establishing a corresponding relation between the first description type video time interval and the source video time interval.
Optionally, the sixth determining submodule is specifically configured to: performing text recognition on each frame image in a plurality of frame images in the description video by using a text recognition algorithm to obtain a text recognition result corresponding to the frame image; if the text recognition results corresponding to the adjacent frame images in the description video meet preset matching conditions, unifying the text recognition results corresponding to the adjacent frame images into a target text; merging the time of the adjacent frame images to obtain a second description video time interval; and establishing a corresponding relation between the second description type video time interval and the target text.
Optionally, the seventh determining sub-module is specifically configured to: for each time interval in at least one time interval included in the second description type video time interval, calculating the coincidence degree of the time interval and each time interval included in the first description type video time interval, wherein the coincidence degree is the ratio of the intersection time interval length to the union time interval length; and if the contact ratio is not less than a preset contact ratio threshold value, establishing a corresponding relation between the source video time interval and the text according to the corresponding relation between the first description type video time interval and the source video time interval and the corresponding relation between the second description type video time interval and the text.
Optionally, the step of satisfying the preset similarity condition by the similar frame images corresponding to the adjacent frame images in the description type video includes: similar frame images corresponding to adjacent frame images in the description type video have the same frame image, or the time difference of the similar frame images corresponding to the adjacent frame images in the description type video is smaller than a preset time difference value.
Optionally, the step of enabling the text recognition result corresponding to the adjacent frame image in the description video to satisfy the preset matching condition includes: the text recognition results corresponding to the adjacent frame images in the description video are the same, or the matching degree between the text recognition results corresponding to the adjacent frame images in the description video is not greater than the preset matching degree.
Optionally, the sixth determining submodule is specifically configured to:
if the character strings in the text recognition results corresponding to the adjacent frame images are different, counting the occurrence frequency of the different character strings, and taking the text recognition result corresponding to the character string with the largest occurrence frequency as the text corresponding to the adjacent frame images.
In a third aspect of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method for obtaining model training data samples according to any one of the first aspect described above when executing a program stored in a memory.
In a fourth aspect implemented by the present application, there is further provided a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the model training data sample acquiring method according to any one of the first aspect.
In yet another aspect of this embodiment, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the model training data sample acquisition method of any one of the above first aspects.
The embodiment of the application provides a method and a device for obtaining model training data samples, an electronic device, and a storage medium. The method comprises: acquiring a source video and a description video corresponding to the source video, wherein the description video comprises images and text describing the source video; obtaining a matching result of the text and a source video segment based on video segments in which the description video and the source video are matched with each other and the text of the description video; and generating a model training data sample corresponding to the matching result. The text of the description video is matched with the corresponding source video segment to form text-video matching pairs, providing matched text-video data samples for the training of a multi-modal model; and the source video segments and the corresponding texts can be extracted automatically, so that the time consumed for obtaining model training data samples is shortened.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a first flowchart of a method for obtaining model training data samples according to an embodiment of the present disclosure;
fig. 2 is a second flowchart of a method for obtaining model training data samples according to an embodiment of the present disclosure;
FIG. 3a is a third schematic flowchart of a method for obtaining model training data samples according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of an actual application of the method for obtaining model training data samples according to the embodiment of the present application;
fig. 4 is a fourth flowchart illustrating a method for obtaining model training data samples according to an embodiment of the present disclosure;
fig. 5 is a fifth flowchart illustrating a method for obtaining model training data samples according to an embodiment of the present disclosure;
fig. 6 is a sixth flowchart illustrating a method for obtaining model training data samples according to an embodiment of the present disclosure;
fig. 7 is a seventh flowchart illustrating a method for obtaining model training data samples according to an embodiment of the present disclosure;
fig. 8 is an eighth flowchart illustrating a method for obtaining a model training data sample according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a model training data sample acquisition device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In order to shorten the time consumption for obtaining model training data samples, embodiments of the present application provide a method and an apparatus for obtaining model training data samples, an electronic device, and a storage medium.
The text of the description video is matched with the corresponding source video segment to form text-video matching pairs, providing matched text-video data samples for the training of a multi-modal model; and the source video segments and the corresponding texts can be extracted automatically, so that the time consumed for obtaining model training data samples is shortened.
Firstly, a method for obtaining model training data samples provided by the embodiment of the present application is described in detail, with reference to fig. 1, including the following steps:
step S101, a source video and a description video corresponding to the source video are obtained, wherein the description video comprises an image and a text for describing the source video.
The source video may be a movie video, a variety-show video, a television-series video, and the like, and the description video may be an explanation (commentary) video, a recorded video, and the like, which is not specifically limited in this application. The description video includes images and text describing the source video. In one example, among the many subtitled movie commentary videos on the network, when a subtitle says "Wang Xiaoming suddenly picks up a gun and points it at Wang Jian's head", the video content shown at that moment is generally the same as the content being described, that is, the corresponding content in the movie video.
And S102, obtaining a matching result of the text and the source video clip based on the video clip of the description video and the source video which are matched with each other and the text of the description video.
In one example, when a subtitle of the explanation video says "Wang Xiaoming suddenly picks up a gun and points it at Wang Jian's head", the segment in the explanation video corresponding to this subtitle is determined, the matching segment of the movie video is found based on the mutually matched segments of the explanation video and the movie video, and the subtitle is matched with that movie video segment to form a text-video matching pair.
And step S103, generating a model training data sample corresponding to the matching result.
In the embodiment, the text of the description video is matched with the corresponding source video segment to form a text video matching pair, so that text and video matching pair data samples are provided for the training of the multi-modal model; and the source video clips and the corresponding texts can be automatically extracted, so that the time consumption for obtaining model training data samples is shortened.
Referring to fig. 2, a second flowchart of the method for obtaining model training data samples provided in the embodiment of the present application is detailed in step S102 based on fig. 1, and includes the following steps:
step S201, determining matching segments of a source video and an explanation video based on the similarity of frame images in the source video and frame images in the explanation video, wherein the matching segments comprise a first segment and a second segment which are matched with each other, and the first segment is a video segment in the explanation video; the second segment is a video segment in the source video.
When the similarity between each frame of the description video and the corresponding source video is calculated, the similarity may be calculated for all frame images of the description video and the corresponding source video, or the similarity may be calculated for frame images obtained by frame extraction. The similarity can be calculated in various ways, in one example, the similarity between each frame of the description class video and the corresponding source video can be calculated by means of a frame image similarity model, the frame images of the description class video and the frame images of the source video which exceed the similarity threshold are considered to be consistent, and then the matching segments of the description class video and the source video are determined. The similarity threshold may be determined according to actual requirements or experience, and may be set higher in order to ensure accuracy of the matching segments.
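As a non-authoritative illustration of the per-frame similarity comparison described above, the following sketch compares frame feature vectors with cosine similarity and keeps frame pairs that reach a threshold. The feature extractor, function names, and the 0.9 threshold are assumptions, not part of the application.

```python
# Hypothetical sketch: per-frame similarity between description-video frames and
# source-video frames using cosine similarity of precomputed feature vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_frames(desc_features, src_features, threshold=0.9):
    """Return (desc_idx, src_idx, score) triples whose similarity reaches the threshold."""
    matches = []
    for i, df in enumerate(desc_features):
        for j, sf in enumerate(src_features):
            score = cosine_similarity(df, sf)
            if score >= threshold:
                matches.append((i, j, score))
    return matches
```

In practice the nested loop would be replaced by a vectorized or index-based search, but the thresholding logic is the same.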
Step S202, performing text recognition on the frame image in the first segment to obtain a text corresponding to the first segment.
The first segment is a matching video segment in the description video, and the text of each frame in the first segment is recognized by OCR (Optical Character Recognition) to obtain the text corresponding to the first segment. OCR refers to detecting characters in an image using optical and computer technology and then recognizing the content of those characters. In one example, the first segment is a matching segment in the explanation video, and the subtitle of each frame in that matching segment can be recognized by OCR text recognition.
And step S203, matching the text with the second segment to obtain a matching result.
In step S201, matching segments of the source video and the description video, that is, a first segment and a second segment that are matched with each other have been determined, where the first segment is a video segment in the description video, and the second segment is a video segment in the source video; in step S202, a text corresponding to the first segment is obtained; and matching the text with the second segment based on the steps S201 and S202 to obtain a matching result of the second segment and the corresponding text.
In the embodiment, the matching segments of the source video and the description video are obtained firstly, then the texts corresponding to the matching segments of the description video are obtained, the matching segments of the source video and the texts are matched based on the matching and corresponding relations, and the text and video matching pairs are formed, so that the text and video matching pair data samples are provided for the training of the multi-modal model, a large amount of manpower is not required to be consumed for labeling, and the time consumption for obtaining the model training data samples is shortened.
Referring to fig. 3a, a third flowchart of the method for obtaining model training data samples provided in the embodiment of the present application, which is based on fig. 1, is refined in step S102, and includes the following steps:
step S301, determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video.
The text of each frame of the description video is recognized by OCR text recognition, and recognition results with the same text are merged to obtain the texts in the description video, thereby obtaining the correspondence between description video time intervals and the texts in the description video.
Step S302, based on the similarity of the frame images in the source video and the frame images in the description video, determining a time interval in which the description video and the source video are matched with each other, and obtaining a corresponding relation between the time interval of the description video and the time interval of the source video.
It should be noted that, in this embodiment, the description-type video time interval is first associated with the text in the description-type video, that is, step S301 is executed first, and then step S302 is executed.
The similarity between each frame of the description video and the frames of the corresponding source video can be calculated by the frame-image similarity model; alternatively, based on the correspondence between description video time intervals and texts obtained in step S301, a part of the description video time intervals can be selected, and the similarity is calculated only between frame images within those time intervals and frame images of the source video. Frame images of the description video and frame images of the source video whose similarity exceeds the threshold are considered consistent; frames with close times in the frame-image matching result are then merged into matching video segments, and the time intervals in which the description video and the source video match each other are determined, thereby obtaining the correspondence between description video time intervals and source video time intervals.
When calculating the similarity between each frame of the frame image in the source video and the frame image in the description video, the frame image may be all the frame images in the description video and the source video, or may be a plurality of frame images obtained by performing frame extraction on the description video and the source video.
And according to a preset frame rate, extracting frames of the description video and the corresponding source video to obtain frame images of the description video and the source video, and calculating the time of each frame image in the corresponding video according to the preset frame rate for confirming a subsequent time interval. The preset frame rate may be 24fps (frames per second), 25fps, 30fps, 48fps, and the like. In one example, to ensure the number of frame images resulting from frame decimation, the frame rate may be set to 48fps.
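To make the frame-extraction step above concrete, the following is a minimal sketch, assuming OpenCV is available, of sampling frames at roughly a preset rate and computing each frame's time in the video. The 48 fps value follows the example above; paths and the sampling strategy are assumptions.

```python
# Minimal sketch: sample frames and record each frame's timestamp for later interval work.
import cv2

def extract_frames(video_path: str, sample_fps: float = 48.0):
    """Yield (timestamp_seconds, frame) pairs sampled at approximately sample_fps."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or sample_fps
    step = max(int(round(native_fps / sample_fps)), 1)  # keep every step-th frame
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / native_fps, frame  # time of this frame within the video
        index += 1
    cap.release()
```

If the preset rate exceeds the video's native frame rate, this sketch simply keeps every frame.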
Step S303, determining the corresponding relation between the time interval of the source video and the text in the description type video according to the corresponding relation between the time interval of the description type video and the time interval of the source video and the corresponding relation between the time interval of the description type video and the text in the description type video.
The corresponding relation between the description video time interval and the text in the description video is obtained in step S301, the corresponding relation between the description video time interval and the source video time interval is obtained in step S302, and the corresponding relation between the source video time interval and the text in the description video is determined based on the two corresponding relations.
Step S304, a source video segment corresponding to the source video time interval is obtained from the source video, and a matching result of the source video segment and the text is obtained.
After the correspondence between source video time intervals and texts in the description video is obtained, the corresponding source video segment is cut out according to the source video time interval, giving the text-video matching pair.
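As a hedged illustration of cutting out the matched segment, the sketch below shells out to the ffmpeg command-line tool; the paths, interval values, and the choice of ffmpeg are assumptions, not specified by the application.

```python
# Hypothetical sketch: cut the source video segment for a matched time interval.
import subprocess

def cut_segment(src_path: str, start: float, end: float, out_path: str) -> None:
    # "-c copy" avoids re-encoding; cuts may snap to keyframes and be slightly imprecise.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ss", str(start),
         "-t", str(end - start), "-c", "copy", out_path],
        check=True,
    )

# e.g. cut_segment("movie.mp4", 125.0, 131.5, "clip_0001.mp4") paired with its matched text
```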
The following describes an example, as shown in fig. 3b, in which the description video is an explanation video and the corresponding source video is a movie video. Based on the subtitles of the frame images in explanation video time intervals, text recognition is performed by OCR, and the correspondence between explanation video time intervals and their texts is determined, as indicated by the labels of the second rectangular box 1 and the second rectangular box 2 in fig. 3b; the text content is "everyone is in king mansion" and "why he asks for lattice to do so". Based on the similarity between frame images in the movie video and frame images in the explanation video, the time intervals in which the explanation video and the movie video match each other are determined, as shown by the labels of the first rectangular frame 1 and the first rectangular frame 2 in fig. 3b, and the correspondence between explanation video time intervals and movie video time intervals is obtained. The correspondence between movie video time intervals and texts is then determined according to the correspondence between explanation video time intervals and movie video time intervals and the correspondence between explanation video time intervals and their texts, and the movie video clip corresponding to the movie video time interval is taken from the movie video, giving the matching result between the movie video clip and the text, namely the third rectangular frame label in fig. 3b.
In the above embodiment, the correspondence between description video time intervals and texts in the description video and the correspondence between description video time intervals and source video time intervals are combined to determine the correspondence between source video time intervals and texts in the description video, and the corresponding source video segments are then extracted according to the source video time intervals. In this way, source video segments and the corresponding texts can be extracted automatically, text-video matching pairs are obtained, and matched text-video data samples are provided for the training of a multi-modal model. Data for multi-modal model training can be obtained quickly in an automated manner, so that multi-modal models can be developed quickly and used in business fields such as retrieval, recognition, and creation. On the other hand, because step S301 is executed first, a part of the description video time intervals can be selected based on the correspondence between description video time intervals and texts, and the similarity only needs to be calculated between frame images within those time intervals and frame images of the source video, which saves computing resources.
Referring to fig. 4, a fourth flowchart of the method for obtaining model training data samples provided in the embodiment of the present application, which is based on fig. 1, details step S102, and includes the following steps:
step S401, based on the similarity of the frame images in the source video and the frame images in the description video, determining a time interval in which the description video and the source video are matched with each other, and obtaining a corresponding relation between the time interval of the description video and the time interval of the source video.
The similarity between each frame of the description video and the frames of the corresponding source video is calculated by the frame-image similarity model; frame images whose similarity exceeds the threshold are considered consistent, frames with close times in the frame-image matching result are merged into matching video segments, and the time intervals of the mutually matched segments of the description video and the source video are determined, thereby obtaining the correspondence between description video time intervals and source video time intervals.
Step S402, determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video.
The detailed analysis is the same as above, and will not be described herein.
It should be noted that, in the present embodiment, the execution order of steps S401 and S402 is the reverse of the execution order of steps S301 and S302 in the above embodiment. In this embodiment, the description video time intervals are first matched with the source video time intervals; that is, the operation of step S302 in the above embodiment is executed first, and then the operation of step S301 in the above embodiment is executed.
Step S403, determining a correspondence between the time interval of the source video and the text in the description video according to the correspondence between the time interval of the description video and the time interval of the source video and the correspondence between the time interval of the description video and the text in the description video.
The detailed analysis is the same as above, and is not repeated herein.
Step S404, a source video segment corresponding to the source video time interval is obtained from the source video, and a matching result of the source video segment and the text is obtained.
The detailed analysis is the same as above, and is not repeated herein.
In this embodiment, the correspondence between description video time intervals and source video time intervals and the correspondence between description video time intervals and texts in the description video are combined to determine the correspondence between source video time intervals and texts in the description video, and the corresponding source video segments are then extracted according to the source video time intervals. In this way, source video segments and the corresponding texts can be extracted automatically, text-video matching pairs are obtained, and matched text-video data samples are provided for the training of a multi-modal model. Data for multi-modal model training can be obtained quickly in an automated manner, so that multi-modal models can be developed quickly and used in business fields such as retrieval, recognition, and creation. On the other hand, because the description video time intervals are matched with the source video time intervals first, the correspondence between texts and the description video has not yet been determined, so it is not possible to select only the part of the description video that contains text for similarity calculation; the similarity between every frame of the description video and the frames of the corresponding source video therefore needs to be calculated, which can improve the recall of the matched text-video data samples.
Referring to fig. 5, a fifth flowchart of the method for obtaining model training data samples provided in the embodiment of the present application is detailed in step S103 based on fig. 1, and includes the following steps:
step S501, generating model training data corresponding to one frame of image in a source video clip and model training data corresponding to one frame of image-matched text to form a model training data sample;
step S502, based on one frame of image in the source video clip, combining the multiple frames of images in the source video clip, generating model training data corresponding to the multiple frames of images in the source video clip and model training data corresponding to the text matched with the multiple frames of images, and forming a model training data sample.
In the above embodiment, a multi-modal model training data sample may be composed of model training data corresponding to one frame image in the source video clip and model training data corresponding to the text matched with that frame image, or may be composed of model training data corresponding to multiple frame images in the source video clip and model training data corresponding to the text matched with those frame images; the number of model training data samples is not limited.
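The sketch below illustrates, under assumptions, how one matching result might be turned into training samples containing either a single frame or a group of frames together with the matched text; the class and field names are hypothetical.

```python
# Minimal sketch of packaging a matched (source clip frames, text) pair into samples.
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    frames: List[str]   # path(s) of frame image(s) taken from the source video clip
    text: str           # the matched text recognized from the description video

def build_samples(frame_paths: List[str], text: str, frames_per_sample: int = 1):
    """Group frames_per_sample frames with the matched text to form samples."""
    samples = []
    for i in range(0, len(frame_paths), frames_per_sample):
        group = frame_paths[i:i + frames_per_sample]
        samples.append(TrainingSample(frames=group, text=text))
    return samples
```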
Referring to fig. 6, a sixth flowchart of the method for obtaining model training data samples provided in the embodiment of the present application is detailed in step S401 based on fig. 4, and includes the following steps:
step S601, for each frame image of the plurality of frame images in the description-like video, retrieving a similar frame image having a similarity not smaller than a preset similarity threshold with the frame image from the source video.
For each frame image of the plurality of frame images in the description video, source video frame images whose similarity with that frame image exceeds the preset similarity threshold are retrieved using any frame-image retrieval method, and the top K retrieval results are kept. To ensure the accuracy of the matching segments, K may be set to 2 to 5; in one example, K is 3. Frame-image retrieval methods include the perceptual hash algorithm, the color distribution method, the content feature method, and the like. In one example, the content feature method may be used to perform the retrieval in the source video. In addition, to ensure high similarity between frame images in the description video and frame images in the source video, the preset similarity threshold may be set relatively high.
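As a non-authoritative sketch of the top-K retrieval just described, the following keeps up to K source frames whose similarity to one description-video frame reaches the threshold; the feature representation and the 0.9 threshold are assumptions.

```python
# Hypothetical sketch: top-K similar source frames (K = 3 in the example) for one frame.
import heapq
import numpy as np

def top_k_similar(desc_feature, src_features, k=3, threshold=0.9):
    """Return up to k (source_index, score) pairs with score >= threshold, best first."""
    scored = []
    for j, sf in enumerate(src_features):
        score = float(np.dot(desc_feature, sf) /
                      (np.linalg.norm(desc_feature) * np.linalg.norm(sf) + 1e-12))
        if score >= threshold:
            scored.append((j, score))
    return heapq.nlargest(k, scored, key=lambda p: p[1])
```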
For each frame image in a plurality of frame images in the description video, the plurality of frame images may be all frame images in the description video or a plurality of frame images obtained by performing frame extraction on the description video.
And according to a preset frame rate, extracting frames of the description video and the corresponding source video to obtain frame images of the description video and the source video, and calculating the time of each frame image in the corresponding video according to the preset frame rate for confirming a subsequent time interval. The preset frame rate may be 24fps (frame per second), 25fps, 30fps, 48fps, etc. In one example, to ensure the number of frame images resulting from frame decimation, the frame rate may be set to 48fps.
Step S602, if the similar frame images corresponding to the adjacent frame images in the description video satisfy the preset similar condition, merging the time of the adjacent frame images to obtain a first description video time interval, and merging the time of the similar frame images to obtain a source video time interval, where the adjacent frame images are temporally adjacent frame images in the description video.
In a possible implementation manner, the step of satisfying a preset similarity condition by similar frame images corresponding to adjacent frame images in the description-type video includes: the similar frame images corresponding to the adjacent frame images in the description video have the same frame image, or the time difference of the similar frame images corresponding to the adjacent frame images in the description video is smaller than a preset time difference value.
If the retrieval results of temporally adjacent frame images in the description video contain the same frame image, or the time difference between the similar frame images corresponding to the adjacent frame images is smaller than the preset time difference, the times at which those description video frame images appear in the description video are merged to form a first description video time interval, and the times at which the source video frame images (the similar frame images) satisfying the above condition appear in the source video are merged to form a source video time interval; that is, the first description video time interval and the source video time interval are obtained respectively.
The preset time difference may range from 1/48 second to 1/24 second, and in one example, the preset time difference may be set to 1/48 second.
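The following sketch gives one simplified reading of the merging step above: matched frame pairs sorted by description-video time are grouped into paired intervals whenever consecutive pairs are close enough in both videos. The gap values and the exact merging condition are assumptions that simplify the conditions described in the text.

```python
# Minimal sketch, under assumptions: merge temporally adjacent matched frames into
# paired (description interval, source interval) tuples.
def merge_matched_frames(matches, max_src_gap=1.0 / 48, max_desc_gap=1.0 / 24):
    """matches: list of (desc_time, src_time) pairs sorted by desc_time."""
    intervals = []
    if not matches:
        return intervals
    d_start, s_start = matches[0]
    d_prev, s_prev = matches[0]
    for d_t, s_t in matches[1:]:
        if d_t - d_prev <= max_desc_gap and abs(s_t - s_prev) <= max_src_gap:
            d_prev, s_prev = d_t, s_t          # extend the current matched segment
        else:
            intervals.append(((d_start, d_prev), (s_start, s_prev)))
            d_start, s_start, d_prev, s_prev = d_t, s_t, d_t, s_t
    intervals.append(((d_start, d_prev), (s_start, s_prev)))
    return intervals
```

Each returned tuple corresponds to one first description video time interval and its matched source video time interval.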
Step S603, establishing a corresponding relationship between the first description type video time interval and the source video time interval.
And establishing a corresponding relation between the time interval of the first description type video and the time interval of the source video based on the obtained time intervals of the first description type video and the source video.
In the above embodiment, if the similar frame images corresponding to the adjacent frame images in the description type video satisfy the preset similar condition, merging the time of the adjacent frame images to obtain a first description type video time interval, and merging the time of the similar frame images to obtain a source video time interval; through the obtained time intervals of the first description type videos and the source videos, the corresponding relation between the time interval of the first description type videos and the time interval of the source videos is obtained.
Referring to fig. 7, a seventh flowchart of the method for acquiring a model training data sample according to the embodiment of the present application is detailed in step S402 based on fig. 4, and includes the following steps:
step S701, aiming at each frame image in a plurality of frame images in the description video, performing text recognition on the frame image by using a text recognition algorithm to obtain a text recognition result corresponding to the frame image.
The plurality of frame images may be all frame images in the description type video, or may be a plurality of frame images obtained by performing frame extraction on the description type video. And according to a preset frame rate, performing frame extraction on the description video to obtain frame images of the description video, and calculating the time of each frame image in the description video according to the preset frame rate of the frame extraction so as to confirm the subsequent time interval. The preset frame rate may be 24fps, 25fps, 30fps, 48fps, etc. In one example, to ensure the number of frame images resulting from frame decimation, the frame rate may be set to 48fps.
OCR text recognition algorithms include CNN+RNN+CTC (Convolutional Neural Network, Recurrent Neural Network, Connectionist Temporal Classification) and attention-based CNN+RNN (Convolutional Neural Network, Recurrent Neural Network with an attention mechanism), among others. Any OCR algorithm may be used to recognize the OCR result, that is, the corresponding text recognition result, in each of the plurality of frame images in the description video. In one example, CNN+RNN+CTC may be used to recognize the text in each of the plurality of frame images in the description video.
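The application names CNN+RNN+CTC style recognizers; purely as a stand-in, the sketch below uses the off-the-shelf pytesseract wrapper (which requires a local Tesseract installation) to obtain a per-frame recognition result. The library choice and the language code are assumptions, not the application's method.

```python
# Hypothetical sketch: OCR of one description-video frame with an off-the-shelf library.
import pytesseract
from PIL import Image

def recognize_subtitle(frame_path: str) -> str:
    """Return the OCR text of one description-video frame image."""
    return pytesseract.image_to_string(Image.open(frame_path), lang="chi_sim").strip()
```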
Step S702, if the text recognition results corresponding to the adjacent frame images in the description video meet the preset matching conditions, unifying the text recognition results corresponding to the adjacent frame images into the target text.
In a possible implementation manner, the step of enabling the text recognition results corresponding to the adjacent frame images in the description-type video to satisfy the preset matching condition includes: the text recognition results corresponding to the adjacent frame images in the description video are the same, or the matching degree between the text recognition results corresponding to the adjacent frame images in the description video is not greater than the preset matching degree.
If the matching degree of the OCR results of temporally adjacent frame images in the description video is not greater than the preset matching degree, the OCR results are unified into the same text (if the OCR results are identical, that text is used; if they differ, the occurrence counts of the different strings are voted on, and the string with the largest count is taken as the final OCR text). Here, for the text recognition results of any two adjacent frame images, the matching degree is the ratio of the edit distance between the two text recognition results to the string length of the shorter one. The edit distance is mainly used to compare the similarity of two strings; it is a string metric measuring the difference between two strings, defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. In general, the smaller the edit distance, the greater the similarity of the two strings; the larger the edit distance, the lower their similarity.
In addition, the string length of the longer of the two texts may instead be used as the denominator of the matching degree, which is not limited in the present application.
In one example, to ensure high similarity between the text recognition results of adjacent frame images, the preset matching degree may be set relatively small, which also facilitates the subsequent correspondence between time intervals.
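A minimal sketch of the matching degree defined above follows: the Levenshtein edit distance between two OCR results divided by the length of the shorter string, where smaller values mean more similar texts. The function names and the way the threshold is applied are illustrative assumptions.

```python
# Minimal sketch of the matching degree: edit distance / length of the shorter string.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))                 # distances against the empty prefix of a
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def matching_degree(a: str, b: str) -> float:
    shorter = min(len(a), len(b)) or 1
    return edit_distance(a, b) / shorter

# Adjacent frames' OCR results are treated as the same text when
# matching_degree(ocr_a, ocr_b) is not greater than a preset (small) threshold.
```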
Step S703, merging the time of the adjacent frame images to obtain a second description type video time interval.
The times of the adjacent frame images in the description video whose text recognition results satisfy the preset matching condition are merged to form a second description video time interval.
Step S704, a corresponding relationship between the second description type video time interval and the target text is established.
The correspondence between the second description video time interval and the target text is established based on the obtained second description video time interval and its corresponding target text.
In the above embodiment, the text recognition results corresponding to adjacent frame images in the description video are unified into the target text according to the preset matching condition; that is, if the strings in the text recognition results of adjacent frame images differ, the occurrence counts of the different strings are tallied, and the text recognition result corresponding to the string with the largest count is taken as the text (target text) of those adjacent frames. The times of the adjacent frame images, that is, the times at which the adjacent frame images whose text recognition results satisfy the preset matching condition appear in the description video, are merged to obtain the second description video time interval; and the correspondence between the second description video time interval and the target text is obtained based on that interval and its corresponding target text.
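The voting step above can be sketched as follows: the OCR string that occurs most often among the adjacent frames is kept as the target text. The function name is an assumption.

```python
# Minimal sketch of unifying adjacent frames' OCR results into one target text by voting.
from collections import Counter

def unify_ocr_results(ocr_results):
    """ocr_results: list of OCR strings from temporally adjacent frames."""
    if not ocr_results:
        return ""
    counts = Counter(ocr_results)
    target_text, _ = counts.most_common(1)[0]   # most frequent string wins
    return target_text
```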
Referring to fig. 8, an eighth flowchart of the method for obtaining model training data samples provided in the embodiment of the present application, which is based on fig. 4, refines step S403, and includes the following steps:
step S801, calculating, for each time interval of at least one time interval included in the second description type video time interval, a coincidence degree between the time interval and each time interval included in the first description type video time interval, where the coincidence degree is a ratio of an intersection time interval length to a union time interval length.
The first description type video time interval comprises at least one time interval and the second description type video time interval comprises at least one time interval.
Because the time intervals in the first description video time intervals may not be exactly the same as those in the second description video time intervals, they need to be put into correspondence according to a certain method: for each second description video time interval, each first description video time interval is traversed, and the coincidence degree IoU of the two time intervals is calculated (IoU = intersection time interval length / union time interval length). The intersection time interval length is the length of the intersection of the first description video time interval and the second description video time interval, and the union time interval length is the length of their union.
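A minimal sketch of the coincidence degree (IoU) of two time intervals, exactly as defined above, is given below; the function name and the way intervals are represented are assumptions.

```python
# Minimal sketch: IoU of two time intervals = intersection length / union length.
def interval_iou(a, b):
    """a, b: (start, end) time intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# A second description video time interval is matched to a first description video time
# interval when interval_iou(second, first) is not less than the preset threshold.
```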
In step S802, if the coincidence degree is not less than a preset coincidence degree threshold, a correspondence between the source video time interval and the text is established according to the correspondence between the first description type video time interval and the source video time interval and the correspondence between the second description type video time interval and the text.
For each pair of a first description type video time interval and a second description type video time interval, if the IoU is not less than the preset coincidence degree threshold, the two time intervals are considered to substantially correspond; the source video time interval and the target text that respectively correspond to them are then associated with each other, and the correspondence between the source video time interval and the target text is established. To ensure that this correspondence is reliable, the preset coincidence degree threshold may be set to a high value.
In the above embodiment, the coincidence degree of the first description type video time interval and the second description type video time interval is calculated; if the coincidence degree is not less than the preset coincidence degree threshold, the two time intervals are considered to substantially correspond, and the source video time interval and the target text that respectively correspond to them are associated with each other, thereby obtaining the correspondence between the source video time interval and the target text.
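The coincidence degree computation of steps S801 and S802 can be illustrated with the following Python sketch: IoU = intersection time interval length / union time interval length, and a second description type video time interval is associated with a first description type video time interval (and hence with its source video time interval) only when the IoU is not less than the preset coincidence degree threshold. This is a minimal sketch; the dictionary-based representation of the correspondences and the threshold value 0.8 are assumptions for illustration.

```python
def interval_iou(a, b):
    """Coincidence degree of two time intervals a=(start, end), b=(start, end):
    intersection length divided by union length."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_text_to_source(first_map, second_map, iou_threshold=0.8):
    """first_map: {first_desc_interval: source_interval}
       second_map: {second_desc_interval: target_text}
       Returns {source_interval: target_text} for interval pairs whose
       coincidence degree is not less than the threshold."""
    result = {}
    for second_iv, text in second_map.items():
        for first_iv, source_iv in first_map.items():
            if interval_iou(second_iv, first_iv) >= iou_threshold:
                result[source_iv] = text
    return result

first_map = {(10.0, 20.0): (305.0, 315.0)}
second_map = {(10.5, 19.5): "the villain reveals his plan"}
print(match_text_to_source(first_map, second_map))
# {(305.0, 315.0): 'the villain reveals his plan'}
```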
Based on the same inventive concept as the method for obtaining model training data samples provided in the above embodiments, the embodiments of the present application provide a device for obtaining model training data samples, and referring to fig. 9, the device includes:
a first obtaining module 910, configured to obtain a source video and a description video corresponding to the source video, where the description video includes an image and a text describing the source video;
a matching module 920, configured to obtain a matching result between a text and a source video segment based on a video segment in which the description video and the source video are matched with each other and the text of the description video;
and a generating module 930, configured to generate a model training data sample corresponding to the matching result.
In the embodiment, the text of the description video is matched with the corresponding source video segment to form a text video matching pair, so that text and video matching pair data samples are provided for the training of the multi-modal model; and the source video clips and the corresponding texts can be automatically extracted, so that the time consumption for obtaining model training data samples is shortened.
In one possible implementation, the matching module 920 includes:
a first determining sub-module, configured to determine, based on similarity between frame images in the source video and frame images in the description-type video, a matching segment between the source video and the description-type video, where the matching segment includes a first segment and a second segment that are matched with each other, and the first segment is a video segment in the description-type video; the second segment is a video segment in the source video;
the first obtaining sub-module is used for performing text recognition on the frame image in the first segment to obtain a text corresponding to the first segment;
and the second obtaining submodule is used for matching the text with the second fragment to obtain a matching result.
In the above embodiment, the matching segments of the source video and the description video are obtained first, then the texts corresponding to the matching segments of the description video are obtained, and the matching segments of the source video are matched with the texts based on these matching and correspondence relations to form text-video matching pairs. Text and video matching pair data samples are thus provided for the training of the multi-modal model without consuming a large amount of manpower for labeling, which shortens the time consumed for obtaining model training data samples.
In one possible implementation, the matching module 920 includes:
the second determining submodule is used for determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video;
a third determining submodule, configured to determine, based on similarity between frame images in the source video and frame images in the description video, a time interval in which the description video and the source video are matched with each other, and obtain a correspondence between the description video time interval and the source video time interval;
a fourth determining submodule, configured to determine a correspondence between the source video time interval and the text in the description type video according to a correspondence between the description type video time interval and the source video time interval and a correspondence between the description type video time interval and the text in the description type video;
and the third obtaining sub-module is used for obtaining a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
In the above embodiment, the correspondence between the description video time interval and the text in the description video and the correspondence between the description video time interval and the source video time interval are combined to determine the correspondence between the source video time interval and the text in the description video, and the corresponding source video clip is then extracted according to the source video time interval. The source video clip and its corresponding text can therefore be extracted automatically to obtain text-video matching pairs, providing text and video matching pair data samples for the training of the multi-modal model. Through this automatic approach, data for multi-modal model training can be obtained quickly, so that the multi-modal model can be developed quickly and used in business fields such as retrieval, identification and creation. On the other hand, the similarity between the frame images within the description type video time interval and the frame images of the corresponding source video can be calculated by a frame image similarity model, so the frame images of the description type video do not need to be matched in full, which saves computing resources.
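Once the correspondence between a source video time interval and a text has been determined, the corresponding source video clip can be cut out of the source video. The following Python sketch shows one way to do this under the assumption that the source video has already been decoded into a list of frames at a known frame rate; the function name `extract_clip` and the placeholder frame list are illustrative only.

```python
def extract_clip(source_frames, fps, source_interval):
    """Return the frames of the source video clip that corresponds to the
    matched source video time interval (start_seconds, end_seconds)."""
    start_idx = int(source_interval[0] * fps)
    end_idx = int(source_interval[1] * fps) + 1
    return source_frames[start_idx:end_idx]

# Placeholder: integers stand in for decoded frame images.
source_frames = list(range(25 * 600))  # a 10-minute source video at 25 fps
clip = extract_clip(source_frames, fps=25, source_interval=(305.0, 315.0))
# Pairing the clip with its matched text yields a text-video matching pair.
sample = {"clip": clip, "text": "the villain reveals his plan"}
```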
In one possible implementation, the matching module 920 includes:
a fifth determining submodule, configured to determine, based on similarity between frame images in the source video and frame images in the description video, a time interval in which the description video and the source video are matched with each other, and obtain a correspondence between the time interval of the description video and the time interval of the source video;
a sixth determining submodule, configured to determine, based on the text of the frame image in the description video, a correspondence between a description video time interval and the text in the description video;
a seventh determining submodule, configured to determine a correspondence between the source video time interval and the text in the description type video according to a correspondence between the description type video time interval and the source video time interval and a correspondence between the description type video time interval and the text in the description type video;
and the fourth obtaining submodule is used for obtaining a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
In the embodiment, the corresponding relation between the description type video time interval and the source video time interval and the corresponding relation between the description type video time interval and the text in the description type video are integrated to determine the corresponding relation between the source video time interval and the text in the description type video, and then the corresponding source video fragment is extracted according to the source video time interval, so that the source video fragment and the corresponding text can be automatically extracted to obtain the text-video matching pair, and a text-video matching pair data sample is provided for the training of the multi-modal model. Through an automatic mode, data for multi-modal model training can be obtained quickly, so that the multi-modal model can be developed quickly and can be used in the business fields of retrieval, identification, creation and the like. On the other hand, the similarity between each frame of the description video and the corresponding source video is calculated through the frame image similarity model, the corresponding relation between the time interval of the description video and the time interval of the source video is obtained, and then text recognition is carried out, so that the recall ratio of text and video matching on data samples can be improved.
In one possible implementation, the generating module 930 includes:
the first generation submodule is used for generating model training data corresponding to one frame of image in the source video clip and model training data corresponding to the text matched with the one frame of image, to form a model training data sample; or
and the second generation submodule is used for combining the multi-frame images in the source video clip based on one frame of image in the source video clip, generating model training data corresponding to the multi-frame images in the source video clip and model training data corresponding to the text matched with the multi-frame images, and forming a model training data sample.
In the above embodiment, a multi-modal model training data sample may be composed of model training data corresponding to one frame of image in the source video clip and model training data corresponding to the text matched with that frame, or of model training data corresponding to multiple frames of images in the source video clip and model training data corresponding to the text matched with those frames; the number of model training data samples is not limited.
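The two generation strategies can be sketched in Python as follows: either each single frame of the source video clip is paired with the matched text, or consecutive frames are grouped and each group is paired with the text. The sample structure and the `window` parameter are assumptions for illustration and are not specified by the patent.

```python
def build_samples(clip_frames, text, multi_frame=False, window=3):
    """Form model training data samples from a source video clip and its matched text."""
    samples = []
    if not multi_frame:
        # One sample per single frame image in the source video clip.
        for frame in clip_frames:
            samples.append({"frames": [frame], "text": text})
    else:
        # Combine multiple frames of the clip into each sample.
        for i in range(0, len(clip_frames), window):
            samples.append({"frames": clip_frames[i:i + window], "text": text})
    return samples

single = build_samples(["f1", "f2", "f3", "f4"], "a car chase begins")
multi = build_samples(["f1", "f2", "f3", "f4"], "a car chase begins",
                      multi_frame=True, window=2)
```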
In a possible implementation, the fifth determining submodule is specifically configured to: for each frame image in a plurality of frame images in the description type video, retrieving a similar frame image with the similarity of the frame image not less than a preset similarity threshold from the source video; if the similar frame images corresponding to the adjacent frame images in the description video meet preset similar conditions, merging the time of the adjacent frame images to obtain a first description video time interval, and merging the time of the similar frame images to obtain a source video time interval, wherein the adjacent frame images are temporally adjacent frame images in the description video; and establishing a corresponding relation between the first description type video time interval and the source video time interval.
In the above embodiment, if the similar frame images corresponding to the adjacent frame images in the description type video satisfy the preset similar condition, merging the time of the adjacent frame images to obtain a first description type video time interval, and merging the time of the similar frame images to obtain a source video time interval; and acquiring the corresponding relation between the time interval of the first description type video and the time interval of the source video through the acquired time intervals of the first description type video and the source video.
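The behaviour of the fifth determining submodule can be sketched in Python as follows: for each description-video frame, a similar source frame is retrieved; runs of adjacent description frames whose similar source frames satisfy the preset similar condition are merged into a first description type video time interval and a matching source video time interval. The retrieval function, the similarity threshold, and the time-gap value are assumptions for illustration; in practice the retrieval would be backed by a frame image similarity model.

```python
def build_interval_correspondence(desc_frames, retrieve_similar,
                                  sim_threshold=0.9, max_gap=1.0):
    """desc_frames: list of (timestamp, image) from the description type video.
    retrieve_similar(image) is an assumed retrieval function returning
    (source_timestamp, similarity) of the most similar source-video frame.
    Adjacent description frames whose similar source frames are close in time
    (within max_gap seconds) are merged into one first description type video
    time interval, and their similar source frames into the source video
    time interval. Returns a list of (desc_interval, source_interval) pairs."""
    pairs = []
    for t, img in desc_frames:
        src_t, sim = retrieve_similar(img)
        if sim >= sim_threshold:
            pairs.append((t, src_t))

    correspondences = []
    if not pairs:
        return correspondences
    desc_times, src_times = [pairs[0][0]], [pairs[0][1]]
    for (t, src_t), (_, prev_src_t) in zip(pairs[1:], pairs[:-1]):
        if abs(src_t - prev_src_t) <= max_gap:  # preset similar condition (assumed form)
            desc_times.append(t)
            src_times.append(src_t)
        else:
            correspondences.append(((min(desc_times), max(desc_times)),
                                    (min(src_times), max(src_times))))
            desc_times, src_times = [t], [src_t]
    correspondences.append(((min(desc_times), max(desc_times)),
                            (min(src_times), max(src_times))))
    return correspondences

# Toy usage with a stubbed retrieval function.
def fake_retrieve(image):
    return float(image), 0.95  # pretend each description frame maps to a source time

desc_frames = [(0.0, 100), (0.5, 100), (1.0, 200)]
print(build_interval_correspondence(desc_frames, fake_retrieve))
```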
In a possible implementation, the sixth determining submodule is specifically configured to: performing text recognition on each frame image in a plurality of frame images in the description video by using a text recognition algorithm to obtain a text recognition result corresponding to the frame image; if the text recognition results corresponding to the adjacent frame images in the description video meet preset matching conditions, unifying the text recognition results corresponding to the adjacent frame images into a target text; merging the time of the adjacent frame images to obtain a second description video time interval; and establishing a corresponding relation between the second description type video time interval and the target text.
In the above embodiment, the text recognition results corresponding to the adjacent frame images in the description-like video are unified into the target text according to the preset matching condition, that is, if the character strings in the text recognition results corresponding to the adjacent frame images are different, the occurrence times of the different character strings are counted, and the text recognition result corresponding to the character string with the largest occurrence time is used as the text (target text) corresponding to the adjacent frame image. Merging the time of the adjacent frame images, namely merging the time of the adjacent frame images corresponding to the text recognition result meeting the preset matching condition appearing in the description video to obtain a second description video time interval; and acquiring the corresponding relation between the time interval of the second description type video and the target text based on the obtained time interval of the second description type video and the corresponding target text.
In a possible implementation, the seventh determining sub-module is specifically configured to: for each time interval in at least one time interval included in the second description type video time interval, calculating the coincidence degree of the time interval and each time interval included in the first description type video time interval, wherein the coincidence degree is the ratio of the intersection time interval length to the union time interval length; and if the coincidence degree is not less than a preset coincidence degree threshold value, establishing a corresponding relation between the source video time interval and the text according to the corresponding relation between the first description type video time interval and the source video time interval and the corresponding relation between the second description type video time interval and the text.
In the above embodiment, by calculating the coincidence degree of the first description type video time interval and the second description type video time interval, if the coincidence degree is not less than the preset coincidence degree threshold, the first description type video time interval and the second description type video time interval are considered to be substantially corresponding, and the source video time interval and the target text corresponding to the first description type video time interval and the second description type video time interval are respectively corresponding to each other, so that the acquisition of the corresponding relationship between the source video time interval and the target text is realized.
In a possible implementation manner, the step of satisfying a preset similarity condition by similar frame images corresponding to adjacent frame images in the description-type video includes: similar frame images corresponding to adjacent frame images in the description type video have the same frame image, or the time difference of the similar frame images corresponding to the adjacent frame images in the description type video is smaller than a preset time difference value.
In a possible implementation manner, the step of enabling the text recognition results corresponding to the adjacent frame images in the description-type video to satisfy the preset matching condition includes: the text recognition results corresponding to the adjacent frame images in the description video are the same, or the matching degree between the text recognition results corresponding to the adjacent frame images in the description video is not greater than the preset matching degree.
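For illustration, the two preset conditions described in the preceding two paragraphs might be expressed as simple predicates, as in the Python sketch below. The patent does not define how the matching degree between two text recognition results is measured; the sketch assumes a normalized character-level difference, so a smaller value means the results are more alike, which is consistent with setting the preset matching degree to a small value. The input representations and thresholds are assumptions.

```python
def similar_condition(similar_a, similar_b, max_time_diff=0.5):
    """Preset similar condition: the similar source frames of two adjacent
    description frames are the same frame, or their time difference is smaller
    than a preset value. Inputs are assumed (source_timestamp, frame_id) tuples."""
    same_frame = similar_a[1] == similar_b[1]
    return same_frame or abs(similar_a[0] - similar_b[0]) < max_time_diff

def matching_condition(text_a, text_b, max_matching_degree=0.2):
    """Preset matching condition: the OCR results are identical, or their matching
    degree is not greater than a preset matching degree. The metric is assumed to
    be a normalized character-level difference (smaller = more similar)."""
    if text_a == text_b:
        return True
    diff = sum(ca != cb for ca, cb in zip(text_a, text_b)) + abs(len(text_a) - len(text_b))
    degree = diff / max(len(text_a), len(text_b), 1)
    return degree <= max_matching_degree
```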
In a possible implementation, the sixth determining submodule is specifically configured to:
if the character strings in the text recognition results corresponding to the adjacent frame images are different, counting the occurrence times of the different character strings, and taking the text recognition result corresponding to the character string with the largest occurrence time as the text corresponding to the adjacent frame image.
In the embodiment, by counting the occurrence times of different character strings, the unification of the text recognition results corresponding to the adjacent frame images in the description video is realized, and the text corresponding to the adjacent frame images is obtained.
The embodiment of the present application further provides an electronic device, as shown in fig. 10, which includes a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 complete mutual communication through the communication bus 1004,
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the method steps of the model training data sample obtaining method provided in any one of the embodiments when executing the program stored in the memory 1003.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the model training data sample obtaining method described in any of the above embodiments.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the model training data sample acquisition method described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that includes one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.
Claims (14)
1. A model training data sample acquisition method, characterized in that the method comprises:
acquiring a source video and a description video corresponding to the source video, wherein the description video comprises an image and a text for describing the source video;
obtaining a matching result of the text and the source video clip based on the video clip of the description video and the source video which are matched with each other and the text of the description video;
and generating a model training data sample corresponding to the matching result.
2. The method according to claim 1, wherein obtaining a matching result of the text and the source video segment based on the video segment of the description-type video and the text of the description-type video, where the description-type video and the source video match each other, comprises:
determining matching segments of the source video and the description video based on similarity of frame images in the source video and frame images in the description video, wherein the matching segments comprise a first segment and a second segment which are matched with each other, and the first segment is a video segment in the description video; the second segment is a video segment in the source video;
performing text recognition on the frame image in the first segment to obtain a text corresponding to the first segment;
and matching the text with the second fragment to obtain a matching result.
3. The method according to claim 1, wherein obtaining a matching result of the text and the source video segment based on the video segment of the description-type video and the text of the description-type video, where the description-type video and the source video match each other, comprises:
determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video;
determining a time interval of the description video and the source video which are matched with each other based on the similarity of the frame images in the source video and the frame images in the description video, and obtaining a corresponding relation between the time interval of the description video and the time interval of the source video;
determining the corresponding relation between the source video time interval and the text in the description type video according to the corresponding relation between the description type video time interval and the source video time interval and the corresponding relation between the description type video time interval and the text in the description type video;
and acquiring a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
4. The method according to claim 1, wherein obtaining a matching result of the text and the source video segment based on the video segment of the description-type video and the text of the description-type video, where the description-type video and the source video match each other, comprises:
determining a time interval in which the description video and the source video are matched with each other based on the similarity of the frame images in the source video and the frame images in the description video, and obtaining a corresponding relation between the time interval of the description video and the time interval of the source video;
determining the corresponding relation between the description video time interval and the text in the description video based on the text of the frame image in the description video;
determining the corresponding relation between the source video time interval and the text in the description type video according to the corresponding relation between the description type video time interval and the source video time interval and the corresponding relation between the description type video time interval and the text in the description type video;
and acquiring a source video clip corresponding to the source video time interval from the source video to obtain a matching result of the source video clip and the text.
5. The method of claim 1, wherein generating the model training data sample corresponding to the matching result comprises:
generating model training data corresponding to one frame of image in the source video clip and model training data corresponding to the text matched with the one frame of image to form a model training data sample; or
and combining the multi-frame images in the source video clip based on one frame of image in the source video clip to generate model training data corresponding to the multi-frame images in the source video clip and model training data corresponding to the text matched with the multi-frame images to form a model training data sample.
6. The method according to claim 4, wherein the determining, based on the similarity between the frame images in the source video and the frame images in the description-type video, the time interval during which the description-type video and the source video match each other and obtaining the correspondence between the description-type video time interval and the source video time interval comprises:
for each frame image in a plurality of frame images in the description type video, retrieving a similar frame image with the similarity of the frame image not less than a preset similarity threshold from the source video;
if the similar frame images corresponding to the adjacent frame images in the description video meet preset similar conditions, merging the time of the adjacent frame images to obtain a first description video time interval, and merging the time of the similar frame images to obtain a source video time interval, wherein the adjacent frame images are temporally adjacent frame images in the description video;
and establishing a corresponding relation between the first description type video time interval and the source video time interval.
7. The method according to claim 4, wherein the determining the correspondence between the description-type video time interval and the text in the description-type video based on the text of the frame image in the description-type video comprises:
performing text recognition on each frame image in a plurality of frame images in the description video by using a text recognition algorithm to obtain a text recognition result corresponding to the frame image;
if the text recognition results corresponding to the adjacent frame images in the description video meet preset matching conditions, unifying the text recognition results corresponding to the adjacent frame images into a target text;
merging the time of the adjacent frame images to obtain a second description video time interval;
and establishing a corresponding relation between the second description type video time interval and the target text.
8. The method of claim 4, wherein determining the correspondence between the source video time interval and the text in the description type video according to the correspondence between the description type video time interval and the source video time interval and the correspondence between the description type video time interval and the text in the description type video comprises:
for each time interval in at least one time interval included in the second description type video time interval, calculating the coincidence degree of the time interval and each time interval included in the first description type video time interval, wherein the coincidence degree is the ratio of the intersection time interval length to the union time interval length;
and if the coincidence degree is not less than a preset coincidence degree threshold value, establishing the corresponding relation between the source video time interval and the text according to the corresponding relation between the first description type video time interval and the source video time interval and the corresponding relation between the second description type video time interval and the text.
9. The method of claim 6, wherein the step of satisfying the preset similarity condition for the similar frame images corresponding to the adjacent frame images in the description-type video comprises: similar frame images corresponding to adjacent frame images in the description type video have the same frame image, or the time difference of the similar frame images corresponding to the adjacent frame images in the description type video is smaller than a preset time difference value.
10. The method of claim 7, wherein the step of determining that the text recognition results corresponding to the adjacent frames of images in the description-type video satisfy a predetermined matching condition comprises: the text recognition results corresponding to the adjacent frame images in the description video are the same, or the matching degree between the text recognition results corresponding to the adjacent frame images in the description video is not greater than the preset matching degree.
11. The method according to claim 7, wherein unifying the text recognition results corresponding to the adjacent frame images into the target text comprises:
if the character strings in the text recognition results corresponding to the adjacent frame images are different, counting the occurrence frequency of the different character strings, and taking the text recognition result corresponding to the character string with the largest occurrence frequency as the text corresponding to the adjacent frame images.
12. A model training data sample acquisition apparatus, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a source video and a description video corresponding to the source video, and the description video comprises an image and a text for describing the source video;
the matching module is used for obtaining a matching result of the text and the source video clip based on the video clip of the description video and the source video which are matched with each other and the text of the description video;
and the generating module is used for generating a model training data sample corresponding to the matching result.
13. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 11 when executing a program stored in the memory.
14. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-11.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211530092.5A CN115937628A (en) | 2022-11-30 | 2022-11-30 | Model training data sample acquisition method and device, electronic equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115937628A true CN115937628A (en) | 2023-04-07 |
Family
ID=86648564
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211530092.5A Pending CN115937628A (en) | 2022-11-30 | 2022-11-30 | Model training data sample acquisition method and device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115937628A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||