CN114140728B - Text type classification method, device, electronic device and storage medium
Abstract
The embodiment of the invention provides a text type classification method and device. The method includes: obtaining a text mask image to be processed; determining, for each text region, a non-mask region of the text region; dividing each text region into a first text mask region and a second text mask region based on each non-mask region; determining the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region; and determining the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region, and the text mask image. The embodiment of the invention classifies the text types within a text region, solves the technical problem that existing text filtering cannot distinguish line text from name text, and achieves the effect of distinguishing each text type after text filtering.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular, to a text type classification method and apparatus, an electronic device, and a computer readable storage medium.
Background
With the rapid development of the internet and computer technology, a large amount of video content is available on traditional television stations and the internet. To help users understand video content, corresponding text is typically provided for it, for example the line text, name text, and non-line text presented in a variety-show video.
When a large amount of text appears in video content, different texts may interfere with one another. For example, non-line text in a variety-show video may interfere with the line text and the name text. Text in video content therefore needs to be filtered so that the more valuable text is revealed. After filtering, the line text and the name text are preserved, but the related art cannot distinguish between them.
Disclosure of Invention
An object of an embodiment of the present invention is to provide a text type classification method and apparatus, an electronic device, and a computer-readable storage medium that solve the problem of classifying text types. The specific technical scheme is as follows:
In a first aspect of the present invention, a text type classification method is provided, including: obtaining a text mask image to be processed, where the text mask image includes at least one text region; determining, for each text region, a non-mask region of the text region, and dividing each text region into a first text mask region and a second text mask region based on each non-mask region; determining, for each text region, the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region; and determining, for each text region, the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region, and the text mask image.
Optionally, determining the non-mask region of each text region includes: obtaining the pixel value of each pixel point in each text region, and taking a region whose pixel values meet a preset pixel condition as the non-mask region of the text region.
Optionally, taking a region whose pixel values meet the preset pixel condition as the non-mask region of the text region includes: taking a region where the average value of the pixel values is smaller than a preset pixel threshold as the non-mask region.
Optionally, determining the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region includes: obtaining a first abscissa of the non-mask region and a second abscissa of a right boundary of the first text mask region; and classifying the first text mask region as the first text type if the value of the second abscissa is smaller than the value of the first abscissa, or classifying the first text mask region as the first text type if the value of the second abscissa is larger than the value of the first abscissa and the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is smaller than a preset value.
Optionally, determining the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region, and the text mask image includes: obtaining a first abscissa of the non-mask region, a third abscissa of a left boundary of the second text mask region, and a fourth abscissa of a left boundary of the text mask image; and classifying the second text mask region as the second text type if the value of the third abscissa is larger than the value of the first abscissa, or classifying the second text mask region as the second text type if the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is larger than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
Optionally, the method further includes: acquiring a first abscissa of the non-mask region, a second abscissa of a right boundary of the first text mask region, a third abscissa of a left boundary of the second text mask region, and a fourth abscissa of a left boundary of the text mask image; and if the first abscissa, the second abscissa, the third abscissa, and the fourth abscissa meet at least one of preset comparison conditions, classifying the first text mask region as the first text type and classifying the second text mask region as the second text type, where the preset comparison conditions include: the value of the second abscissa is greater than or equal to the value of the first abscissa; the value of the second abscissa is less than or equal to the value of the first abscissa; the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is greater than or equal to a preset value; the value of the third abscissa is less than or equal to the value of the first abscissa; and the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is less than or equal to the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
Optionally, obtaining the text mask image to be processed includes: accumulating the pixel values of the pixel points in the frame images to which each text region belongs to obtain a pixel value accumulation result, and performing normalization and binarization on the pixel value accumulation result to obtain the text mask image.
In a second aspect of the present invention, a text type classification device is further provided, including: an image acquisition module configured to acquire a text mask image to be processed, where the text mask image includes at least one text region; a region segmentation module configured to determine, for each text region, a non-mask region of the text region, and to segment each text region into a first text mask region and a second text mask region based on each non-mask region; and a type classification module configured to determine, for each text region, the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region, and to determine, for each text region, the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region, and the text mask image.
Optionally, the region segmentation module includes a pixel value acquisition module configured to obtain the pixel value of each pixel point in each text region, and a non-mask region determination module configured to take a region whose pixel values meet a preset pixel condition as the non-mask region of the text region.
Optionally, the non-mask region determination module is configured to take a region where the average value of the pixel values is smaller than a preset pixel threshold as the non-mask region.
Optionally, the type classification module includes a coordinate acquisition module configured to obtain a first abscissa of the non-mask region and a second abscissa of a right boundary of the first text mask region, and a text type classification module configured to classify the first text mask region as the first text type if the value of the second abscissa is smaller than the value of the first abscissa, or to classify the first text mask region as the first text type if the value of the second abscissa is larger than the value of the first abscissa and the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is smaller than a preset value.
Optionally, the coordinate acquisition module is further configured to obtain a third abscissa of a left boundary of the second text mask region and a fourth abscissa of a left boundary of the text mask image, and the text type classification module is further configured to classify the second text mask region as the second text type if the value of the third abscissa is greater than the value of the first abscissa, or to classify the second text mask region as the second text type if the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is greater than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
Optionally, the text type classification module is further configured to classify the first text mask region as the first text type and classify the second text mask region as the second text type if the first abscissa, the second abscissa, the third abscissa, and the fourth abscissa meet at least one of preset comparison conditions, where the preset comparison conditions include: the value of the second abscissa is greater than or equal to the value of the first abscissa; the value of the second abscissa is less than or equal to the value of the first abscissa; the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is greater than or equal to a preset value; the value of the third abscissa is less than or equal to the value of the first abscissa; and the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is less than or equal to the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
Optionally, the image acquisition module includes a pixel accumulation module configured to accumulate the pixel values of the pixel points in the frame images to which each text region belongs to obtain a pixel value accumulation result, and a preprocessing module configured to perform normalization and binarization on the pixel value accumulation result to obtain the text mask image.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform any of the above-described text type classification methods.
In yet another aspect of the invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of classifying text types as described in any of the above.
According to the text type classification scheme provided by the embodiment of the present invention, for the text regions in the text mask image, the non-mask region of each text region is determined, and each text region is divided into a first text mask region and a second text mask region based on each non-mask region. Then, for each text region, the text type of the first text mask region is determined based on the positional relationship between the non-mask region and the first text mask region, and the text type of the second text mask region is determined based on the positional relationship among the non-mask region, the second text mask region, and the text mask image. In other words, when a text region is divided into a first text mask region and a second text mask region, the text types of the two regions are determined from the positional relationships among the non-mask region, the first text mask region, the second text mask region, and the text mask image. This classifies the text types within a text region, solves the technical problem that existing text filtering cannot distinguish whether the remaining text is line text or name text, and achieves the effect that each text type can be distinguished after text filtering.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart illustrating steps of a text type classification method according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a text region filtering method according to an embodiment of the present invention.
Fig. 3 is a flowchart illustrating a step of a method for classifying a name text type and a line text type in a variety video data according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a text type classification device according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Fig. 6 is a schematic workflow diagram of a system for recognizing line subtitles in an audio/video file according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
The embodiment of the invention provides a text type classification scheme that is mainly used for distinguishing the name region from the line region within a text region. First, it is determined whether a dividing column exists in the text region; if so, the text region may contain name text, and a dividing line needs to be further identified from the text region. The name region and the line region in the text region are then classified according to the positional relationships among the dividing line, the boundaries of the text region, and the boundaries of the text mask image.
Referring to fig. 1, a flowchart illustrating steps of a text type classification method according to an embodiment of the present invention is shown. The text type classification method can be applied to a terminal or a server. The text type classification method may specifically include the following steps.
Step 101, acquiring a text mask image to be processed.
In an embodiment of the present invention, the text mask image may be an image obtained after filtering the text regions of a text detection image. The text detection image may be derived from video data or audio-video data; in practice, it may be a frame image of the video or audio-video data. The text detection image contains at least one text region, where a text region can be understood as the location of text content within a frame image. A text region of the text detection image is a text region before filtering, i.e. it may contain line text content, name text content, non-line text content, and so on. A text region of the text mask image is a text region after filtering, i.e. it may contain line text content and/or name text content, but no non-line text content.
Step 102, determining, for each text region, the non-mask region of the text region, and dividing each text region into a first text mask region and a second text mask region based on each non-mask region.
In the embodiment of the present invention, when a text region contains a dividing column, the text region may be considered to contain a name region. The dividing column of a text region can be understood as the boundary between different types of text content in the text region. In one embodiment, the dividing column may be represented by an abscissa, i.e. the dividing column approximately locates the column boundary between different types of text content. If a dividing column exists in the text region, name text content may be present in the text region; a dividing line then needs to be further identified, and the dividing line is used to judge whether the text region contains name text content and/or line text content.
In the embodiment of the present invention, the non-mask region may be, or may contain, the dividing line. The text region is divided into two parts, a first text mask region and a second text mask region, with the non-mask region as the dividing boundary. The text types of the first text mask region and the second text mask region are then determined respectively.
Step 103, determining, for each text region, the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region.
In the embodiment of the present invention, the positional relationship between the non-mask region and the first text mask region can be determined from the boundaries of the non-mask region and the boundaries of the first text mask region, and the text type of the first text mask region can then be determined from that positional relationship. In practical applications, the first text mask region may be classified as a name text region or a line text region.
Step 104, determining, for each text region, the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region, and the text mask image.
In the embodiment of the present invention, the positional relationship among the non-mask region, the second text mask region, and the text mask image can be determined from the boundaries of the non-mask region, the boundaries of the second text mask region, and the boundaries of the text mask image, and the text type of the second text mask region can then be determined from that positional relationship. In practical applications, the second text mask region may be classified as a name text region or a line text region.
Step 103 and step 104 may be executed in parallel or sequentially; in the sequential case, step 103 may be executed before step 104, or step 104 before step 103.
According to the text type classification scheme provided by the embodiment of the present invention, for the text regions in the text mask image, the non-mask region of each text region is determined, and each text region is divided into a first text mask region and a second text mask region based on each non-mask region. Then, for each text region, the text type of the first text mask region is determined based on the positional relationship between the non-mask region and the first text mask region, and the text type of the second text mask region is determined based on the positional relationship among the non-mask region, the second text mask region, and the text mask image. In other words, when a text region is divided into a first text mask region and a second text mask region, the text types of the two regions are determined from the positional relationships among the non-mask region, the first text mask region, the second text mask region, and the text mask image. This classifies the text types within a text region, solves the technical problem that existing text filtering cannot distinguish whether the remaining text is line text or name text, and achieves the effect that each text type can be distinguished after text filtering.
In an exemplary embodiment of the present invention, one implementation of determining the non-mask region of each text region is to obtain the pixel value of each pixel point in the text region and take a region whose pixel values meet a preset pixel condition as the non-mask region of the text region. In practical applications, a region where the average value of the pixel values is smaller than a preset pixel threshold may be used as the non-mask region. For example, if the text region T1 contains a region Tmz whose average pixel value is smaller than the preset pixel threshold, the region Tmz is taken as the non-mask region, and the middle column Tmzl of the region Tmz may further be taken as the dividing line. The preset pixel threshold may be set to 255; the embodiment of the present invention does not specifically limit the numerical value, unit, and the like of the preset pixel threshold.
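By way of illustration, the following is a minimal Python sketch of this step, assuming numpy and a 2-D array of pixel values for one text region of the binarized mask image; the function name and the treatment of the low-mean columns as a single contiguous band are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def find_non_mask_region(text_region, pixel_threshold=255.0):
    """Locate the non-mask band of one text region of the text mask image.

    text_region: 2-D numpy array of pixel values (mask pixels are 255).
    Returns (start_col, end_col, dividing_col), or None when no column
    has a mean below the threshold (i.e. no dividing column exists).
    """
    col_means = text_region.mean(axis=0)             # mean pixel value per column
    low_cols = np.flatnonzero(col_means < pixel_threshold)
    if low_cols.size == 0:
        return None                                  # no region Tmz found
    start, end = int(low_cols[0]), int(low_cols[-1])
    dividing_col = (start + end) // 2                # middle column Tmzl as dividing line
    return start, end, dividing_col
```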
In an exemplary embodiment of the present invention, an implementation of determining the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region is to obtain a first abscissa of the non-mask region and a second abscissa of a right boundary of the first text mask region, and if a value of the second abscissa is smaller than a value of the first abscissa, classify the first text mask region as the first text type. Or if the value of the second abscissa is greater than the value of the first abscissa and the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is less than the preset value, classifying the first text mask area as the first text type. That is, the positional relationship between the non-mask region and the right boundary of the first text mask region is determined based on the first abscissa of the non-mask region and the second abscissa of the right boundary of the first text mask region. When the value of the second abscissa is smaller than the value of the first abscissa, it means that the right boundary of the first text mask area is located at the left side of the non-mask area, and the first text mask area may be classified as the first text type. When the value of the second abscissa is greater than the value of the first abscissa, it means that the right boundary of the first text mask area is located on the right side of the non-mask area. Moreover, the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is smaller than the preset value, which means that the distance between the non-mask area and the right boundary of the first text mask area is smaller than the preset value. When the right boundary of the first text mask region is located at the right side of the non-mask region and a distance between the non-mask region and the right boundary of the first text mask region is less than a preset value, the first text mask region may be classified as the first text type. The preset value may be set to 10 pixel units, and the embodiment of the present invention does not specifically limit the preset value.
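A sketch of this first-region rule in Python follows; the function name and parameter names are illustrative, and the preset value of 10 pixel units is the example value given above:

```python
def is_first_text_type(x1, x2, preset=10):
    """First-region rule: x1 is the first abscissa (non-mask region /
    dividing line), x2 the second abscissa (right boundary of the first
    text mask region)."""
    if x2 < x1:                      # right boundary left of the non-mask region
        return True
    # Right boundary to the right, but within `preset` pixel units of it.
    return x2 > x1 and abs(x1 - x2) < preset
```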
In an exemplary embodiment of the present invention, one implementation of determining the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region, and the text mask image is to acquire a first abscissa of the non-mask region, a third abscissa of a left boundary of the second text mask region, and a fourth abscissa of a left boundary of the text mask image, and if a value of the third abscissa is greater than a value of the first abscissa, classify the second text mask region as the second text type. Or classifying the second text mask region as a second text type if the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is greater than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa. That is, the positional relationship between the non-mask region, the left boundary of the second text mask region, and the left boundary of the text mask image is determined based on the first abscissa of the non-mask region, the third abscissa of the left boundary of the second text mask region, and the fourth abscissa of the left boundary of the text mask image. When the value of the third abscissa is greater than the value of the first abscissa, it indicates that the left boundary of the second text mask area is located at the right side of the non-mask area, and the second text mask area may be classified as the second text type. When the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is greater than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa, the distance from the left boundary of the second text mask area to the left boundary of the text mask image is represented to be greater than the distance from the left boundary of the second text mask area to the non-mask area, at which time the second text mask area may be classified as the second text type.
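The second-region rule can be sketched the same way; again, the names are illustrative:

```python
def is_second_text_type(x1, x3, x4):
    """Second-region rule: x1 is the first abscissa (non-mask region),
    x3 the third abscissa (left boundary of the second text mask region),
    x4 the fourth abscissa (left boundary of the text mask image)."""
    if x3 > x1:                      # left boundary right of the non-mask region
        return True
    # Farther from the image's left boundary than from the non-mask region.
    return abs(x3 - x4) > abs(x3 - x1)
```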
In an exemplary embodiment of the present invention, in addition to the above two embodiments of determining the text types of the first text mask region and the second text mask region, a first abscissa of the non-mask region, a second abscissa of the right boundary of the first text mask region, a third abscissa of the left boundary of the second text mask region, and a fourth abscissa of the left boundary of the text mask image may be acquired; if the first abscissa, the second abscissa, the third abscissa, and the fourth abscissa satisfy at least one of the preset comparison conditions, the first text mask region is classified as the first text type and the second text mask region is classified as the second text type. The preset comparison conditions include: the value of the second abscissa is greater than or equal to the value of the first abscissa; the value of the second abscissa is less than or equal to the value of the first abscissa; the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is greater than or equal to the preset value; the value of the third abscissa is less than or equal to the value of the first abscissa; and the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is less than or equal to the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
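Read literally, the first two listed conditions together cover every possible value of the second abscissa, so "at least one condition holds" would always be true. The sketch below therefore interprets the list as the negated atoms of the two positive rules above, so that the fallback fires exactly when one of those rules fails; this reading is an editorial assumption:

```python
def fallback_applies(x1, x2, x3, x4, preset=10):
    """True when at least one preset comparison condition holds, read as:
    the first-region rule or the second-region rule above is not met."""
    first_rule = x2 < x1 or (x2 > x1 and abs(x1 - x2) < preset)
    second_rule = x3 > x1 or abs(x3 - x4) > abs(x3 - x1)
    return not (first_rule and second_rule)
```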
In an exemplary embodiment of the present invention, one implementation of obtaining the text mask image to be processed is to accumulate the pixel values of the pixel points in the frame images to which each text region belongs to obtain a pixel value accumulation result, and then preprocess the accumulation result to obtain the text mask image. The preprocessing may include normalization, binarization, and the like; the embodiment of the present invention does not specifically limit the processing conditions, contents, steps, or results of the preprocessing.
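A sketch of this preprocessing with numpy and OpenCV follows; the normalization by the maximum and the 0.5 binarization threshold are assumed details, since the patent fixes neither:

```python
import cv2
import numpy as np

def build_text_mask(frames):
    """Accumulate pixel values over the frame images a text region belongs
    to, then normalize and binarize the accumulation result."""
    acc = np.zeros(frames[0].shape[:2], dtype=np.float64)
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if frame.ndim == 3 else frame
        acc += gray.astype(np.float64)                    # pixel value accumulation
    acc /= acc.max() if acc.max() > 0 else 1.0            # normalization to [0, 1]
    return np.where(acc > 0.5, 255, 0).astype(np.uint8)   # binarization
```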
It should be noted that the first text mask region may be located in the left portion of the text region and the second text mask region in the right portion. The first text type may be the name text type and the second text type may be the line text type. The text mask image is obtained after filtering the text regions of the text detection image; the filtering of text regions in the text detection image is described with reference to Fig. 2, which shows a flowchart of the steps of a text region filtering method according to an embodiment of the present invention. The text filtering method may specifically include the following steps.
Step 201, a frame image to be identified is acquired.
In an embodiment of the invention, the frame image may be derived from video data or audio-video data. In practical applications, the frame image may be a frame image of video data or audio-video data. Also, a text region may be included in the frame image. The text region can be understood as a region in which text content in one frame of image is located. In one embodiment, a frame image may include a text region or a plurality of text regions.
Step 202, extracting the feature value of each pixel point in the frame image.
In the embodiment of the present invention, the frame image can be input into a pre-trained network model, and the feature value of each pixel point in the frame image is then extracted using the network model.
Step 203, calculating the gradient of each pixel point according to the feature values, and identifying, from the frame image, the frame image positions corresponding to the gradient peaks among the gradients.
In the embodiment of the invention, the gradient of each pixel point in the positive direction and the gradient of each pixel point in the negative direction can be calculated according to the characteristic value of each pixel point, so that the gradient peak value in each direction is determined by the gradient in the positive direction and the gradient in the negative direction, and then the frame image position corresponding to the gradient peak value is identified from the frame image.
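One way to realize this, sketched in Python, is to project the feature values onto rows and take the signed gradient of the projection; the projection is a simplification of the per-pixel gradients described above:

```python
import numpy as np

def find_row_gradient_peaks(features):
    """features: 2-D array of per-pixel feature values (e.g. gray values).
    Returns the rows of the strongest positive and negative gradients,
    read as candidate upper and lower boundaries of horizontal text."""
    row_profile = features.sum(axis=1)       # one feature value per row
    grad = np.gradient(row_profile)          # signed gradient down the rows
    upper = int(np.argmax(grad))             # positive gradient peak
    lower = int(np.argmin(grad))             # negative gradient peak
    return upper, lower
```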
Step 204, counting the number of pixels corresponding to the gradient peak value in the frame image position.
In the embodiment of the present invention, the count is taken within a preset surrounding area centered on the pixel point corresponding to the gradient peak at the frame image position.
Step 205, screening the segmentation position of the text region from the frame image positions according to the number of pixel points.
In the embodiment of the invention, the segmentation position can be screened from the frame image position according to the number of the pixel points, the preset threshold value and the position relation between the frame image position and the central axis of the text region.
Step 206, segmenting the text region along the segmentation position.
In the embodiment of the present invention, the text region may be divided left and right or up and down along the dividing position, that is, one text region may be divided into left and right text regions or up and down text regions along the dividing position.
In an exemplary embodiment of the present invention, one implementation of screening the segmentation position of a text region from the frame image positions according to the number of pixel points is to judge whether there are at least two frame image positions whose pixel counts are greater than a preset threshold and which lie on the same side of the central axis of the text region; if so, the frame image position closest to the central axis among them is taken as the segmentation position. For example, suppose there are frame image positions U1, U2, U3, U4, and U5, the pixel counts of U1 and U2 are greater than the preset threshold, and U1 and U2 lie on the same side of the central axis of the text region. Of U1 and U2, U1 is closer to the central axis, so U1 is taken as the segmentation position of the text region. The preset threshold may be a preset empirical value; alternatively, after the pixel counts of the frame image positions are sorted, the preset threshold may be the difference between the pixel count of the current frame image position and the pixel count of the first-ranked position or of the previous position.
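A sketch of this screening step follows; the names are illustrative, and the order in which the two sides of the axis are checked when both qualify is an assumption:

```python
def pick_segmentation_position(positions, counts, axis_x, threshold):
    """positions: candidate frame image positions (abscissas); counts:
    the pixel count at each candidate; axis_x: the abscissa of the text
    region's central axis; threshold: the preset threshold."""
    strong = [p for p, c in zip(positions, counts) if c > threshold]
    for on_side in (lambda p: p < axis_x, lambda p: p > axis_x):
        same_side = [p for p in strong if on_side(p)]
        if len(same_side) >= 2:              # at least two on the same side
            return min(same_side, key=lambda p: abs(p - axis_x))
    return None                              # no segmentation position found
```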
In an exemplary embodiment of the present invention, the frame image may contain at least one text region, and the text presentation direction in a text region is generally horizontal or vertical. Accordingly, one implementation of identifying the frame image positions corresponding to the gradient peaks is, for each text region whose text presentation direction is horizontal, to identify at least one row position corresponding to a gradient peak; the row position may be a horizontal row or a row at an angle to the horizontal. Another implementation is, for each text region whose text presentation direction is vertical, to identify at least one column position corresponding to a gradient peak; the column position may be a vertical column or a column at an angle to the vertical.
In an exemplary embodiment of the present invention, since the gradient is a vector, it includes a gradient in the positive direction and a gradient in the negative direction. In practical applications, if the text presentation direction in the text region is horizontal, one implementation of identifying at least one row position corresponding to a gradient peak is to identify the upper boundary of the text region when the gradient peak is a positive gradient peak, and the lower boundary when the gradient peak is a negative gradient peak. If the text presentation direction in the text region is vertical, one implementation of identifying at least one column position corresponding to a gradient peak is to identify the right boundary of the text region when the gradient peak is a positive gradient peak, and the left boundary when the gradient peak is a negative gradient peak.
In an exemplary embodiment of the present invention, one implementation of counting the number of pixel points corresponding to a gradient peak at a frame image position is to count the pixel points within a preset surrounding area of the pixel point corresponding to the gradient peak. As described above, the frame image position may be the upper, lower, left, or right boundary of the text region. For example, let the frame image position U be the upper boundary of the text region, and let d be the pixel point corresponding to the gradient peak at U. The pixel points within the preset surrounding area [d_xy - dis, d_xy + dis] of d are counted, giving the pixel count UX, where d_xy denotes the abscissa and ordinate of the pixel point d, and dis denotes a preset coordinate threshold.
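As a sketch, with the surrounding area taken as a square window of half-width dis around the peak pixel (the exact window shape is not fixed by the text):

```python
import numpy as np

def count_window_pixels(mask, d_x, d_y, dis):
    """Count non-zero mask pixels in the preset surrounding area
    [d_x - dis, d_x + dis] x [d_y - dis, d_y + dis], clipped to the image."""
    h, w = mask.shape
    x0, x1 = max(0, d_x - dis), min(w, d_x + dis + 1)
    y0, y1 = max(0, d_y - dis), min(h, d_y + dis + 1)
    return int(np.count_nonzero(mask[y0:y1, x0:x1]))
```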
In an exemplary embodiment of the present invention, one implementation of extracting the feature value of each pixel point in the frame image is to extract the gray feature value of each pixel point in the frame image. In practical applications, the frame image may be a gray scale image or an RGB image. If the frame image is an RGB image, the RGB image can be further converted into a gray image, and gray characteristic values of all pixel points in the frame image are extracted.
Based on the above description of the embodiments of the text type classification method, a method for classifying the name text type and the line text type in variety-show video data is described below with reference to Fig. 3, which shows a flowchart of the steps of this method.
A text mask image is acquired. A text region in the text mask image is a filtered text region containing line text content, name text content, or both. It is first judged whether the text region contains a dividing column. If it does not, the text region is considered to contain no name text content and to contain line text content. If it does, a region whose average pixel value is smaller than 255 is further identified from the text region, the identified region is taken as the non-mask region, and the middle column of that region is taken as the dividing line. The text region is then divided into a first text mask region and a second text mask region based on the non-mask region or the dividing line. If the right boundary of the first text mask region is on the left side of the non-mask region or dividing line, the first text mask region is considered to contain name text content and is of the name text type. If the right boundary of the first text mask region is on the right side of the non-mask region or dividing line and its distance from the non-mask region is smaller than 10 pixel units, the first text mask region is likewise considered to contain name text content and is of the name text type. If the left boundary of the second text mask region is on the right side of the non-mask region or dividing line, the second text mask region is considered to contain line text content and is of the line text type. If the distance from the left boundary of the second text mask region to the left boundary of the text mask image is greater than its distance to the non-mask region or dividing line, the second text mask region is likewise considered to contain line text content and is of the line text type. In other cases, the text region can be considered to contain both line text content and name text content, and the first text mask region is then directly classified as the name text type and the second text mask region as the line text type.
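The decision rules of this flow can be exercised on hypothetical coordinates; all the numbers below are made up for illustration and are not from the patent:

```python
# One text region of a hypothetical 1280-pixel-wide text mask image:
x1 = 200     # first abscissa: dividing line of the non-mask region
x2 = 195     # second abscissa: right boundary of the first text mask region
x3 = 210     # third abscissa: left boundary of the second text mask region
x4 = 0       # fourth abscissa: left boundary of the text mask image
preset = 10  # distance threshold in pixel units

first_is_name = x2 < x1 or (x2 > x1 and abs(x1 - x2) < preset)
second_is_line = x3 > x1 or abs(x3 - x4) > abs(x3 - x1)
print(first_is_name, second_is_line)   # True True -> name text | line text
```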
As shown in Fig. 4, a schematic structural diagram of a text type classification device according to an embodiment of the present invention is shown. The text type classification device may include the following modules.
An image acquisition module 41, configured to acquire a text mask image to be processed, where the text mask image includes at least one text region;
A region segmentation module 42 for determining, for each of the text regions, a non-mask region for each of the text regions, and segmenting each of the text regions into a first text mask region and a second text mask region based on each of the non-mask regions;
A type classification module 43, configured to determine, for each of the text regions, a text type of the first text mask region based on a positional relationship between the non-mask region and the first text mask region;
the type classification module 43 is further configured to determine, for each text region, a text type of the second text mask region based on a positional relationship among the non-mask region, the second text mask region, and the text mask image.
In an exemplary embodiment of the present invention, the region segmentation module 42 includes:
The pixel value acquisition module is used for acquiring the pixel value of each pixel point in each text region;
and the non-mask area determining module is used for taking an area, of which the pixel value meets a preset pixel condition, as the corresponding non-mask area of the text area.
In an exemplary embodiment of the present invention, the non-mask area determining module is configured to take, as the non-mask area, an area where the average value of the pixel values is smaller than a preset pixel threshold value.
In an exemplary embodiment of the present invention, the type classification module 43 includes:
the coordinate acquisition module is used for acquiring a first abscissa of the non-mask region and a second abscissa of the right boundary of the first text mask region;
and the text type classification module is configured to classify the first text mask region as the first text type if the value of the second abscissa is smaller than the value of the first abscissa, or to classify the first text mask region as the first text type if the value of the second abscissa is larger than the value of the first abscissa and the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is smaller than a preset value.
In an exemplary embodiment of the present invention, the coordinate acquiring module is further configured to acquire a third abscissa of a left boundary of the second text mask area and a fourth abscissa of a left boundary of the text mask image;
the text type classification module is further configured to classify the second text mask area as the second text type if the value of the third abscissa is greater than the value of the first abscissa, or classify the second text mask area as the second text type if the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is greater than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
In an exemplary embodiment of the present invention, the text type classification module is further configured to classify the first text mask region as the first text type and classify the second text mask region as the second text type if the first abscissa, the second abscissa, the third abscissa, and the fourth abscissa meet at least one of preset comparison conditions;
the preset comparison conditions include: the value of the second abscissa is greater than or equal to the value of the first abscissa; the value of the second abscissa is less than or equal to the value of the first abscissa; the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is greater than or equal to a preset value; the value of the third abscissa is less than or equal to the value of the first abscissa; and the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is less than or equal to the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
In an exemplary embodiment of the present invention, the image acquisition module 41 includes:
The pixel accumulation module is used for accumulating the pixel values of all the pixel points in the frame image to which each text region belongs to obtain a pixel value accumulation result;
and the preprocessing module is configured to perform normalization and binarization on the pixel value accumulation result to obtain the text mask image.
The embodiment of the invention also provides an electronic device, as shown in fig. 5, which comprises a processor 51, a communication interface 52, a memory 53 and a communication bus 54, wherein the processor 51, the communication interface 52 and the memory 53 complete communication with each other through the communication bus 54,
A memory 53 for storing a computer program;
The processor 51 is configured to execute a program stored in the memory 53, and implement the following steps:
acquiring a text mask image to be processed, wherein the text mask image comprises at least one text region;
Determining a non-mask area of each text area for each text area, and dividing each text area into a first text mask area and a second text mask area based on each non-mask area;
Determining a text type of the first text mask region based on a positional relationship between the non-mask region and the first text mask region for each of the text regions;
For each text region, determining the text type of the second text mask region based on the positional relationship among the non-mask region, the second text mask region and the text mask image.
Determining the non-mask region of each text region includes the following steps:
acquiring the pixel value of each pixel point in each text region;
and taking a region whose pixel values meet the preset pixel condition as the non-mask region of the text region.
Taking the region whose pixel values meet the preset pixel condition as the non-mask region of the text region includes:
taking a region where the average value of the pixel values is smaller than a preset pixel threshold as the non-mask region.
The determining the text type of the first text mask area based on the position relation between the non-mask area and the first text mask area comprises the following steps:
Acquiring a first abscissa of the non-mask region and a second abscissa of a right boundary of the first text mask region;
and classifying the first text mask region as the first text type if the value of the second abscissa is smaller than the value of the first abscissa, or classifying the first text mask region as the first text type if the value of the second abscissa is larger than the value of the first abscissa and the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is smaller than a preset value.
The determining the text type of the second text mask area based on the positional relationship among the non-mask area, the second text mask area and the text mask image includes:
Acquiring a first abscissa of the non-mask region, a third abscissa of a left boundary of the second text mask region and a fourth abscissa of a left boundary of the text mask image;
The second text mask area is classified as the second text type if the value of the third abscissa is greater than the value of the first abscissa, or the second text mask area is classified as the second text type if the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is greater than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
The method further includes: acquiring a first abscissa of the non-mask region, a second abscissa of a right boundary of the first text mask region, a third abscissa of a left boundary of the second text mask region, and a fourth abscissa of a left boundary of the text mask image;
if the first abscissa, the second abscissa, the third abscissa, and the fourth abscissa meet at least one of preset comparison conditions, classifying the first text mask region as the first text type and classifying the second text mask region as the second text type;
where the preset comparison conditions include: the value of the second abscissa is greater than or equal to the value of the first abscissa; the value of the second abscissa is less than or equal to the value of the first abscissa; the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is greater than or equal to a preset value; the value of the third abscissa is less than or equal to the value of the first abscissa; and the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is less than or equal to the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
The obtaining the text mask image to be processed comprises the following steps:
Accumulating the pixel values of all the pixel points in the frame image to which each text region belongs to obtain a pixel value accumulation result;
and carrying out normalization processing and binarization processing on the pixel value accumulation result to obtain the text mask image.
Based on the above description of the embodiments of the text type classification method and device, a system for recognizing line subtitles in audio/video files is described below. The recognition system can be built from a personal computer or from hardware devices such as a server or a server cluster. The recognition system carries a line subtitle detection framework, which mainly comprises a line subtitle detection model, a line subtitle filtering model, a line subtitle tracking model, and a text classification model.
Referring to Fig. 6, a schematic workflow diagram of a system for recognizing line subtitles in an audio/video file according to an embodiment of the present invention is shown. In practical application, a complete audio/video file is input to the recognition system, and video frame results are extracted from it using the graphics processing unit (GPU) of the hardware device; the video frame results may be partial frame images and full frame images. The partial frame images may be three frame images extracted from the audio/video file within each 1-second period. To reduce the computation of video frame extraction, only part of each video frame may be processed, for example only the lower 1/3 of the frame. The extracted partial frame images are continuously written into a memory queue, and the extracted full frame images may be stored on a hard disk.
Multiple line subtitle detection models are started based on multithreading; they continuously read partial frame images from the memory queue and stitch the read partial frame images together. For example, 3 successively read partial frame images are stitched into 1 frame image. The line subtitle detection model then locates all text boxes in the stitched frame image.
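A sketch of the extraction and stitching steps follows; vertical stacking and the lower-third crop boundary are assumptions consistent with the description above:

```python
import numpy as np

def lower_third(frame):
    """Keep only the lower 1/3 of a video frame, where subtitles usually sit."""
    h = frame.shape[0]
    return frame[2 * h // 3:]

def stitch_partial_frames(partial_frames):
    """Stitch successively read partial frames (e.g. 3 per second) into
    one image for the line subtitle detection model."""
    return np.vstack(partial_frames)
```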
The line subtitle filtering model counts how frequently all text boxes occur over the sampled time, determines a high-frequency heat map area from these occurrence frequencies, uses the high-frequency heat map area as the filtering reference between line subtitles and non-line subtitles, and filters all text boxes into line subtitle text boxes and non-line subtitle text boxes. The non-line subtitle text boxes are discarded and do not participate in subsequent processing.
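A minimal sketch of the heat-map filtering; the frequency threshold of 0.5 and the box/heat-map overlap test are assumed details, not fixed by the patent:

```python
import numpy as np

def filter_text_boxes(boxes, frame_shape, freq_threshold=0.5):
    """boxes: (x0, y0, x1, y1) text boxes collected over all sampled
    frames; frame_shape: (height, width). Returns (line_subtitle_boxes,
    non_line_subtitle_boxes)."""
    heat = np.zeros(frame_shape, dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        heat[y0:y1, x0:x1] += 1.0                 # occurrence frequency
    heat /= heat.max() if heat.max() > 0 else 1.0
    hot = heat >= freq_threshold                  # high-frequency heat map area
    line, non_line = [], []
    for box in boxes:
        x0, y0, x1, y1 = box
        overlap = hot[y0:y1, x0:x1].mean() if (y1 > y0 and x1 > x0) else 0.0
        (line if overlap > 0.5 else non_line).append(box)
    return line, non_line
```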
The line subtitle tracking model extracts optical character recognition (OCR) depth features from the line subtitle text boxes and performs tracking based on the full frame images stored on the hard disk, obtaining the start and stop time of each line subtitle text box in the audio/video file.
The text classification model judges the language of each line subtitle text box, and the depth features are passed to the corresponding OCR prediction network according to the language. Each OCR prediction network recognizes the line subtitles in the line subtitle text boxes with its own text recognition algorithm to obtain the line results. For example, the language may be Chinese or English; Chinese corresponds to a Chinese OCR prediction network and English to an English OCR prediction network.
The recognition system provided by the embodiment of the present invention builds a software development kit (SDK) for frame-level line subtitle recognition. For the same audio/video file, the ratio of the time taken to recognize line subtitles on the GPU to the time taken through the SDK is about 1:0.11, so the speed and accuracy of line subtitle recognition are greatly improved. External subtitles for the audio/video file can be generated from the recognized line results, quickly converting the embedded subtitles of the audio/video file into external subtitles.
When the recognition system provided by the embodiment of the present invention filters text boxes, it uses the time-domain information of the text boxes to filter them into line subtitle text boxes and non-line subtitle text boxes, and the filtering accuracy is high: over 99% when the audio/video file is a film or television drama, and over 98.5% when it is a variety show.
The recognition system provided by the embodiment of the present invention can also extract video frames for designated areas in variety programs, such as name strips, lyric boards, and lyrics, and further perform positioning, filtering, tracking, classification, and recognition of names, lyrics, and the like, thereby realizing character recognition of names, lyrics, and similar text.
The recognition system provided by the embodiment of the present invention can recognize Chinese and English line subtitles, can intelligently recognize Chinese, Korean, and other line subtitles in bilingual audio/video files, and supports line subtitle recognition for multilingual audio/video files.
The communication bus mentioned for the above terminal may be a peripheral component interconnect standard (Peripheral Component Interconnect, abbreviated as PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated as EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP); it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which instructions are stored which, when run on a computer, cause the computer to perform the text type classification method according to any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of classifying text types as described in any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), or the like.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
In this specification, each embodiment is described in a related manner, and for identical or similar parts among the embodiments, reference may be made to one another; each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the method embodiment descriptions.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention is included in the protection scope of the present invention.
Claims (10)
1. A method for classifying text types, comprising:
acquiring a text mask image to be processed, wherein the text mask image comprises at least one text region;
determining, for each of the text regions, a non-mask region of the text region, and dividing each text region into a first text mask region and a second text mask region based on each non-mask region;
determining, for each of the text regions, a text type of the first text mask region based on a positional relationship between the non-mask region and the first text mask region;
determining, for each of the text regions, a text type of the second text mask region based on a positional relationship among the non-mask region, the second text mask region and the text mask image;
the text mask image is an image obtained after filtering text regions in a text detection image;
the text type comprises any one of a name text region and a line text region;
the non-mask region is a region of the text region whose pixel values meet a preset pixel condition.
2. The method of claim 1, wherein said determining a non-mask region of each text region comprises:
acquiring pixel values of all pixel points in each text region;
taking the region whose pixel values meet the preset pixel condition as the non-mask region of the corresponding text region.
3. The method according to claim 2, wherein said taking the region whose pixel values meet the preset pixel condition as the non-mask region of the corresponding text region comprises:
taking a region whose average pixel value is smaller than a preset pixel threshold as the non-mask region.
4. The method of claim 1, wherein the determining the text type of the first text mask region based on the positional relationship between the non-mask region and the first text mask region comprises:
acquiring a first abscissa of the non-mask region and a second abscissa of the right boundary of the first text mask region;
classifying the first text mask region as a first text type if the value of the second abscissa is smaller than the value of the first abscissa, or classifying the first text mask region as the first text type if the value of the second abscissa is larger than the value of the first abscissa and the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is smaller than a preset value.
5. The method of claim 1, wherein the determining the text type of the second text mask region based on the positional relationship between the non-mask region, the second text mask region, and the text mask image comprises:
acquiring a first abscissa of the non-mask region, a third abscissa of the left boundary of the second text mask region, and a fourth abscissa of the left boundary of the text mask image;
classifying the second text mask region as a second text type if the value of the third abscissa is greater than the value of the first abscissa, or classifying the second text mask region as the second text type if the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is greater than the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
6. The method according to claim 1, wherein the method further comprises:
acquiring a first abscissa of the non-mask region, a second abscissa of the right boundary of the first text mask region, a third abscissa of the left boundary of the second text mask region, and a fourth abscissa of the left boundary of the text mask image;
if the first abscissa, the second abscissa, the third abscissa and the fourth abscissa meet at least one of the preset comparison conditions, classifying the first text mask region as the second text type and classifying the second text mask region as the first text type;
the preset comparison conditions include: the value of the second abscissa is greater than or equal to the value of the first abscissa; the value of the second abscissa is less than or equal to the value of the first abscissa; the absolute value of the difference between the value of the first abscissa and the value of the second abscissa is greater than or equal to a preset value; the value of the third abscissa is less than or equal to the value of the first abscissa; and the absolute value of the difference between the value of the third abscissa and the value of the fourth abscissa is less than or equal to the absolute value of the difference between the value of the third abscissa and the value of the first abscissa.
7. The method according to any one of claims 1 to 6, wherein the acquiring a text mask image to be processed comprises:
accumulating the pixel values of all pixel points in the frame image to which each text region belongs, to obtain a pixel value accumulation result;
and carrying out normalization processing and binarization processing on the pixel value accumulation result to obtain the text mask image.
8. A text type classifying apparatus, comprising:
the image acquisition module is used for acquiring a text mask image to be processed, wherein the text mask image comprises at least one text region;
a region segmentation module, configured to determine, for each text region, a non-mask region of the text region, and to divide each text region into a first text mask region and a second text mask region based on each non-mask region;
a type classification module, configured to determine, for each text region, a text type of the first text mask region based on a positional relationship between the non-mask region and the first text mask region;
The type classification module is further configured to determine, for each text region, a text type of the second text mask region based on a positional relationship among the non-mask region, the second text mask region, and the text mask image;
the text mask image is an image obtained after filtering text regions in a text detection image;
the text type comprises any one of a name text region and a line text region;
the non-mask region is a region of the text region whose pixel values meet a preset pixel condition.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to carry out the method steps of any one of claims 1-7 when executing a program stored on the memory.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
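For illustration, the comparison logic of claims 4 to 6 above can be sketched in Python as follows; the mapping of the first text type to name text and the second text type to line text, the variable names, and the pixel threshold are assumptions made for the example, not disclosed values.

```python
def classify_regions(x_nonmask, first_right_x, second_left_x, image_left_x,
                     near_thresh=20):
    """Classify the two text mask regions split off a text region.

    x_nonmask:     abscissa of the non-mask (separator) region.
    first_right_x: abscissa of the first text mask region's right boundary.
    second_left_x: abscissa of the second text mask region's left boundary.
    image_left_x:  abscissa of the text mask image's left boundary.
    Returns (type_of_first_region, type_of_second_region).
    """
    # Claim 4: the first region is the first text type when its right
    # boundary lies left of the non-mask region, or only slightly past it
    # (within near_thresh pixels, an assumed preset value).
    first_is_first_type = (
        first_right_x < x_nonmask
        or (first_right_x > x_nonmask
            and abs(x_nonmask - first_right_x) < near_thresh)
    )
    # Claim 5: the second region is the second text type when its left
    # boundary lies right of the non-mask region, or is farther from the
    # image's left edge than from the non-mask region.
    second_is_second_type = (
        second_left_x > x_nonmask
        or abs(second_left_x - image_left_x) > abs(second_left_x - x_nonmask)
    )
    # Claim 6: when the comparisons fail, the assignments swap.
    first = "name text" if first_is_first_type else "line text"
    second = "line text" if second_is_second_type else "name text"
    return first, second
```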
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111471108.5A CN114140728B (en) | 2021-12-03 | 2021-12-03 | Text type classification method, device, electronic device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114140728A (en) | 2022-03-04 |
| CN114140728B (en) | 2025-04-01 |
Family
ID=80387889
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111471108.5A Active CN114140728B (en) | 2021-12-03 | 2021-12-03 | Text type classification method, device, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114140728B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9471990B1 (en) * | 2015-10-20 | 2016-10-18 | Interra Systems, Inc. | Systems and methods for detection of burnt-in text in a video |
| CN111405359A (en) * | 2020-03-25 | 2020-07-10 | 北京奇艺世纪科技有限公司 | Method, apparatus, computer device and storage medium for processing video data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |