CN116453133B - Banner Text Detection Method and System Based on Bezier Curve and Key Points - Google Patents
- Publication number
- CN116453133B (application CN202310714974.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- key point
- points
- coordinates
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/42—Document-oriented image-based pattern recognition based on the type of document

- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Image Generation (AREA)
Abstract
The invention discloses a banner text detection method and system based on Bezier curves and key points. First, an initial text box is generated for each text region from the image labels. A fixed threshold is then used to reduce the number of coordinate points on the long sides of the initial text box, a Bezier curve is generated from the reduced coordinate points of each long side, and the two Bezier curves are joined end to end to form a new text box. The label of the text box is converted from boundary coordinate points into key point coordinates and key point widths. A banner text detection network model is then constructed and trained, and finally the trained model is used to detect the text in banner images. The invention addresses the inability of the prior art to frame banner text accurately, and it improves detection speed and uses less data while performing text detection.
Description
Technical Field
The invention belongs to the technical field of natural scene text localization, and in particular relates to a banner text detection method and system based on Bezier curves and key points.
Background
With the continuous development of computer vision, object detection and semantic segmentation techniques keep evolving. These techniques are used to separate the text regions of a banner from its non-text regions, which detects the banner text so that the content of the text regions can then be recognized.
A study of the text regions in banner images shows that banner text often has a large aspect ratio and is distorted. Although object detection methods have made progress on this kind of text, they still suffer from false detections, missed detections, and detected text boxes that contain large non-text areas. Semantic segmentation methods have their own problems with such text: complex post-processing, slow detection, and high hardware requirements. A new text detection method is therefore needed to solve these problems.
Summary of the Invention
To address the inability of the prior art to frame banner text accurately, the invention provides a banner text detection method and system based on Bezier curves and key points.
To achieve this, the invention provides a banner text detection method based on Bezier curves and key points, comprising the following steps:
Step 1, generate the initial text box of each text region from the labels of a public text dataset, reduce the number of coordinate points on the long sides of the text box using a fixed threshold, generate a Bezier curve from the reduced coordinate points of each long side, join the two Bezier curves end to end to form a new text box, and convert the label of the text box from boundary coordinate points into key point coordinates and key point widths;
Step 1.1, select images containing special text such as long text and distorted text from a public text image dataset as the dataset, and generate the initial text box of each text region from the labels of the public text dataset;
Step 1.2, judge the degree of curvature of each long side of the text box using a fixed threshold;
Step 1.3, selectively reduce the coordinate points on the two long sides of the text box according to their degree of curvature;
Step 1.4, use the reduced coordinate points on the two long sides as Bezier control points, generate the two corresponding Bezier curves, and join them end to end to obtain the ground-truth bounding box of the text;
Step 1.5, convert the labels of the public dataset from boundary coordinate points of the text box into key point coordinates and key point widths;
Step 2, construct the banner text detection network model;
Step 3, train the banner text detection network model constructed in step 2 on the key point dataset obtained in step 1;
Step 4, use the trained banner text detection network model to detect the text in banner images.
Further, in step 1.1 the labels of the public text dataset are groups of coordinates arranged clockwise. Each group contains the boundary point coordinates of the text box that frames one text instance. Connecting the coordinates of a group in clockwise order forms a closed polygon, which is the initial text box of that text. Let the number of boundary points per text box in the dataset be $M$. In order, the first $M/2$ points are taken as the upper boundary points and the last $M/2$ points as the lower boundary points. The polyline through the upper boundary points and the polyline through the lower boundary points are the two long sides of the initial text box.
Further, in step 1.2 a fixed threshold is used to judge how strongly each of the two long sides of a text box is curved. For each long side, the distances from its other coordinate points to the line joining its first and last coordinates are compared with the length of that line:

$$D = \begin{cases} \text{straight}, & 0 \le r < T_1 \\ \text{partially curved}, & T_1 \le r < T_2 \\ \text{fully curved}, & r \ge T_2 \end{cases} \tag{1}$$

where $D$ is the degree of curvature of the long side and $r$ is the ratio of the largest distance from a coordinate point on the long side to the line joining its first and last coordinates, to the length of that line. When $r$ is at least 0 and less than $T_1$, the long side is judged to be straight; when $r$ is at least $T_1$ and less than $T_2$, it is judged to be partially curved; when $r$ is at least $T_2$, it is judged to be fully curved. $T_1$ and $T_2$ are preset thresholds.
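The threshold test above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the string labels are hypothetical, and the default thresholds 0.1 and 0.7 are the values the embodiment gives for the two fixed thresholds.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_to_chord(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    chord = math.hypot(b[0] - a[0], b[1] - a[1])
    if chord == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / chord


def classify_side(side: List[Point], t1: float = 0.1, t2: float = 0.7) -> str:
    """Classify one long side by the ratio of its largest deviation to its chord length."""
    if len(side) < 2:
        raise ValueError("a long side needs at least two points")
    start, end = side[0], side[-1]
    chord = math.hypot(end[0] - start[0], end[1] - start[1])
    if chord == 0.0:
        raise ValueError("first and last points coincide")
    # Largest distance of any intermediate point from the first-to-last line.
    farthest = max((distance_to_chord(p, start, end) for p in side[1:-1]), default=0.0)
    ratio = farthest / chord
    if ratio < t1:
        return "straight"
    if ratio < t2:
        return "partially curved"
    return "fully curved"
```

A side whose largest deviation is 3 units over a chord of 10 units has a ratio of 0.3 and is therefore classified as partially curved.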
Further, in step 1.3 let $d$ be the distance from a coordinate point on the long side to the line joining the first and last coordinate points, and let $P_s$ and $P_e$ be the first and last coordinate points. The reduction proceeds as follows. When the long side is judged to be straight, only its first and last coordinate points are kept. When it is judged to be partially curved, the first and last coordinate points and the coordinate point farthest from the line joining them are kept. When it is judged to be fully curved, a threshold $\varepsilon$ is set to 0.1 times the length of the line joining the first and last coordinates; coordinate points with $d$ greater than $\varepsilon$ are kept and the other coordinate points are discarded. Let $P_{\max}$ be the coordinate point with the largest $d$. $P_{\max}$ divides the curve into two parts, $P_s P_{\max}$ and $P_{\max} P_e$, and the operation is repeated on each part until no coordinate point lies farther than $\varepsilon$ from its line.
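The three reduction cases can be sketched as follows. This is a hedged illustration, not the patented implementation: the function names and curvature labels are hypothetical, and the sketch assumes the threshold is computed once from the full long side and reused in every recursive split, which the text does not state explicitly.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_to_chord(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    chord = math.hypot(b[0] - a[0], b[1] - a[1])
    if chord == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / chord


def _split_recursively(points: List[Point], eps: float) -> List[Point]:
    """Keep the farthest point while it exceeds eps, then recurse on both halves."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    index = max(range(1, len(points) - 1),
                key=lambda i: distance_to_chord(points[i], start, end))
    if distance_to_chord(points[index], start, end) <= eps:
        return [start, end]
    left = _split_recursively(points[:index + 1], eps)
    right = _split_recursively(points[index:], eps)
    return left[:-1] + right  # the split point appears in both halves; keep it once


def simplify_side(side: List[Point], level: str) -> List[Point]:
    """Reduce the points of one long side according to its curvature level."""
    if len(side) < 2:
        raise ValueError("a long side needs at least two points")
    start, end = side[0], side[-1]
    if len(side) == 2 or level == "straight":
        return [start, end]
    if level == "partially curved":
        farthest = max(side[1:-1], key=lambda p: distance_to_chord(p, start, end))
        return [start, farthest, end]
    # Fully curved: threshold is 0.1 times the first-to-last line length (assumed fixed).
    eps = 0.1 * math.hypot(end[0] - start[0], end[1] - start[1])
    return _split_recursively(side, eps)
```

In the fully curved case, points that lie almost on the line between two retained points are dropped, so a five-point arc whose second and fourth points barely deviate from the segments around its apex is reduced to three points.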
Further, in step 1.4 the coordinate points remaining on each long side after the reduction are used as the control points of a Bezier curve. The Bezier curve is expressed as a parametric curve with Bernstein polynomials as its basis, defined as follows:

$$B(t) = \sum_{i=0}^{n} P_i \, b_{i,n}(t), \quad 0 \le t \le 1 \tag{2}$$

$$b_{i,n}(t) = \binom{n}{i} \, t^{i} (1-t)^{n-i}, \quad i = 0, 1, \dots, n \tag{3}$$

where $B(t)$ is the set of coordinates of the points on the Bezier curve, $n$ is the order of the Bezier curve, $P_i$ is the coordinate of the $i$-th control point, $b_{i,n}(t)$ is the Bernstein polynomial of the $i$-th control point, $\binom{n}{i}$ is the binomial coefficient, and $t$ is time; $t \in [0, 1]$ covers the coordinates of all points on the Bezier curve. Because the corresponding term evaluates to 0 when $t = 0$ or $t = 1$, the first coordinate point on the long side is taken as the position of the Bezier curve at $t = 0$, and the last coordinate point on the long side is taken as its position at $t = 1$.
Two Bezier curves are generated with Eq. (2), and the closed polygon formed by joining the two Bezier curves end to end is taken as the ground-truth text box of the text instance.
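As an illustration, a Bezier curve in Bernstein form can be evaluated directly from its control points. The sketch below uses the standard Bernstein definition; the function names, the sample count, and the assumption that the lower control points are already ordered to start where the upper curve ends are all illustrative choices, not taken from the patent.

```python
from math import comb
from typing import List, Tuple

Point = Tuple[float, float]


def bernstein(i: int, n: int, t: float) -> float:
    """Bernstein basis polynomial of index i and degree n at parameter t."""
    return comb(n, i) * (t ** i) * ((1 - t) ** (n - i))


def bezier_point(control: List[Point], t: float) -> Point:
    """Point on the Bezier curve defined by the control points at parameter t in [0, 1]."""
    n = len(control) - 1
    x = sum(bernstein(i, n, t) * p[0] for i, p in enumerate(control))
    y = sum(bernstein(i, n, t) * p[1] for i, p in enumerate(control))
    return (x, y)


def bezier_curve(control: List[Point], samples: int = 20) -> List[Point]:
    """Sample the curve at evenly spaced parameters, including both end points."""
    if samples < 2:
        raise ValueError("need at least two samples")
    return [bezier_point(control, k / (samples - 1)) for k in range(samples)]


def closed_text_box(upper: List[Point], lower: List[Point], samples: int = 20) -> List[Point]:
    """Join the two sampled curves end to end into one closed polygon.

    Assumes `lower` is ordered so that it starts near the end of `upper`
    (for clockwise labels, the lower side runs right to left).
    """
    return bezier_curve(upper, samples) + bezier_curve(lower, samples)
```

In Python, `0.0 ** 0` evaluates to `1.0`, so the sampled curve starts exactly at the first control point and ends exactly at the last one without any special handling of the end parameters.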
Further, in step 1.5 the boundary points on the two long sides are converted into a set of key points that represents the text box. Before the conversion, the number of boundary points on the upper and lower long sides is made equal by raising the side with fewer points to the count of the other side, as follows. When one side is straight and the other is partially curved, the midpoint of the straight side is added as a boundary point, so that both sides have three boundary points. When one side is straight and the other is fully curved, the straight side is divided into equal parts according to the number of coordinate points on the fully curved side, and the division points are taken as boundary points, so that both sides have the same number of boundary points. When one side is partially curved and the other is fully curved, the two segments of the partially curved side are divided into equal parts according to the number of coordinate points on the fully curved side minus the number on the partially curved side, and the division points are taken as boundary points, so that both sides have the same number of boundary points.
Once the upper and lower sides have the same number of boundary points, the boundary points are converted. The coordinates of the upper and lower sides are paired one by one from first to last. The midpoint of each pair is taken as a key point coordinate, and half the distance between the paired points is taken as the width of that key point. The labels of the public image text dataset are thereby converted from bounding box coordinate points into a set of key point coordinates and their widths.
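The final conversion from paired boundary points to key points can be sketched as below. The function name is hypothetical, and the sketch assumes both sides have already been brought to the same length and are ordered in the same direction (for clockwise labels this means the lower side has been reversed first).

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def to_keypoints(upper: List[Point], lower: List[Point]) -> Tuple[List[Point], List[float]]:
    """Convert paired boundary points into key point coordinates and widths.

    Each key point is the midpoint of a pair; its width is half the pair distance.
    """
    if len(upper) != len(lower):
        raise ValueError("both long sides must have the same number of points")
    keypoints: List[Point] = []
    widths: List[float] = []
    for (ux, uy), (lx, ly) in zip(upper, lower):
        keypoints.append(((ux + lx) / 2, (uy + ly) / 2))
        widths.append(math.hypot(ux - lx, uy - ly) / 2)
    return keypoints, widths
```

For a straight box with upper side from (0, 0) to (10, 0) and lower side from (0, 4) to (10, 4), this yields key points (0, 2) and (10, 2), each with width 2.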
Further, the banner text detection network model in step 2 comprises a feature extraction module, a feature fusion module, a regression module, and a text box generation module. The feature extraction module extracts feature information at different levels and produces feature maps, from low level to high level, that carry semantic information. The feature fusion module merges the feature maps of different levels into a fused feature map used for the subsequent banner text detection. The regression module regresses the shape of each text instance together with its key point coordinates and key point widths. The text box generation module generates the text boxes of the banner image from the key point coordinates and widths in the current image.
The backbone of the feature extraction module is a ResNet-50 model. After an image is fed into ResNet-50, four feature maps $R_2$, $R_3$, $R_4$, $R_5$ are obtained in turn through channel expansion and downsampling. The channel counts of these four feature maps of different scales are unified to give $M_2$, $M_3$, $M_4$, $M_5$. Starting from the lowest-scale feature map $M_5$, the map is upsampled and added to the same-scale feature map $M_4$ at the input of the FPN structure, giving the fused lower-scale feature map $F_4$. $F_4$ is upsampled and added to $M_3$ to give the fused low-scale feature map $F_3$, and $F_3$ is likewise upsampled and added to $M_2$ to give the fused feature map $F_2$. Finally, the fused feature maps $F_2$, $F_3$, $F_4$, $F_5$ are the output of the FPN, where $F_5$ denotes the top-level map $M_5$.
The feature fusion module merges the fused feature maps of different scales into a single merged feature map $F$, computed as follows:

$$F = \mathrm{Concat}\big(F_2,\ \mathrm{Up}_{\times 2}(F_3),\ \mathrm{Up}_{\times 4}(F_4),\ \mathrm{Up}_{\times 8}(F_5)\big) \tag{4}$$

where $\mathrm{Concat}$ denotes channel concatenation, $\mathrm{Up}_{\times 2}$, $\mathrm{Up}_{\times 4}$ and $\mathrm{Up}_{\times 8}$ denote 2x, 4x and 8x upsampling, and $F_2$, $F_3$, $F_4$, $F_5$ are the fused feature maps.
The merged feature map $F$ is then upsampled so that $F$ has the same size as the original image.
The regression module consists of shape regression and key point regression. Shape regression passes the merged feature map $F$ through a convolutional layer with an activation function to obtain a text shape feature map. This map is binarized with a threshold $\tau$: regions above $\tau$ are treated as text and regions below $\tau$ as background, which gives a binary text shape map in which text is separated from background. The text contour in the binary map is compared with the text box generated from the key point labels of the image, and the two are matched by their intersection over union (IoU).
Key point regression takes the merged feature map $F$ as input and outputs key point coordinates and widths through two branches. The first branch outputs $K$ key point heat maps, where $K$ is the largest number of key points in any text instance of the image under detection. In each heat map, the $N$ highest-scoring highlighted points are selected as the key point coordinates of that heat map; these are the key points of that class for each text instance in the image, where $N$ is the number of text instances in the image. For text instances with fewer key points, the number of highlighted points is reduced accordingly. The second branch outputs the width values, which correspond one to one with the key points; for text instances with fewer key points, the remaining width values are set to 0.
The text box generation module takes the key point coordinates and widths output by the regression module as the text instance information and uses them to generate the text box. The width of a key point is the distance from the key point to its corresponding long-side coordinate point. The line joining two adjacent key points is taken as the normal of the line joining the key point and its long-side coordinate points: from each key point, a segment perpendicular to the line through the adjacent key points is extended upward and downward by the width of that key point, and its end points are the long-side coordinate points. Processing every key point in this way gives two sets of long-side coordinate points, each containing as many points as there are key points. The long-side coordinate points are used as Bezier control points to generate two Bezier curves, which are joined end to end to give a fully closed curved box; this box is the text box of the text instance. Finally, the image with the framed text is output, completing text detection for the banner image.
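The recovery of long-side points from key points and widths can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, image coordinates with y increasing downward are assumed, and the local direction at an interior key point is estimated from its previous and next neighbours, whereas the text only says that two adjacent key points define it.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def boundary_from_keypoints(
    keypoints: List[Point], widths: List[float]
) -> Tuple[List[Point], List[Point]]:
    """Offset each key point perpendicular to the local key point direction by its width."""
    count = len(keypoints)
    if count < 2 or count != len(widths):
        raise ValueError("need at least two key points and one width per key point")
    upper: List[Point] = []
    lower: List[Point] = []
    for i, ((x, y), width) in enumerate(zip(keypoints, widths)):
        # Direction of the center line, taken from the neighbouring key points.
        prev_x, prev_y = keypoints[max(i - 1, 0)]
        next_x, next_y = keypoints[min(i + 1, count - 1)]
        dx, dy = next_x - prev_x, next_y - prev_y
        length = math.hypot(dx, dy)
        if length == 0.0:
            raise ValueError("adjacent key points coincide")
        # Unit vector perpendicular to the center line (points downward in image coordinates
        # when the key points run left to right).
        nx, ny = -dy / length, dx / length
        upper.append((x - nx * width, y - ny * width))
        lower.append((x + nx * width, y + ny * width))
    return upper, lower
```

The two returned point lists can then serve as control points for the two Bezier curves that close the text box. For key points (0, 2) and (10, 2) with width 2 each, the sketch returns the upper points (0, 0), (10, 0) and the lower points (0, 4), (10, 4).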
Further, in step 3 the key point dataset obtained in step 1 is divided into a training set and a test set. The training set is fed into the banner text detection network model for iterative training, and the parameters of the model are updated to minimize the loss function. The accuracy of the model on the test set is recorded and the best model is saved. Training consists of shape detection training and key point detection training, and the corresponding loss function $L$ is computed as follows:

$$L = L_{s} + \lambda L_{k} \tag{5}$$

where $L_{s}$ is the shape loss function, $L_{k}$ is the key point loss function, and $\lambda$ is the weight factor of the loss function.
The shape loss function $L_{s}$ is computed as follows:

$$L_{s} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v \tag{6}$$

where $\mathrm{IoU}$ is the intersection over union of the regressed text contour and the text box generated from the key point labels, and $b$ and $b^{gt}$ are the center point coordinates of the regressed text contour and of the text box generated from the key point labels, respectively. The center point of the regressed text contour is the median key point of the contour, taken in clockwise order; when the number of key points is even, the midpoint of the line joining the two middle key points is used. The center point of the text box generated from the key point labels is defined in the same way on the key points of that text box. $\rho(\cdot)$ is the Euclidean distance between the two center points, $c$ is the diagonal length of the smallest enclosing region that contains both the regressed text contour and the text box generated from the key point labels, $\alpha$ is an adjustment factor that balances the weight between overlap area and aspect ratio similarity, and $v$ measures aspect ratio similarity.
The key point loss function $L_{k}$ consists of a key point coordinate part and a width part, computed as follows:

$$L_{k} = L_{c} + \lambda_{w} L_{w} \tag{7}$$

$$L_{c} = -\frac{1}{N} \sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} \begin{cases} \left(1 - \hat{Y}_{chw}\right)^{\alpha} \log\left(\hat{Y}_{chw}\right), & Y_{chw} = 1 \\ \left(1 - Y_{chw}\right)^{\beta} \left(\hat{Y}_{chw}\right)^{\alpha} \log\left(1 - \hat{Y}_{chw}\right), & \text{otherwise} \end{cases} \tag{8}$$

$$L_{w} = \frac{1}{N} \sum_{j} \left| \hat{s}_{j} - s_{j} \right| \tag{9}$$

where $L_{c}$ is the key point coordinate loss function, $L_{w}$ is the key point width loss function, $\lambda_{w}$ is a weight factor, $N$ is the number of text instances in the image, $C$ is the number of channels of the regressed key point heat map, and $H$ and $W$ are the height and width of the regressed key point heat map. $\hat{Y}_{chw}$ is the score of point $(c, h, w)$ in the key point heat map regressed by the regression module, and $Y_{chw}$ is the score of the same point in the ground-truth key point heat map, which is computed with a Gaussian function from the image's key point labels. $\alpha$ and $\beta$ are hyperparameters that control the contribution of each key point; the factor $(1 - Y_{chw})^{\beta}$ reduces the penalty on points around the key point coordinates. $\hat{s}_{j}$ and $s_{j}$ are the regressed and labeled widths of the $j$-th key point, and $|\cdot|$ returns the absolute value of its argument.
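To make the heat map terms concrete, the sketch below builds a ground-truth heat map with a Gaussian function and evaluates a penalty-reduced focal loss of the kind used by CenterNet-style key point detectors, which is assumed here because it matches the symbols described above. The function names, the Gaussian standard deviation `sigma`, and the defaults `alpha=2.0` and `beta=4.0` are assumptions, not values from the patent.

```python
import math
from typing import List, Tuple

Heatmap = List[List[float]]


def gaussian_heatmap(height: int, width: int,
                     keypoints: List[Tuple[int, int]], sigma: float = 1.0) -> Heatmap:
    """Ground-truth heat map: score 1 at each key point (x, y), decaying as a Gaussian."""
    heat = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                value = math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
                heat[y][x] = max(heat[y][x], value)
    return heat


def keypoint_focal_loss(pred: Heatmap, target: Heatmap, num_instances: int,
                        alpha: float = 2.0, beta: float = 4.0) -> float:
    """Penalty-reduced focal loss over one heat map channel (assumed form)."""
    eps = 1e-12  # keeps the logarithms finite
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)
            if y == 1.0:
                total += (1.0 - p) ** alpha * math.log(p)
            else:
                # (1 - y) ** beta shrinks the penalty for points near a key point.
                total += (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return -total / max(num_instances, 1)
```

A prediction that scores high at the key point and low elsewhere gives a loss close to zero, while a prediction with the scores inverted is penalized heavily.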
To speed up model convergence, coordinate points in non-text regions are ignored during key point coordinate regression, which reduces the number of negative samples. After the banner text detection network model has been trained on the training set, the test set is fed into the model, the accuracy and speed of text detection are compared, and the best detection model is selected.
The invention also provides a banner text detection system based on Bezier curves and key points, which is used to implement the banner text detection method based on Bezier curves and key points described above.
Further, the system comprises a processor and a memory. The memory stores program instructions, and the processor calls the instructions stored in the memory to execute the banner text detection method based on Bezier curves and key points described above.
Compared with the prior art, the invention has the following advantages:
1) A set of key points is regressed instead of a rectangular box, and a text box generated from the key points replaces a fixed-shape anchor box as the representation of a text instance. The long sides of the text box are replaced by Bezier curves, whose flexible shape adapts to different text shapes. This avoids the problem that a fixed-shape anchor box cannot represent the shape of a text instance accurately.
2) To reduce the computational load of text box coordinate regression, the number of long-side coordinate points is reduced adaptively when the key point labels are produced. Regressing key point coordinates and widths instead of text box coordinates significantly lowers the computational cost of the regression labels, so that detection speed is improved and less data is used while text detection is performed.
Brief Description of the Drawings
Fig. 1 is a flowchart of an embodiment of the invention.
Fig. 2 is a structural diagram of the banner text detection network according to an embodiment of the invention.
Detailed Description
The invention provides a banner text detection method and system based on Bezier curves and key points. The technical solution of the invention is further described below with reference to the drawings and an embodiment.
Embodiment 1
As shown in Fig. 1, the invention provides a banner text detection method based on Bezier curves and key points, comprising the following steps:
Step 1, generate the initial text box of each text region from the labels of a public text dataset, reduce the number of coordinate points on the long sides of the text box using a fixed threshold, generate a Bezier curve from the reduced coordinate points of each long side, join the two Bezier curves end to end to form a new text box, and convert the label of the text box from boundary coordinate points into key point coordinates and key point widths.
Step 1.1, select images containing special text such as long text and distorted text from a public text image dataset as the dataset, and generate the initial text box of each text region from the labels of the public text dataset.
The labels of the public text dataset are groups of coordinates arranged clockwise. Each group contains the boundary point coordinates of the text box that frames one text instance. Connecting the coordinates of a group in clockwise order forms a closed polygon, which is the text box of that text, that is, the initial text box. In this embodiment the ctw-1500 dataset has 14 boundary points per text box. In order, the first 7 are taken as the upper boundary points and the last 7 as the lower boundary points. The polyline through the upper boundary points (1 to 7) and the polyline through the lower boundary points (8 to 14) are the long sides of the initial text box.
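The split of a clockwise label into its two long sides is a simple slicing operation, sketched below. The function name is hypothetical, and the only assumption beyond the text is that the number of boundary points is even.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def split_long_sides(boundary: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Split a clockwise boundary label into its upper and lower long sides."""
    if len(boundary) < 4 or len(boundary) % 2 != 0:
        raise ValueError("expected an even number of boundary points, at least four")
    half = len(boundary) // 2
    # First half: upper boundary points; second half: lower boundary points.
    return boundary[:half], boundary[half:]
```

Because the points are listed clockwise, the lower side produced this way runs in the opposite horizontal direction to the upper side, which matters later when upper and lower points are paired.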
Step 1.2, judge the degree of curvature of each long side of the text box using a fixed threshold.
For each long side of a text box in the dataset, the distances from its other coordinate points to the line joining its first and last coordinates are compared with the length of that line, using fixed thresholds, to judge how strongly the two long sides are curved:

$$D = \begin{cases} \text{straight}, & 0 \le r < 0.1 \\ \text{partially curved}, & 0.1 \le r < 0.7 \\ \text{fully curved}, & r \ge 0.7 \end{cases} \tag{1}$$

where $D$ is the degree of curvature of the long side and $r$ is the ratio of the largest distance from a coordinate point on the long side to the line joining its first and last coordinates, to the length of that line. When $r$ is at least 0 and less than 0.1, the long side is judged to be straight; when $r$ is at least 0.1 and less than 0.7, it is judged to be partially curved; when $r$ is at least 0.7, it is judged to be fully curved.
Step 1.3: selectively simplify the coordinate points of the two long sides according to their degree of curvature.
Let $d$ be the distance from a coordinate point on the long side to the line joining the first and last coordinate points, and let the first and last coordinate points be $P_s$ and $P_e$. The simplification proceeds as follows. When the long side is judged to be straight, only its first and last points are kept. When the long side is judged to be partially curved, the first and last points and the point farthest from the line joining them are kept. When the long side is judged to be fully curved, a procedure inspired by the Douglas-Peucker algorithm is used: the threshold $\varepsilon$ is set to 0.1 times the length of the line joining the first and last coordinates; when $d$ is greater than $\varepsilon$, the corresponding point is kept and the other points are discarded; let $P_m$ be the point with the largest $d$, use $P_m$ to split the curve into two parts, $P_s P_m$ and $P_m P_e$, and repeat the operation on each part until no point lies farther than $\varepsilon$ from its line.
These operations reduce the number of coordinate points on the long sides and therefore the number of labels, which lowers the computation required by the regression module of the banner image text detection model and increases detection speed.
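The three simplification cases can be sketched as below. This is a sketch under stated assumptions: the names `simplify_side`, `_split`, and `_dist` are hypothetical, the curvature label is passed in as a string, and the threshold is computed once from the whole side and reused in the recursion (the text does not say whether it is recomputed per segment).

```python
import math

def _dist(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    chord = math.hypot(b[0] - a[0], b[1] - a[1])
    if chord == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs((b[0] - a[0]) * (p[1] - a[1])
               - (b[1] - a[1]) * (p[0] - a[0])) / chord

def _split(points, eps):
    """Douglas-Peucker-style recursion: keep the farthest point while it exceeds eps."""
    if len(points) < 3:
        return list(points)
    dists = [_dist(p, points[0], points[-1]) for p in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__) + 1  # index of farthest point
    if dists[k - 1] <= eps:
        return [points[0], points[-1]]
    left = _split(points[:k + 1], eps)
    right = _split(points[k:], eps)
    return left[:-1] + right  # drop the duplicated split point

def simplify_side(points, kind):
    """kind is 'straight', 'partial', or 'full' (assumed labels)."""
    first, last = points[0], points[-1]
    if kind == "straight":
        return [first, last]
    if kind == "partial":
        farthest = max(points[1:-1], key=lambda p: _dist(p, first, last))
        return [first, farthest, last]
    eps = 0.1 * math.hypot(last[0] - first[0], last[1] - first[1])
    return _split(points, eps)
```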
Step 1.4: use the simplified coordinate points on the two long sides as Bezier control points, generate the two corresponding Bezier curves, and connect the two curves end to end to obtain the ground-truth bounding box of the text.
The simplified coordinate points on each long side serve as the control points of a Bezier curve. The Bezier curve is a parametric curve expressed in the Bernstein polynomial basis, defined as follows:
$$ B(t) = \sum_{i=0}^{n} P_i \, b_{i,n}(t), \quad 0 \le t \le 1 \tag{2} $$
$$ b_{i,n}(t) = \binom{n}{i} \, t^{i} (1-t)^{n-i}, \quad i = 0, 1, \dots, n \tag{3} $$
where $B(t)$ denotes the set of coordinates of the points on the Bezier curve, $n$ denotes the order of the Bezier curve, $P_i$ denotes the coordinates of the $i$-th control point, $b_{i,n}(t)$ denotes the Bernstein polynomial of the $i$-th control point, $\binom{n}{i}$ denotes the binomial coefficient, and $t$ denotes time; as $t$ ranges over $[0, 1]$, $B(t)$ gives the coordinates of all points on the Bezier curve. Because the value computed at $t = 0$ or $t = 1$ is 0, the endpoints are assigned directly: when $t = 0$, the first coordinate point on the long side is taken as the position of the Bezier curve at time 0, and when $t = 1$, the last coordinate point on the long side is taken as the position of the Bezier curve at time 1.
Two Bezier curves are generated with formula (2), and the closed polygon formed by connecting the two curves end to end is taken as the ground-truth text box of the text instance.
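A Bezier curve in Bernstein form, with the two endpoints pinned to the first and last control points as the description requires, can be evaluated with a few lines of Python. The function names `bezier_point` and `bezier_curve` and the `samples` parameter are illustrative assumptions.

```python
from math import comb

def bezier_point(control, t):
    """Evaluate the Bezier curve defined by `control` at parameter t in [0, 1]."""
    n = len(control) - 1  # curve order
    if t == 0:
        return control[0]   # endpoint assigned directly
    if t == 1:
        return control[-1]  # endpoint assigned directly
    x = y = 0.0
    for i, (px, py) in enumerate(control):
        basis = comb(n, i) * t ** i * (1 - t) ** (n - i)  # Bernstein polynomial
        x += px * basis
        y += py * basis
    return (x, y)

def bezier_curve(control, samples=20):
    """Sample the curve at `samples` evenly spaced parameter values (samples >= 2)."""
    return [bezier_point(control, k / (samples - 1)) for k in range(samples)]
```

With three control points `(0, 0)`, `(5, 10)`, `(10, 0)`, the basis values at `t = 0.5` are 0.25, 0.5, 0.25, so the curve passes through `(5.0, 5.0)`.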
Step 1.5: convert the labels of the public dataset from text-box boundary coordinates to key point coordinates and key point widths.
To further reduce the number of labels in the dataset and improve detection efficiency, the boundary points on the two long sides are converted into one set of key points that represents the text box. Before the conversion, the upper and lower long sides must have the same number of boundary points. Because the two long sides may differ in curvature, the counts are equalized upward, to the count of the more curved side, as follows:
When the two sides are straight and partially curved respectively, the midpoint of the straight side is extracted as an additional boundary point, so that each side has three boundary points. When the two sides are straight and fully curved respectively, the straight side is divided equally according to the number of coordinate points on the fully curved side, and the division points are extracted so that both sides have the same number of boundary points. When the two sides are partially curved and fully curved respectively, the two curve segments of the partially curved side are divided equally according to the number of coordinate points on the fully curved side minus the number on the partially curved side, and the division points are extracted so that both sides have the same number of boundary points.
Once the numbers of upper and lower boundary points are equal, the boundary points are converted. The coordinates on the upper and lower sides are paired one to one from first to last; the midpoint of each pair is taken as the key point coordinate, and half the distance between the pair is taken as the key point width. The labels of the public image text dataset are thereby converted from bounding-box coordinate points into a set of key point coordinates with corresponding widths, which completes the key-point-based label construction.
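The pairing step, and the equal division of a straight side, can be sketched as follows. The names `divide_straight` and `to_keypoints` are hypothetical, and the sketch assumes both sides are already listed in the same direction (for example left to right), so that the i-th upper point pairs with the i-th lower point.

```python
import math

def divide_straight(a, b, n):
    """Return n equally spaced points from a to b inclusive (n >= 2)."""
    return [(a[0] + (b[0] - a[0]) * k / (n - 1),
             a[1] + (b[1] - a[1]) * k / (n - 1)) for k in range(n)]

def to_keypoints(upper, lower):
    """Convert paired boundary points into (center, width) key point labels."""
    if len(upper) != len(lower):
        raise ValueError("both sides must have the same number of points")
    keypoints = []
    for (ux, uy), (lx, ly) in zip(upper, lower):
        center = ((ux + lx) / 2, (uy + ly) / 2)        # midpoint of the pair
        width = math.hypot(ux - lx, uy - ly) / 2       # half the pair distance
        keypoints.append((center, width))
    return keypoints
```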
Step 2: build the banner text detection network model.
The banner text detection network model comprises a feature extraction module, a feature fusion module, a regression module, and a text box generation module. The feature extraction module extracts feature information at different levels, producing feature images with semantic information from low level to high level. The feature fusion module superimposes and merges the feature images of different levels into a fused feature image used for the subsequent banner text detection. The regression module regresses the shape of each text instance together with its key point coordinates and key point widths. The text box generation module generates the banner image text boxes from the key point coordinates and width information in the current image.
The model first uses ResNet-50 as the backbone network to extract four feature images of different scales, and uses an FPN (feature pyramid network) to merge them successively, giving four fused feature images of different scales. The fused feature images are upsampled by the corresponding factors to obtain four feature images of the same scale, which are then combined into a single fused feature image; this image is upsampled by a factor of four so that it matches the size of the original image. A regression operation on the fused feature image yields two parts of regression data. The regressed text outline shape is compared with the ground-truth text box shape to judge their similarity, and the key point regression data are passed to the text box generation module, which uses the key point coordinates and widths to obtain the coordinates of two sets of long-side control points, converts these control points into two Bezier curves, and connects the curves to obtain the final text box of the text instance.
The backbone of the feature extraction module is the ResNet-50 model. After an image is fed to ResNet-50, it is first downsampled so that its height and width are each reduced to 1/4 of the original while the number of channels increases from 3 to 64. A 1×1 convolution kernel then increases the number of channels from 64 to 256 with the height and width unchanged, giving the first feature map $C_1$. Channel expansion and downsampling are then applied to this feature map, halving its height and width while doubling the number of channels, which gives the second feature map $C_2$; repeating this operation yields the four feature images $C_1$, $C_2$, $C_3$, $C_4$ in turn. ResNet-50 with its fully connected layer removed is combined with the FPN structure, and the four feature images of different scales from the backbone are used as the FPN input. Before feature images of different scales can be fused, their channel counts must be unified, so a 1×1 convolution kernel is added at the FPN input to reduce the number of channels of each feature image to 256, giving $M_1$, $M_2$, $M_3$, $M_4$. Starting from the lowest-resolution feature map $M_4$, nearest-neighbor interpolation is used for 2x upsampling and the result is added to the same-scale feature map $M_3$ at the FPN input, giving the fused feature image $P_3$. $P_3$ is again upsampled 2x by nearest-neighbor interpolation and added to $M_2$, giving the fused feature image $P_2$; likewise $P_2$ is upsampled 2x and added to $M_1$, giving the fused feature image $P_1$. Finally, the fused feature images $P_1$, $P_2$, $P_3$, $P_4$ (with $P_4 = M_4$) are the output of the FPN.
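The top-down pathway described above (2x nearest-neighbor upsampling followed by element-wise addition with the same-scale map) can be illustrated on toy data. This is a simplified sketch, not the network itself: each feature map is a single-channel 2D list, the 1×1 channel-reduction convolutions are omitted, and the names `upsample_nearest`, `add_maps`, and `top_down` are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling of a 2D map given as a list of rows."""
    return [[v for v in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]

def add_maps(a, b):
    """Element-wise sum of two maps of identical size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down(laterals):
    """laterals: maps ordered from highest to lowest resolution, each half the
    size of the previous one. Returns the fused maps in the same order."""
    fused = [laterals[-1]]  # the lowest-resolution map passes through unchanged
    for lateral in reversed(laterals[:-1]):
        fused.insert(0, add_maps(lateral, upsample_nearest(fused[0])))
    return fused
```

With four all-ones maps of sizes 8, 4, 2, and 1, the fused maps hold the values 4, 3, 2, and 1 respectively, since each level accumulates everything below it.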
The feature fusion module merges the fused feature images of different scales into a single merged feature image $F$, computed as follows:
$$ F = P_1 \,\|\, \mathrm{Up}_{\times 2}(P_2) \,\|\, \mathrm{Up}_{\times 4}(P_3) \,\|\, \mathrm{Up}_{\times 8}(P_4) \tag{4} $$
where $\|$ denotes channel concatenation, $\mathrm{Up}_{\times 2}$, $\mathrm{Up}_{\times 4}$ and $\mathrm{Up}_{\times 8}$ denote 2x, 4x and 8x upsampling respectively, and $P_1$, $P_2$, $P_3$, $P_4$ are the fused feature images output by the FPN.
A 3×3 convolutional layer (with a BN layer and a ReLU layer to speed up model convergence and reduce model parameters) reduces the number of channels of $F$ to 256, and $F$ is then upsampled by a factor of 4 so that it has the same size as the original image.
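The concatenation in formula (4) can be illustrated with plain Python lists, where a feature map is a list of 2D channels. This sketch only shows the upsample-then-concatenate bookkeeping; the 3×3 convolution with BN and ReLU and the final 4x upsampling are not modeled, and the names `upsample_nearest` and `fuse` are hypothetical.

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one 2D channel (list of rows)."""
    return [[v for v in row for _ in range(factor)]
            for row in channel for _ in range(factor)]

def fuse(pyramid):
    """pyramid: four feature maps ordered from highest to lowest resolution,
    each a list of 2D channels and each half the size of the previous one.
    Returns the channel-wise concatenation at the highest resolution."""
    fused = []
    for level, fmap in enumerate(pyramid):
        factor = 2 ** level  # 1x, 2x, 4x, 8x
        fused.extend(upsample_nearest(ch, factor) for ch in fmap)
    return fused
```

If each pyramid level has 256 channels, the concatenated map has 4 × 256 = 1024 channels, which is why a channel-reducing convolution follows.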
The regression module has two parts: shape regression and key point regression. Shape regression passes the fused feature map $F$ through a 3×3 convolutional layer with a Sigmoid activation function to obtain a text shape feature map, which is binarized with a threshold $T = 0.5$: regions above the threshold are text regions and regions below it are background, giving a binary text shape map in which text is separated from background. The text outline shapes in this binary map are compared with the text box shapes generated from the image's key point labels, and the two are matched by their intersection over union (IoU). Because one image contains several texts, a key point may be matched to another text; the role of shape regression is to ensure that each key point lies inside the corresponding text outline shape, avoiding false detections. The input of key point regression is the fused feature map $F$, which is fed to two separate detection branches. One branch detects the key point coordinates: through a 3×3 convolutional layer and a 1×1 convolutional layer it outputs $K$ key point heat maps, where $K$ is the largest number of key points of any text instance in the detected image. In each heat map, the $N$ highest-scoring highlighted points are selected as the key point coordinates of that heat map; they are also the key point coordinates, for this class of key point, of each text instance in the image, where $N$ is the number of text instances in the detected image. For text instances with fewer key points, the number of highlighted coordinates is reduced accordingly. The other branch detects the key point widths: through a 3×3 convolutional layer and a 1×1 convolutional layer it outputs $K \times N$ width values, where $N$ is the number of text instances in the detected image. The width values correspond one to one with the key points; for text instances with fewer key points, the remaining width values are set to 0.
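The two post-processing steps on the regression outputs, thresholding the shape map and picking the highest-scoring heat-map locations, can be sketched as follows. The function names are hypothetical, and two simplifications are assumed: scores exactly equal to the threshold count as background (the text does not specify this case), and the top-N selection uses raw scores without any peak filtering.

```python
def binarize(shape_map, threshold=0.5):
    """Return 1 for text pixels (score above threshold), 0 for background."""
    return [[1 if v > threshold else 0 for v in row] for row in shape_map]

def top_keypoints(heatmap, n):
    """Return the (x, y) positions of the n highest scores, best first."""
    scored = [(v, x, y) for y, row in enumerate(heatmap)
              for x, v in enumerate(row)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(x, y) for _, x, y in scored[:n]]
```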
The text box generation module takes the key point coordinates and widths output by the regression module as the text instance information and uses them to generate the text box. Specifically, the key point width is the distance from the key point to the corresponding long-side coordinate points. The line joining two adjacent key points is taken as the normal of the line joining the key point and its long-side coordinate points, so each key point is extended perpendicular to that line, upward and downward, by its width, and the two end points are the long-side coordinate points. Every key point is processed in this way, giving two sets of long-side coordinate points, each with the same number of points as there are key points. The long-side coordinate points are used as Bezier control points to obtain two Bezier curves according to formula (2); connecting the two curves end to end gives a fully closed curve frame, which is taken as the text box of the text instance. Finally, the image with the framed text is output, completing text detection on the banner image.
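Recovering the long-side points from key points and widths can be sketched as follows. This is an illustration under assumptions: the name `side_points` is hypothetical, at least two distinct key points are required, image coordinates are assumed (y grows downward), and for interior key points the direction of the center line is estimated from the previous and next key points, which is one possible reading of "two adjacent key points".

```python
import math

def side_points(centers, widths):
    """Offset each key point perpendicular to the center line by its width.
    Returns (upper, lower) lists of long-side points."""
    upper, lower = [], []
    last = len(centers) - 1
    for i, ((cx, cy), w) in enumerate(zip(centers, widths)):
        x0, y0 = centers[max(i - 1, 0)]
        x1, y1 = centers[min(i + 1, last)]
        norm = math.hypot(x1 - x0, y1 - y0)
        tx, ty = (x1 - x0) / norm, (y1 - y0) / norm  # unit tangent of center line
        nx, ny = ty, -tx                             # unit normal, pointing "up" in image coords
        upper.append((cx + w * nx, cy + w * ny))
        lower.append((cx - w * nx, cy - w * ny))
    return upper, lower
```

The two returned lists can then be used as the control points of the upper and lower Bezier curves.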
Step 3: train the banner text detection network model built in step 2 with the key point dataset obtained in step 1.
The key point dataset obtained in step 1 is split into a training set and a test set. The training set is fed to the banner text detection network model for iterative training, updating the model parameters to minimize the loss function; the accuracy of the model on the test set is recorded and the best model is saved. Training is divided into shape detection training and key point detection training, and the corresponding loss function $L$ is computed as follows:
$$ L = L_{shape} + \lambda L_{kp} \tag{5} $$
where $L_{shape}$ is the shape loss function, $L_{kp}$ is the key point loss function, and $\lambda$ is the weight factor of the loss function, which is set to a fixed value in this embodiment.
Because banner text has arbitrary shape and often a large aspect ratio, the CIoU loss function is used to define $L_{shape}$:
$$ L_{shape} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{6} $$
where $IoU$ denotes the intersection over union of the regressed text outline shape and the text box generated from the key point labels, and $b$ and $b^{gt}$ denote the center point coordinates of the regressed text outline shape and of the text box generated from the key point labels, respectively. The center point of the regressed text outline shape is the coordinate of the median key point, in clockwise order, among the key points of that shape; when the number of key points is even, the midpoint of the line joining the two middle key points is used. The center point of the text box generated from the key point labels is defined in the same way from the key points of the generated text box. $\rho$ denotes the Euclidean distance between the two center points, $c$ denotes the diagonal length of the smallest enclosing region that contains both the regressed text outline shape and the text box generated from the key point labels, $\alpha$ is an adjustment factor that balances the weight between overlap area and aspect-ratio similarity, and $v$ measures the aspect-ratio similarity.
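For intuition, the CIoU loss can be computed for two axis-aligned boxes as below. This is a simplified stand-in, not the patent's formulation: the description applies the loss to text outline shapes with key-point-based centers, whereas this sketch uses boxes given as `(x1, y1, x2, y2)` with geometric centers, and it uses the commonly published definitions of the aspect-ratio term and its weight, which the text above does not spell out. The name `ciou_loss` is hypothetical.

```python
import math

def ciou_loss(box, gt):
    """CIoU loss between two axis-aligned boxes (x1, y1, x2, y2) of positive area."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    # intersection over union
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # squared distance between the two centers
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2
    # aspect-ratio similarity term and its weight (standard CIoU definitions)
    v = 4 / math.pi ** 2 * (math.atan((gx2 - gx1) / (gy2 - gy1))
                            - math.atan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of 0, while two disjoint boxes of the same shape are penalized by `1 + rho2 / c2`, so the loss still provides a gradient signal when the shapes do not overlap.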
A key point consists of two parts, its coordinates and its width, so the key point loss function $L_{kp}$ is computed as follows:
$$ L_{kp} = L_{coord} + \mu L_{width} \tag{7} $$
where $L_{coord}$ is the key point coordinate loss function, $L_{width}$ is the key point width loss function, and $\mu$ is a weight factor, set to 0.2 in this embodiment.
During training, the number of negative samples for key point coordinates is far greater than the number of positive samples. To address this imbalance between positive and negative samples, a variant of the focal loss function is used as $L_{coord}$, namely:
$$ L_{coord} = -\frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \begin{cases} (1 - p_{cij})^{\gamma} \log(p_{cij}), & y_{cij} = 1 \\ (1 - y_{cij})^{\delta} \, (p_{cij})^{\gamma} \log(1 - p_{cij}), & \text{otherwise} \end{cases} \tag{8} $$
where $p_{cij}$ is the score at position $(i, j)$ of channel $c$ in the key point heat map produced by the regression module, $y_{cij}$ is the score at the same position in the ground-truth key point heat map obtained by applying a Gaussian function to the image with key point labels, $N$ is the number of text instances in the image, $C$ is the number of channels of the regressed key point heat map, $H$ and $W$ are the height and width of the regressed key point heat map, and $\gamma$ and $\delta$ are hyperparameters that control the contribution of each key point and are set to fixed values in this embodiment. The factor $(1 - y_{cij})$ reduces the penalty on points surrounding the key point coordinates.
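A penalty-reduced focal loss of the kind described above can be sketched in plain Python. The form follows the widely used heat-map focal loss from key point detectors and is an assumption about the exact variant intended here; the name `keypoint_focal_loss` is hypothetical, and the two exponents are left as required arguments because their values in this embodiment are not given in the text.

```python
import math

def keypoint_focal_loss(pred, target, n_instances, gamma, delta, eps=1e-12):
    """pred, target: heat maps as nested lists indexed [channel][row][col].
    target holds 1 at key points and Gaussian-decayed values around them."""
    total = 0.0
    for pred_ch, tgt_ch in zip(pred, target):
        for pred_row, tgt_row in zip(pred_ch, tgt_ch):
            for p, y in zip(pred_row, tgt_row):
                p = min(max(p, eps), 1 - eps)  # keep log() finite
                if y == 1:
                    # positive location: down-weight easy (confident) predictions
                    total += (1 - p) ** gamma * math.log(p)
                else:
                    # negative location: (1 - y) softens the penalty near a key point
                    total += (1 - y) ** delta * p ** gamma * math.log(1 - p)
    return -total / n_instances
```

A negative location right next to a key point (target close to 1) contributes much less than one far away, which is the penalty reduction the description refers to.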
Because the width generated at each key point is random, the $L_1$ loss function is used as $L_{width}$:
$$ L_{width} = \frac{1}{N} \sum_{k} \left| y_k - p_k \right| \tag{9} $$
where $N$ is the number of text instances in the image, $|\cdot|$ returns the absolute value of the number inside, $y_k$ denotes the coordinate point score of the ground-truth key point heat map obtained by applying a Gaussian function to the image with key point labels, and $p_k$ is the score of key point $k$ in the key point heat map produced by the regression module.
To speed up model convergence, coordinate points in non-text regions are ignored during key point coordinate regression, which reduces the number of negative samples.
After the banner text detection network model has been trained on the training set, the test set is fed to the model, the accuracy and speed of text detection are compared, and the best detection model is selected.
Step 4: use the trained banner text detection network model to detect the text in a banner image.
An image containing banner text is fed to the banner text detection network model trained in step 3; after feature extraction, feature fusion, regression, and text box generation, a banner text image with text boxes is obtained. The process is as follows. A banner text image is fed to the model, and the ResNet-50 + FPN feature extraction network produces four feature images of different scales. The four images are upsampled by different factors so that they share the same scale, and are then fused into one fused feature image. The fused feature image is upsampled by a factor of four to match the size of the original image, and activation mapping is applied to it to obtain the key point heat maps, from which the key point coordinates and widths are derived. Finally, two sets of long-side coordinate points are computed from the key point coordinates and widths and used as Bezier control points to generate two Bezier curves; the closed curve frame obtained by connecting the two curves end to end is taken as the text box, and the banner text image annotated with text boxes is output.
Embodiment 2
Based on the same inventive concept, the present invention also provides a banner text detection system based on Bezier curves and key points, comprising a processor and a memory. The memory stores program instructions, and the processor calls the program instructions in the memory to execute the banner text detection method based on Bezier curves and key points described above.
In a specific implementation, a person skilled in the art can use computer software technology to run the method proposed by the technical solution of the present invention as an automated process. System devices that implement the method, such as a computer-readable storage medium storing the corresponding computer program and a computer device running the corresponding computer program, also fall within the scope of protection of the present invention.
The specific embodiments described herein merely illustrate the spirit of the present invention. A person skilled in the technical field to which the present invention belongs may modify or supplement the described embodiments in various ways, or substitute them in similar ways, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202310714974.5A CN116453133B (en) | 2023-06-16 | 2023-06-16 | Banner Text Detection Method and System Based on Bezier Curve and Key Points | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN116453133A CN116453133A (en) | 2023-07-18 | 
| CN116453133B true CN116453133B (en) | 2023-09-05 | 
Family
ID=87132471
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202310714974.5A Active CN116453133B (en) | 2023-06-16 | 2023-06-16 | Banner Text Detection Method and System Based on Bezier Curve and Key Points | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN116453133B (en) | 
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN117237965A (en) * | 2023-10-27 | 2023-12-15 | 山东浪潮科学研究院有限公司 | Training method, device, equipment and storage medium for curved text detection model | 
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN108564639A (en) * | 2018-04-27 | 2018-09-21 | 广州视源电子科技股份有限公司 | Handwriting storage method and device, intelligent interaction equipment and readable storage medium | 
| CN111414915A (en) * | 2020-02-21 | 2020-07-14 | 华为技术有限公司 | Character recognition method and related equipment | 
| CN112183322A (en) * | 2020-09-27 | 2021-01-05 | 成都数之联科技有限公司 | Text detection and correction method for any shape | 
| CN113537187A (en) * | 2021-01-06 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Text recognition method, device, electronic device and readable storage medium | 
| CN114898379A (en) * | 2022-05-10 | 2022-08-12 | 度小满科技(北京)有限公司 | A method, device, device and storage medium for curved text recognition | 
| CN115731539A (en) * | 2022-11-16 | 2023-03-03 | 武汉电信实业有限责任公司 | Video banner text detection method and system | 
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US8457403B2 (en) * | 2011-05-19 | 2013-06-04 | Seiko Epson Corporation | Method of detecting and correcting digital images of books in the book spine area | 
| US10289924B2 (en) * | 2011-10-17 | 2019-05-14 | Sharp Laboratories Of America, Inc. | System and method for scanned document correction | 
| US9230514B1 (en) * | 2012-06-20 | 2016-01-05 | Amazon Technologies, Inc. | Simulating variances in human writing with digital typography | 
| US11651215B2 (en) * | 2019-12-03 | 2023-05-16 | Nvidia Corporation | Landmark detection using curve fitting for autonomous driving applications | 
Non-Patent Citations (1)
| Title | 
|---|
| Yuliang Liu et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9806-9815. * | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||