
CN110458158B - Text detection and identification method for assisting reading of blind people - Google Patents


Info

Publication number
CN110458158B
CN110458158B (application CN201910501311.9A)
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910501311.9A
Other languages
Chinese (zh)
Other versions
CN110458158A (en)
Inventor
毋超
郭璠
刘丽珏
马润洲
何汉东
刘嘉熙
康天硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Application filed by Central South University
Priority to CN201910501311.9A
Publication of CN110458158A
Application granted
Publication of CN110458158B

Classifications

    • G06F18/23213 — Pattern recognition; clustering; non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06V10/235 — Image preprocessing by selection of a specific region containing or referencing a pattern, based on user input or interaction
    • G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/28 — Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V10/30 — Noise filtering
    • G06V10/751 — Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V30/153 — Segmentation of character regions using recognition of characters or words
    • G09B21/006 — Teaching or communicating with blind persons using audible presentation of the information


Abstract

The invention discloses a text detection and recognition method for assisted reading by blind people. The method comprises the following steps. Step 1, scene detection: detect whether the image captured by the camera shows a finger placed on the reading text. Step 2, finger positioning: locate the fingertip, which then serves as the cursor for subsequent text detection. Step 3, text extraction: extract the text lines and the individual words within them. Step 4, word tracking: track the word boxes of correctly recognized words by template matching. The method runs quickly and works well: it accurately identifies the word the user's fingertip points to, is inexpensive, and is highly general, so it can be widely applied in smart products such as wearable assisted-reading rings for the blind.

Description

A Text Detection and Recognition Method for Assisted Reading by Blind People

Technical Field

The invention belongs to the application field of computer vision, and in particular relates to a text detection and recognition method for assisted reading by blind people.

Background Art

Today, 314 million people worldwide suffer from eye diseases. Of these, 269 million have low vision and 45 million are blind. China currently has 8.77 million visually impaired people, about 19.5% of the world's blind population and about 0.7% of China's total population. According to analyses by relevant authorities, the number of blind people in China will exceed 75 million within six years. Helping blind people overcome the difficulties of daily study and life, especially the most basic problem of reading, therefore has great research value and social significance, as well as broad application prospects.

Several assisted-reading products for the blind are already on the market, such as the Touch Reader, a reader worn on a finger. Its built-in scanner automatically scans and recognizes the text it passes over, then converts that text into raised and recessed Braille through a dot matrix. Because the dot matrix sits on the inner layer of the finger sleeve, the finger can sense its shape changes, allowing a blind user to read the Braille. Similarly, another touch-type reader for the blind reads ordinary text at its bottom and outputs Braille information through an internal pin array; raised pins then appear on the top panel, forming Braille that can be read by touch. The EyeRing captures the text in a book with an embedded miniature image scanner and, via a Braille dot display mounted on the back of the scanner near the finger, changes the dot patterns in real time so that blind users can read text with their fingers. However, these devices are difficult to use for anyone who has never learned Braille. Other products, such as the wearable OrCam device, consist of a small camera strapped to a pair of glasses and a processing system. It runs computer vision algorithms to parse what it sees and then tells blind or low-vision users, through bone-conduction audio, what is in front of them. But this product is expensive, and a blind user cannot necessarily aim the scanning glasses at the reading material correctly, so it is inconvenient to use. In short, the existing products either require the user to have learned Braille, or are expensive and inconvenient.

Among patents on reading methods for the blind, Qiu Hong et al. (publication No. CN108492682A) provide a reader that sends images captured by a camera to an image-recognition processor and feeds the recognition result back, as level signals, to a drive circuit that drives a Braille dot-matrix assembly to output the corresponding Braille characters. A ring-type reader for the blind invented by Wang Lu (publication No. CN106601081A) can recognize printed text in ordinary books through a camera mounted on the ring and convert it into Braille. The main problem with these patented methods is that they still require the user to know Braille. In addition, Li Chongzhou et al. (publication No. CN103077625A) propose an electronic reader for the blind and an assisted-reading method: paper text is first converted into electronic images by scanning or photographing, then recognized as an electronic text document by OCR, and finally converted into speech by TTS synthesis. However, this method reads out the text of the entire image at once, and fails to provide the convenient, personalized ability to read exactly where the user points.

Against this background, it is particularly important to develop a robust, accurate, low-cost method that automatically detects and recognizes the text pointed to by the finger of a blind or low-vision user.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a text detection and recognition method for assisted reading by blind people, so that blind or low-vision users can read ordinary books, solving their reading difficulties.

The technical scheme adopted by the present invention is as follows:

A text detection and recognition method for assisted reading by blind people, comprising the following steps:

Step 1: for the image sequence captured by the camera, judge whether the scene in the current image shows a finger placed on the reading text. If so, proceed to Step 2; otherwise skip this frame, take the next frame as the current image, and repeat the above judgment and processing.

Step 2: locate the user's fingertip in the current image.

Step 3: determine the text line indicated by the user according to the fingertip position.

Step 4: extract the words on the indicated text line and convert them to speech output.

Further, the method for judging in Step 1 whether the scene in the current image shows a finger placed on the reading text is as follows:

Step 11: capture in advance, with the camera, some typical images containing the user's finger and the text region it rests on, and save them in a database.

Step 12: take the current image and several images captured before it as sample images.

Step 13: normalize the RGB color space of the database images and the sample images.

Step 14: for each sample, compute the Euclidean distance between its normalized red-channel image and the normalized red-channel image of each database image, and take the minimum of these distances as that sample's matching score. Compute the mean μ_Im and variance σ_Im of all sample matching scores; if μ_Im + σ_Im < Th, the scene in the current image is judged to be a finger on the reading text, where Th is an empirical threshold.

Further, in Step 14, all images are first reduced to a set size before the Euclidean distances are computed.
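A minimal numpy sketch of the Step 14 check. The resize target, the nearest-neighbour downscaling, and the small stabiliser added to the channel sum are illustrative implementation choices, not specified by the patent:

```python
import numpy as np

def is_finger_on_text(samples, database, th, size=(32, 32)):
    """Step 14 decision: is the current scene a finger on reading text?

    samples, database: lists of HxWx3 RGB images; th: the empirical
    threshold Th. Images are reduced to `size` and compared on their
    normalised red channels.
    """
    def red_feature(img):
        img = img.astype(float)
        # Normalise the RGB colour space, keep the red channel.
        red = img[..., 0] / (img.sum(axis=2) + 1e-9)
        h, w = red.shape
        ys = np.linspace(0, h - 1, size[0]).astype(int)
        xs = np.linspace(0, w - 1, size[1]).astype(int)
        return red[np.ix_(ys, xs)].ravel()

    db = [red_feature(d) for d in database]
    # Matching score = minimum Euclidean distance to any database image.
    scores = np.array([min(np.linalg.norm(red_feature(s) - g) for g in db)
                       for s in samples])
    # Finger-on-text if mean + variance of the scores is below Th.
    return bool(scores.mean() + scores.var() < th)
```

A sample identical to a database image scores 0 and passes for any positive Th, while a colour-inverted sample scores far above a small Th.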

Further, Step 2 comprises the following steps:

Step 21: use K-means to find the candidate region of the user's fingertip.

First, filter the current image with a Gaussian filter.

Then generate three two-dimensional matrices from the three channels of the filtered image; each element of a matrix is the pixel value at the corresponding point of that channel's image.

For each two-dimensional matrix, sum all its columns and take the average to obtain a row×1 column vector m_c_ave; sum all its rows and take the average to obtain a 1×col row vector m_r_ave. The current image is thus converted into three column vectors and three row vectors, where col is the total number of columns of the matrix and row is its total number of rows.

Treat each dimension of the column vectors as a longitudinal data point: the components of the three column vectors in the same dimension form that point's three features, i.e. its feature vector, so the number of longitudinal data points equals the dimension of the column vectors, namely row. Likewise, treat each dimension of the row vectors as a transverse data point, whose three features are the components of the three row vectors in that dimension; the number of transverse data points equals the dimension of the row vectors, namely col.

Second, cluster the longitudinal and transverse data points separately with K-means, using two clusters in each case.

Third, represent the clustering result of the longitudinal data points as a longitudinal label vector, a row×1 column vector whose components are the labels (0 or 1) of the longitudinal data points of the corresponding dimensions; represent the clustering result of the transverse data points as a transverse label vector, a 1×col row vector whose components are the labels (0 or 1) of the transverse data points of the corresponding dimensions.

Apply mean filtering to the longitudinal and transverse label vectors, then threshold them: if an element is greater than or equal to the set threshold, set it to 1, otherwise set it to 0. This yields the final longitudinal and transverse label vectors.

Take as the top-left vertex the intersection of the horizontal line in the current image corresponding to the 0/1 boundary of the longitudinal label vector and the vertical line corresponding to the left 0/1 boundary of the transverse label vector, and delimit a rectangular region as the candidate region of the user's fingertip.
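The profile-clustering part of Step 21 can be sketched as below. For reproducibility this sketch initialises the two K-means centres deterministically from the extreme points, whereas the patent restarts twice from random centres and keeps the more compact result; the filter window `win` and threshold `thr` are illustrative values:

```python
import numpy as np

def kmeans2(points, iters=20):
    """Minimal 2-cluster K-means on an (n, 3) array; returns 0/1 labels."""
    pts = points.astype(float)
    # Deterministic init from the two extreme points (illustrative).
    centers = np.stack([pts[np.argmin(pts.sum(axis=1))],
                        pts[np.argmax(pts.sum(axis=1))]])
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = pts[labels == k].mean(axis=0)
    return labels

def fingertip_candidate_corner(img, win=5, thr=0.5):
    """Top-left vertex (row, col) of the fingertip candidate region.

    img: HxWx3 image. Per-row channel means form the longitudinal data
    points (m_c_ave), per-column means the transverse ones (m_r_ave);
    each set is 2-means clustered, the label vector is mean-filtered and
    thresholded, and the 0/1 boundary gives the corresponding image line.
    """
    img = img.astype(float)
    long_pts = img.mean(axis=1)   # shape (row, 3)
    trans_pts = img.mean(axis=0)  # shape (col, 3)

    def boundary(points):
        labels = kmeans2(points).astype(float)
        if labels[0] == 1:                      # fix arbitrary label order
            labels = 1.0 - labels
        smooth = np.convolve(labels, np.ones(win) / win, mode="same")
        lab = (smooth >= thr).astype(int)
        change = np.nonzero(np.diff(lab))[0]
        return int(change[0]) + 1 if len(change) else 0

    return boundary(long_pts), boundary(trans_pts)
```

On a synthetic image whose bottom-right quadrant is bright, the boundaries land at the quadrant's top-left corner.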

Step 22: locate the fingertip by computing curvature.

First, use the Canny operator to find edges within the fingertip candidate region and connect the edges into contours; if multiple contours are obtained, keep only those containing at least a set threshold number of pixels.

Then smooth the retained contours.

Finally, compute the curvature at every pixel of each smoothed contour; the point of zero curvature is taken as the fingertip position.
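The per-pixel curvature of Step 22 can be sketched with central differences on a closed contour, using the standard planar-curve formula k = (x'y'' − y'x'') / (x'² + y'²)^(3/2); the small stabiliser in the denominator is an implementation detail:

```python
import numpy as np

def contour_curvature(contour):
    """Discrete curvature at each point of a closed, ordered 2-D contour.

    contour: (n, 2) array of (x, y) points, assumed already smoothed.
    Central differences wrap around the contour via np.roll.
    """
    x = contour[:, 0].astype(float)
    y = contour[:, 1].astype(float)
    dx = (np.roll(x, -1) - np.roll(x, 1)) / 2.0
    dy = (np.roll(y, -1) - np.roll(y, 1)) / 2.0
    ddx = np.roll(x, -1) - 2.0 * x + np.roll(x, 1)
    ddy = np.roll(y, -1) - 2.0 * y + np.roll(y, 1)
    denom = (dx ** 2 + dy ** 2) ** 1.5 + 1e-12
    return (dx * ddy - dy * ddx) / denom
```

On a densely sampled circle of radius r the values come out close to 1/r, which is a quick sanity check before scanning the contour for the zero-curvature point described above.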

Further, when clustering the longitudinal/transverse data points with K-means, two sets of cluster centers are randomly initialized and clustering is run twice; the compactness of the two results is evaluated, and the more compact result is taken as the final clustering.

Further, Step 3 comprises the following steps:

Step 31: extraction of the text region.

First, convert the current image to grayscale.

Then binarize the grayscale image to obtain the foreground and background regions of the image.

Next, exclude the non-text regions from the background region, as follows:

First extract all connected regions in the background region to form a set C_R. For each connected region in C_R, find its rotated rectangle, denoted ζ(o, θ, w, h), where o is the rectangle's center, θ is its deflection angle (the angle swept counterclockwise from the horizontal axis to the first side of the rotated rectangle it meets), and w and h are the rectangle's two adjacent sides. Then filter out of C_R the connected regions whose rotated-rectangle area and deflection angle do not satisfy the constraints. The remaining connected regions are filtered further based on the relationships between text regions, as follows:

3.1) Take the top-left vertex of the current image as the origin O, the image's length direction as the y-axis (positive to the right) and its width direction as the x-axis (positive downward). Take the rotated-rectangle center of each connected region R as a point of interest, and express every line through a point of interest in the current image in the following form:

x·cos θ_SL + y·sin θ_SL = ρ_SL;

where θ_SL is the angle between the line and the x-axis and ρ_SL is the distance from the origin O to the line; θ_SL ranges over (-π/2, π/2) and ρ_SL over (-D, D), where D is the diagonal length of the original image captured by the camera.

3.2) Subdivide the (ρ_SL, θ_SL) parameter space into accumulator cells, and denote the value of the cell at coordinates (ρ_k, θ_k) by A(ρ_k, θ_k). First set all accumulator cells to zero, then compute the distance d from each point of interest (x_i, y_i) to the line x·cos θ_k + y·sin θ_k = ρ_k:

d = |x_i·cos θ_k + y_i·sin θ_k − ρ_k|;

Judge the distances of all points of interest to the line x·cos θ_k + y·sin θ_k = ρ_k in turn; each time a point's distance is below the threshold, increment A(ρ_k, θ_k) by 1, yielding the final value of A(ρ_k, θ_k). If A(ρ_k, θ_k) exceeds the threshold, the corresponding line is taken as a reference line; let N denote the number of reference lines obtained.
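The accumulator voting of step 3.2) can be sketched as an exhaustive scan over a (ρ, θ) grid. The grid resolutions, distance threshold, and vote threshold below are illustrative parameters, not values from the patent:

```python
import numpy as np

def reference_lines(points, n_theta=36, n_rho=40, d_th=2.0, vote_th=5):
    """Vote points of interest into a (rho, theta) accumulator.

    points: (n, 2) array of (x, y) rotated-rectangle centres.
    Returns the (rho, theta) cells whose vote count A(rho, theta)
    reaches vote_th, i.e. the reference lines of the method.
    """
    # Span of rho estimated from the point spread (stand-in for D).
    diag = np.linalg.norm(points.max(axis=0) - points.min(axis=0)) + 1.0
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, n_rho)
    lines = []
    for th in thetas:
        for rho in rhos:
            # d = |x cos(theta) + y sin(theta) - rho| for every point
            d = np.abs(points[:, 0] * np.cos(th)
                       + points[:, 1] * np.sin(th) - rho)
            if int((d < d_th).sum()) >= vote_th:
                lines.append((rho, th))
    return lines
```

Ten collinear points on the horizontal line y = 5 produce at least one reference cell near (ρ, θ) = (−5, −π/2) under this parameterisation.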

3.3) Find the text region by unsupervised line clustering, as follows:

3.31) Input the set of points of interest and initialize the baseline set C_L with N baselines, namely the N reference lines obtained in step 3.2).

3.32) Compute the distance from every point of interest to each baseline in C_L; for each point, take the minimum distance; keep the points whose minimum distance is below the set threshold; label the kept points by category, assigning points whose minimum distance corresponds to the same baseline to the same category.

3.33) Fit a line to the points of each category and judge whether the slope and intercept of the newly fitted line differ from those of the category's baseline by less than the set thresholds. If so, the category's baseline in C_L stays unchanged; otherwise, update it to the newly fitted line. If in this step all baselines in C_L remain unchanged, output the category-labeled point set C_S; the regions covered by the rotated rectangles of these points form the extracted text region. Otherwise, return to step 3.31).
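Steps 3.31)–3.33) amount to an assign-and-refit loop. A sketch under the simplifying assumption that lines are parameterised as y = a·x + b (the patent uses the (ρ, θ) form); the distance threshold and convergence tolerance are illustrative:

```python
import numpy as np

def fit_line(pts):
    """Least-squares fit y = a*x + b; returns (a, b)."""
    a, b = np.polyfit(pts[:, 0], pts[:, 1], 1)
    return float(a), float(b)

def line_cluster(points, init_lines, d_th=3.0, tol=1e-3, max_iter=50):
    """Unsupervised line clustering of points of interest.

    points: (n, 2) array; init_lines: list of (a, b) reference lines.
    Returns per-point labels (-1 = farther than d_th from every line)
    and the refined lines, iterating assign/refit until the slopes and
    intercepts stop changing.
    """
    lines = [tuple(l) for l in init_lines]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Point-to-line distance |a*x - y + b| / sqrt(a^2 + 1).
        d = np.stack([np.abs(a * points[:, 0] - points[:, 1] + b)
                      / np.hypot(a, 1.0) for a, b in lines])
        labels = d.argmin(axis=0)
        labels[d.min(axis=0) >= d_th] = -1
        changed = False
        for k, (a, b) in enumerate(lines):
            members = points[labels == k]
            if len(members) >= 2:
                a2, b2 = fit_line(members)
                if abs(a2 - a) > tol or abs(b2 - b) > tol:
                    lines[k] = (a2, b2)
                    changed = True
        if not changed:
            break
    return labels, lines
```

Two rows of points at y = 0 and y = 10 with slightly offset initial baselines converge to the two true lines, with every point assigned to the nearer one.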

Step 32: determine the text line.

According to the fingertip position obtained in Step 2, determine a rectangular region of interest, with the fingertip located on its bottom edge.

For the text region extracted in Step 31, find the outermost contour of each character; take the lowest point of each contour as a base point, and keep the base points located inside the region of interest.

From the kept base points, select three adjacent base points at a time and fit a line, obtaining multiple lines.

Score each fitted line with the following formula:

μ_score = (1/n) · Σ_{i=1}^{n} d(i)_distance;

where d(i)_distance is the distance from the i-th kept base point to the line, n is the total number of kept base points, and μ_score is the score.

Select the line with the lowest score as the judgment line; filter out those of the kept base points whose distance to the judgment line is less than the set threshold, then fit a line to all remaining base points; the fitted line is the text line indicated by the user.
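A sketch of the scoring rule, under the assumption that μ_score is the mean point-to-line distance over the kept base points and that lines are given in slope-intercept form for simplicity:

```python
import numpy as np

def line_score(points, a, b):
    """mu_score = (1/n) * sum_i d(i)_distance for the line y = a*x + b."""
    d = np.abs(a * points[:, 0] - points[:, 1] + b) / np.hypot(a, 1.0)
    return float(d.mean())

def judgment_line(points, candidates):
    """Return the lowest-scoring candidate (a, b) line."""
    return min(candidates, key=lambda ab: line_score(points, *ab))
```

For base points lying exactly on y = 2x + 1, the candidate (2, 1) scores zero and is selected over any other line.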

Further, the specific processing of Step 4 is as follows:

Step 41: recognize the first word.

For the text region extracted in Step 31, extract the part located in the region of interest as the target text region. Find the minimum bounding rectangle of each character in the target text region as a letter box. Based on the difference between the center-to-center distance of two adjacent letter boxes within a word and that of two adjacent letter boxes across words, cluster the letter boxes into words; take the minimum bounding rectangle of all letter boxes belonging to one word as the word box.
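The letter-box-to-word clustering of Step 41 can be sketched with a simple gap heuristic: gaps between adjacent box centers that are much larger than the typical gap are treated as word breaks. The `gap_ratio` parameter and the median-based cut are illustrative stand-ins for the intra-/inter-word distance contrast described above:

```python
import numpy as np

def group_letters_into_words(boxes, gap_ratio=1.8):
    """Group letter boxes on one text line into word boxes.

    boxes: list of axis-aligned (x, y, w, h) letter boxes.
    Adjacent-centre gaps larger than gap_ratio * median gap become word
    breaks. Returns one (x, y, w, h) bounding box per word, left to right.
    """
    boxes = sorted(boxes, key=lambda b: b[0])
    centers = [x + w / 2.0 for x, y, w, h in boxes]
    gaps = np.diff(centers)
    cut = gap_ratio * float(np.median(gaps)) if len(gaps) else 0.0
    words, current = [], [boxes[0]]
    for box, gap in zip(boxes[1:], gaps):
        if gap > cut:
            words.append(current)
            current = []
        current.append(box)
    words.append(current)

    def bbox(group):
        # Minimum bounding rectangle of all letter boxes in one word.
        x0 = min(b[0] for b in group); y0 = min(b[1] for b in group)
        x1 = max(b[0] + b[2] for b in group)
        y1 = max(b[1] + b[3] for b in group)
        return (x0, y0, x1 - x0, y1 - y0)
    return [bbox(g) for g in words]
```

Five letter boxes with one doubled gap in the middle split into two word boxes.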

According to the angle between the user-indicated text line and the horizontal direction in the image, apply angle compensation to the image so that the text line is rotated to horizontal.

Select the first word box along the text line and recognize it with OCR; the recognition returns the word, the word confidence, and the word box. When the returned word confidence exceeds the threshold, the word is considered correctly recognized, and the correctly recognized word is output as speech.

Step 42: track, by template matching, the position of each correctly recognized word's word box in subsequent image frames, so as to determine the new-word region in those frames and recognize new words there.

i. Binarize each frame of the image sequence.

Initialize s = 1 and the fingertip speed V_fingertip. Take the word box corresponding to the j-th word recognized in frame l, together with all the letter boxes it contains, as the word box and letter boxes to be tracked. Take the region of interest in frame m as the search region; initialize m = l + 1.

ii. Track the positions of the word boxes/letter boxes to be tracked in the search region by template matching.

If the current word box is tracked successfully, go to step iii. Otherwise, judge whether any word box was tracked successfully before. If so, first determine a cut-off line on frame m to the right of the match position of the most recently successfully tracked word box, coinciding with the right edge of that match position; then judge whether the distance from the cut-off line to the left edge of the image is less than the set threshold. If it is, take the target text region to the right of the cut-off line as the new-word region and continue recognizing new words there; otherwise, end the new-word recognition. If no word box was tracked successfully before, take frame m as the current image and apply Step 1 through Step 41 to it to recognize the first word.
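The box tracking in this stage can be sketched as a brute-force sum-of-squared-differences search over the binarised search region (a stand-in for a library routine such as OpenCV's matchTemplate); the failure threshold `max_ssd` is an illustrative parameter:

```python
import numpy as np

def match_template(search, template, max_ssd=1e-3):
    """Track a word/letter box by exhaustive SSD template matching.

    search, template: 2-D float arrays (the binarised search region and
    the box contents from frame l). Returns the (row, col) of the best
    match, or None when even the best SSD exceeds max_ssd * template.size
    (i.e. tracking failure).
    """
    th, tw = template.shape
    H, W = search.shape
    best, best_pos = np.inf, None
    for r in range(H - th + 1):
        for c in range(W - tw + 1):
            ssd = float(((search[r:r + th, c:c + tw] - template) ** 2).sum())
            if ssd < best:
                best, best_pos = ssd, (r, c)
    if best_pos is None or best > max_ssd * template.size:
        return None
    return best_pos
```

An exact copy of a box is found at its true position, while a template absent from the search region is reported as a tracking failure (None).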

iii. Update the fingertip speed V_fingertip:

V_fingertip = (Σ_{j=1}^{N1} V_word,j + Σ_{k=1}^{N2} V_letter,k) / (N1 + N2);

where V_word,j is the horizontal displacement of the j-th successfully tracked word box between frame l and frame m, V_letter,k is the horizontal displacement of the k-th successfully tracked letter box between frame l and frame m, N1 is the number of successfully matched word boxes, and N2 is the number of successfully matched letter boxes.

iv. Let s = s + 1;

For the word box corresponding to the s-th word recognized in frame l and each letter box it contains, use the fingertip speed to judge whether the box will move out of frame m: if the abscissa of the box's upper-left vertex in frame l minus Vfingertip × (m − l) is less than zero, the word box/letter box is judged to move out of frame m;

Take the word boxes/letter boxes judged not to move out of the current frame as the boxes to be tracked, and delineate a rectangular region as the new search region: its upper-left abscissa = the upper-left abscissa of the last tracked word box − the fingertip speed − a set offset; its length = the length of the last tracked word box + the fingertip speed; its width = the width of the last tracked word box + the set offset;

Compute the ratio of black pixels to white pixels in the new search region. If the ratio is less than the set threshold, discard the frame, let m = m + 1, and repeat the above judgment and processing until the ratio in the new search region is not less than the threshold, then go to step v; if the ratio is already not less than the threshold, go directly to step v;

v. Return to step ii.
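For illustration, the tracking-and-speed-update loop of steps ii–iii may be sketched in Python as follows. This is a sketch under assumptions, not the embodiment itself: `ssd_match`, `update_fingertip_speed`, and the synthetic frames are hypothetical names and data, and a sum-of-squared-differences score stands in for whatever template-matching criterion the implementation uses.

```python
import numpy as np

def ssd_match(search, templ):
    # Exhaustive sum-of-squared-differences template matching: slide the
    # template over the search region and return the (row, col) of the
    # lowest-cost position.
    H, W = search.shape
    h, w = templ.shape
    best, pos = np.inf, (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            cost = np.sum((search[r:r + h, c:c + w] - templ) ** 2)
            if cost < best:
                best, pos = cost, (r, c)
    return pos

def update_fingertip_speed(word_dx, letter_dx):
    # Step iii: average the horizontal displacements of the N1 matched
    # word boxes and the N2 matched letter boxes.
    return (sum(word_dx) + sum(letter_dx)) / (len(word_dx) + len(letter_dx))

rng = np.random.default_rng(1)
frame_l = rng.random((30, 80))                # grayscale frame l
templ = frame_l[10:18, 20:36].copy()          # a "word box" cut from frame l
frame_m = np.roll(frame_l, -5, axis=1)        # text shifted 5 px left in frame m
row, col = ssd_match(frame_m, templ)          # matched position in frame m
v_fingertip = update_fingertip_speed([20 - col], [])
```

On the synthetic shift above, the template cut at column 20 of frame l is recovered at column 15 of frame m, giving a fingertip speed of 5 px per frame.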

Beneficial effects:

The invention discloses a text detection and recognition method for assisting blind readers, comprising the following steps. Step 1: scene detection, which mainly checks whether the camera image shows a finger placed on reading text. Step 2: finger localization, which locates the fingertip and uses it as the cursor for subsequent text detection. Step 3: text extraction, which mainly covers extracting text lines and extracting the words within them. Step 4: word tracking, which tracks the word boxes of correctly recognized words by template matching. Combined with speech output, the method can be applied to blind-reading-assistance products, so that users can conveniently and quickly learn the text they are pointing at even without knowing Braille.

Most other blind-reading aids convert the captured text into Braille that blind users can perceive, which requires the user to learn Braille first. In contrast, this scheme directly reads out the word pointed at by the user's finger, so even users who do not know Braille can conveniently enjoy reading anytime and anywhere. Moreover, the method runs fast and works well: it accurately recognizes the word at the user's fingertip at low cost, is highly general, and can be widely applied to smart products such as wearable reading-assistance rings for the blind.

Description of the drawings

Fig. 1 is a schematic diagram of the overall implementation flow of the method of the present invention;

Fig. 2 shows the scene detection process of the method: Fig. 2(a) is a finger database image built in advance, Fig. 2(b) shows the normalization of the finger image, and Fig. 2(c) is a schematic diagram of template matching between the camera image and the finger database images;

Fig. 3 shows the fingertip detection process of the method: Fig. 3(a) is the original camera input, Fig. 3(b) its RGB channels, Fig. 3(c) the column vectors converted from the RGB channels, Fig. 3(d) the region where the fingertip may exist, Fig. 3(e) that region enlarged, and Fig. 3(f) the located fingertip position;

Fig. 4 is a schematic diagram of the rotated-rectangle definition used by the method;

Fig. 5 shows the text extraction process of the method: Fig. 5(a) is the grayscale image obtained from the color image at the fingertip position, Fig. 5(b) the binary image obtained by binarizing Fig. 5(a), Fig. 5(c) the result of filtering Fig. 5(b) by the area condition, and Fig. 5(d) the result of further filtering Fig. 5(c) by the angle condition;

Fig. 6 shows the text region extraction process of the method: Fig. 6(a) is the image after angle filtering, Fig. 6(b) the coordinate representation of all ζ(o), Fig. 6(c) a three-dimensional view of the (ρSL, θSL) matrix with the matrix values on the z-axis, Fig. 6(d) the initial lines obtained, and Fig. 6(e) the text region obtained by line clustering;

Fig. 7 shows the text line determination process of the method: Fig. 7(a) is the text region and fingertip position obtained by line clustering, Fig. 7(b) a schematic diagram of the key region determined by the fingertip position, and Fig. 7(c) the extracted region of interest with the text lines and abnormal lines marked on it.

Fig. 8 shows the word recognition process of the method: Fig. 8(a) is a schematic diagram of determining the word box, Fig. 8(b) shows determining the rotation angle from the text line, and Fig. 8(c) the result after angle compensation of the image;

Fig. 9 is a schematic diagram of word tracking in the method: Fig. 9(a) shows tracking during reading, and Fig. 9(b) is a schematic diagram of blur-block detection.

Fig. 10 shows actual results of the method at run time.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawings:

This embodiment targets the specific application of reading assistance for the blind. Text in the captured image is detected and recognized by the following steps; the overall flow is shown in Fig. 1. As the figure shows, the detailed implementation mainly comprises the following steps:

Step 1: Use scene detection to judge whether the scene in the current image is a finger placed on reading text. If so, go to step 2; otherwise do not perform the subsequent steps. The specific processing is as follows:

Step 11: Capture in advance some typical images containing the user's finger and the surrounding text region with the camera, and store them in a database, see Fig. 2(a);

Step 12: Take the current image and the 19 images captured before it (i.e., the 20 most recent consecutive camera images) as samples;

Step 13: Color normalization. As shown in Fig. 2(b), normalize the RGB color space of the database images and the sample images to reduce the influence of illumination and shadow:

(R, G, B) = ( r/(r+g+b), g/(r+g+b), b/(r+g+b) );

where (r, g, b) is the pixel value of a point in the original image and (R, G, B) is the pixel value of that point after normalization;

Since the finger color is skin-toned, the red channel of each normalized image is extracted for the subsequent image matching;

Step 14: Image matching. To reduce matching time, all images are first resized to 50×50 pixels. For each sample, compute the Euclidean distance between its red-channel image and the red-channel image of every database image, and take the minimum of these distances as the sample's matching score. Compute the mean μIm and variance σIm of the 20 samples' matching scores; in this embodiment they are 35.08 and 10.01 respectively. If μIm + σIm < Th, the scene in the current image is judged to be a finger placed on reading text and the subsequent fingertip localization can proceed, where Th is a threshold, set to 150 in this embodiment.
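For illustration, steps 13–14 may be sketched in Python as follows. This is a sketch under assumptions: the chromaticity form of the normalization, the helper names, and the synthetic database/sample images are assumptions of the sketch, not part of the embodiment.

```python
import numpy as np

def normalize_rgb(img):
    # Step 13: per-pixel colour normalization to reduce the influence of
    # illumination and shadow (chromaticity form assumed here).
    s = img.sum(axis=2, keepdims=True).astype(np.float64)
    s[s == 0] = 1.0
    return img / s

def match_score(sample_red, db_reds):
    # Step 14: minimum Euclidean distance between the sample's red
    # channel and every database red channel (all resized to 50x50).
    return min(float(np.linalg.norm(sample_red - d)) for d in db_reds)

rng = np.random.default_rng(0)
db_reds = [normalize_rgb(rng.integers(0, 256, (50, 50, 3)))[:, :, 0] for _ in range(3)]
samples = [normalize_rgb(rng.integers(0, 256, (50, 50, 3)))[:, :, 0] for _ in range(20)]
scores = np.array([match_score(s, db_reds) for s in samples])
mu_im, sigma_im = scores.mean(), scores.std()
finger_scene = (mu_im + sigma_im) < 150      # Th = 150 in the embodiment
```

Because the normalized red channels lie in [0, 1], the per-sample scores are bounded by 50, so the synthetic data always passes the Th = 150 test; with real images the mean-plus-deviation check separates finger scenes from others.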

Step 2: Locate the user's fingertip in the current image, as shown in Fig. 3, to serve as the cursor for subsequent text detection. The specific processing is as follows:

Step 21: Use K-means to find the candidate region of the user's fingertip;

First, filter the current image (Fig. 3(a)) with a Gaussian filter to reduce the interference of outliers and improve clustering accuracy. The Gaussian kernel matrix H used for filtering has size (2kG+1)×(2kG+1) and is computed as:

H(i,j) = (1/(2π·σG^2)) · exp( −((i−kG−1)^2 + (j−kG−1)^2) / (2σG^2) );

where H(i,j) is the element in row i, column j of H, with i, j = 1, 2, …, 2kG+1; σG is the width parameter of the Gaussian kernel function, controlling its radial extent, set to 3 in this embodiment; kG controls the kernel size and is set to 15 in this embodiment;

Then, generate three two-dimensional matrices from the three channels of the filtered RGB image; each matrix element is the pixel value of the corresponding point in that channel;

For each two-dimensional matrix, compute one column vector and one row vector according to:

mc_ave = (1/col) · Σj=1..col M(:,j);

mr_ave = (1/row) · Σi=1..row M(i,:);

where mc_ave is the column vector obtained by summing all columns of the matrix and averaging; M(:,j) is the j-th column and col the total number of columns; mr_ave is the row vector obtained by summing all rows and averaging; M(i,:) is the i-th row and row the total number of rows;

The current image is thus converted into three row×1 column vectors (Fig. 3(b), (c)) and three 1×col row vectors. Each dimension of the column vectors is treated as a longitudinal data point, whose three features are the same-dimension components of the three column vectors; the number of longitudinal data points equals the column-vector dimension, i.e., row. Likewise, each dimension of the row vectors is treated as a transverse data point whose three features are the same-dimension components of the three row vectors; the number of transverse data points equals the row-vector dimension, i.e., col;

Next, cluster the longitudinal and transverse data points separately with K-means. To avoid a local optimum, the clustering is run twice with random initial centers; the compactness of the two results is evaluated and the more compact result is taken as the final clustering.

The compactness of each clustering result is evaluated by:

μscore = Σj=1..K Σi=1..N ||xi(j) − mj||^2;

where K is the number of clusters (here 2), N the number of samples in a class, xi(j) the feature vector of the i-th data point of class j, and mj the center of class j. μscore is the sum of squared distances from each point to its cluster center and reflects the compactness of the clustering: the larger μscore, the worse the compactness. The result with the lower μscore is taken as the final clustering.

The clustering result of the longitudinal data points is expressed as a longitudinal label vector, a row×1 column vector whose components are the labels (0 or 1) of the corresponding longitudinal data points. Since data points related to the finger tend to fall in one class, the 0s and 1s in the result are each clustered together, and since the finger usually appears in the middle-lower part of the picture, the upper part of the label vector is essentially all 0 and the lower part essentially all 1. To remove the labels of abnormal data points, the label vector is filtered with a one-dimensional mean filter: a template is placed over each target element (the template size is usually odd; with size 5 it covers the two neighbors on each side of the target, excluding the target itself), and the target is replaced by the average of all the data in the template. The filtered label vector is then thresholded: a component is set to 1 if it is not less than the set threshold and to 0 otherwise (the threshold is set empirically, 0.4 in this embodiment), giving the final label vector. The horizontal line in the current image corresponding to the boundary between the 0s and 1s in the label vector is the vertical cut-off position of the user's finger in the image;

The clustering result of the transverse data points is expressed as a transverse label vector, a 1×col row vector whose components are the labels (0 or 1) of the corresponding transverse data points. Since the finger usually appears in the middle of the image, the elements of this label vector are 1 in the middle and 0 on both sides; the vertical line in the current image corresponding to the left boundary between the 0s and 1s is the horizontal cut-off position of the user's finger;

Extensive experiments show that the intersection of the vertical and horizontal cut-off positions lies at the upper-left of the finger; because the label vectors have been mean-filtered, the left boundary between 0 and 1 shifts toward the 0 side, placing the intersection at the upper-left of the finger. Taking the intersection as the upper-left vertex, a sufficiently large rectangular region can then be delineated (its size depends on the input image; here the input is 480×640 and the rectangle is set to length 320 and width 160) as the candidate region where the user's fingertip may lie, as shown in Fig. 3(d), (e).
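For illustration, the projection-then-cluster idea of step 21 may be sketched in Python as follows. This is a sketch under assumptions: it uses a tiny 2-class K-means with deterministic initial centers (darkest and brightest point) instead of the embodiment's double random initialization with the compactness check, and a synthetic image with a bright lower-middle block standing in for the finger.

```python
import numpy as np

def axis_profiles(img):
    # Collapse an H x W x 3 image into per-row and per-column mean colour
    # vectors: one 3-feature "longitudinal" point per row and one
    # "transverse" point per column.
    return img.mean(axis=1), img.mean(axis=0)

def kmeans2(points, iters=20):
    # Minimal 2-class K-means; deterministic initial centers (darkest and
    # brightest points) replace the random double initialization here.
    norms = np.linalg.norm(points, axis=1)
    centers = points[[np.argmin(norms), np.argmax(norms)]].astype(np.float64).copy()
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

img = np.zeros((40, 60, 3))
img[25:, 20:40] = 200.0                      # bright block: the "finger"
col_labels = kmeans2(axis_profiles(img)[0])  # one label per image row
row_labels = kmeans2(axis_profiles(img)[1])  # one label per image column
```

On the synthetic image the row labels split into an all-0 upper part and an all-1 lower part, and the column labels are 1 only in the middle, matching the label-vector shapes described above.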

Step 22: Locate the fingertip by computing curvature;

First, use the Canny operator to find edges in the fingertip candidate region and connect them into contours. If several contours are obtained, remove those containing fewer pixels than a set contour-size threshold (experiments show that 100 works best) to eliminate isolated points, and keep the contours containing at least that many pixels;

From the parametric equation of a curve:

Γ(t) = (x(t), y(t));

where t is the parameter, x(t) and y(t) are the coordinate functions of t, and Γ(t) is the curve, the curvature of the curve is:

K(t) = ( x′(t)·y″(t) − y′(t)·x″(t) ) / ( x′(t)^2 + y′(t)^2 )^(3/2);

where x′(t) and y′(t) are the first derivatives of x(t) and y(t), and x″(t) and y″(t) their second derivatives.

In the present invention the contour is a set of pixels, so the curvature at each contour pixel is computed as follows.

First, to reduce the influence of noise on the curvature measurement, the curve is smoothed. A one-dimensional Gaussian kernel is generated from the one-dimensional Gaussian function; its size M is defined as the smaller odd number closest to 10σ, where σ is the width parameter of the Gaussian controlling its radial extent; in the experiments σ = 3, so M = 31. Concretely, the coordinates of each contour pixel are convolved with the one-dimensional Gaussian kernel, defined as:

X(n,σ) = Σk=−L..L X(npoint)·g(k,σ);

Y(n,σ) = Σk=−L..L Y(npoint)·g(k,σ);

where L = (M−1)/2 = 15; X(npoint), Y(npoint) are the coordinates of the npoint-th contour pixel (coordinate system: origin O at the upper-left vertex of the image, horizontal y-axis positive to the right, vertical x-axis positive downward); g(k,σ) is the k-th weight of the one-dimensional Gaussian kernel; X(n,σ) and Y(n,σ) are the smoothed coordinates of the n-th contour pixel; and npoint is determined by:

npoint = n−k+nsize, if n−k < 1;  npoint = n−k, if 1 ≤ n−k ≤ nsize;  npoint = n−k−nsize, if n−k > nsize;

where nsize is the number of pixels in the contour and n, npoint = 1, 2, …, nsize. From the properties of convolution, the first and second derivatives of X(n,σ) and Y(n,σ) can be computed as:

X′(n,σ) = Σk=−L..L X(npoint)·g′(k,σ);

Y′(n,σ) = Σk=−L..L Y(npoint)·g′(k,σ);

X″(n,σ) = Σk=−L..L X(npoint)·g″(k,σ);

Y″(n,σ) = Σk=−L..L Y(npoint)·g″(k,σ);

where g′(k,σ) and g″(k,σ) are the k-th weights of the one-dimensional kernels generated from the first and second derivatives of the one-dimensional Gaussian function;

The curvature of the contour at the n-th pixel is then:

K(n,σ) = ( X′(n,σ)·Y″(n,σ) − Y′(n,σ)·X″(n,σ) ) / ( X′(n,σ)^2 + Y′(n,σ)^2 )^(3/2);

Therefore, for each retained contour, the curvature at every pixel can be computed; the result is shown in Fig. 3(f), where the fingertip corresponds to the point of zero curvature, which yields the position of the user's fingertip.
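For illustration, the smoothed-derivative curvature computation of step 22 may be sketched in Python as follows. This is a sketch under assumptions: a synthetic circular contour is used (for which the curvature should come out as the constant 1/r), and a circular convolution plays the role of the wrap-around indexing of npoint.

```python
import numpy as np

def gauss_kernels(sigma=3.0, M=31):
    # 1-D Gaussian kernel of size M (the odd number near 10*sigma, so
    # M = 31 for sigma = 3) and its first and second derivatives.
    L = (M - 1) // 2
    k = np.arange(-L, L + 1, dtype=np.float64)
    g = np.exp(-k**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dg = -k / sigma**2 * g
    ddg = (k**2 / sigma**4 - 1.0 / sigma**2) * g
    return g, dg, ddg

def circ_conv(s, ker):
    # Circular convolution sum_k s(n-k)*ker(k), k = -L..L, with
    # wrap-around indexing (the role of n_point in the text).
    L = (len(ker) - 1) // 2
    out = np.empty(len(s))
    for n in range(len(s)):
        idx = (n - np.arange(-L, L + 1)) % len(s)
        out[n] = np.dot(s[idx], ker)
    return out

def contour_curvature(x, y, sigma=3.0):
    # K(n) = (X'Y'' - Y'X'') / (X'^2 + Y'^2)^(3/2), with the derivatives
    # obtained by convolving the coordinates with derivative-of-Gaussian
    # kernels.
    g, dg, ddg = gauss_kernels(sigma)
    xd, yd = circ_conv(x, dg), circ_conv(y, dg)
    xdd, ydd = circ_conv(x, ddg), circ_conv(y, ddg)
    return (xd * ydd - yd * xdd) / (xd**2 + yd**2) ** 1.5

t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
kappa = contour_curvature(60.0 * np.cos(t), 60.0 * np.sin(t))
```

On the 400-point circle of radius 60 the computed curvature is nearly constant at 1/60, confirming that the derivative kernels are scaled consistently; on a real fingertip contour the same routine produces the per-pixel curvature profile used to locate the tip.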

Step 3: Determine the text line indicated by the user from the fingertip position. The specific processing is as follows:

Step 31: Text region extraction;

First convert the color camera image to a grayscale image (Fig. 5(a)), then binarize it with Otsu's adaptive threshold method (Fig. 5(b)) to obtain the foreground region (finger) and background region (text document). Since the background region may contain non-text areas, the constraints must be further strengthened to exclude them. To this end, first extract all connected regions in the background to form a set CR; then compute, for each connected region in CR, its rotated rectangle (minimum bounding rectangle), denoted ζ(o, θ, w, h), where o is the center of the rotated rectangle; θ is its rotation angle, defined as the angle between the horizontal axis (x-axis), rotating counterclockwise, and the first side of the rectangle it meets, with range (−π/2, 0); and w and h are its two adjacent sides, as shown in Fig. 4. The non-text regions in CR are then filtered out by the following constraints.

1) Area filtering. Extensive experiments show that the area of a text region lies within a certain range, i.e., it satisfies:

tmin < ζArea < tmax;

where ζArea = wh, and tmin and tmax are the lower and upper area thresholds of the text region, determined by measuring text-region areas over many experiments; in this embodiment they are set to 100 and 1500 respectively;

whereas the area of a non-text region is of random size, so a first pass of non-text filtering can be performed by area. That is, for each connected region in CR, if its rotated rectangle does not satisfy tmin < ζArea < tmax, it is considered a non-text region and deleted from CR; the remaining connected regions form the set CR1. The result is shown in Fig. 5(c);

2) Angle filtering. Since the scenes of interest to the present invention are plain text and text with illustrations (e.g., books) rather than text embedded in complex images (e.g., posters), the true text regions in CR1 greatly outnumber the non-text regions. Note also that among the 26 English letters, with the exception of o (whose rotated rectangle always has zero rotation), the rotation angles of the letters' rotated rectangles differ little; that is, the rotation angle θ of most text regions satisfies the constraint:

|θ − μθ| < σθ;

where μθ and σθ are the mean and variance of the rotation angles of the rotated rectangles of all connected regions in the set after the processing of step 1).

Therefore, for each connected region in CR1, if its rotated rectangle does not satisfy |θ − μθ| < σθ, it is considered a non-text region and deleted from CR1; the remaining connected regions form the set CR2. Although this constraint may remove text regions at the image edges, the text detection and recognition mainly targets the central region of the image, so this does not greatly affect the final recognition accuracy. The result is shown in Fig. 5(d).
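For illustration, the area and angle constraints of 1) and 2) may be sketched in Python as follows. This is a sketch under assumptions: the dictionary layout of the rotated rectangles and the hand-made sample values are assumptions, and the standard deviation is used as the angle bound.

```python
import numpy as np

def filter_regions(rects, t_min=100, t_max=1500):
    # 1) Keep rotated rectangles whose area w*h lies in (t_min, t_max);
    # 2) of those, keep the ones whose angle theta is within one standard
    #    deviation of the survivors' mean angle.
    by_area = [r for r in rects if t_min < r["w"] * r["h"] < t_max]
    angles = np.array([r["theta"] for r in by_area])
    mu, sigma = angles.mean(), angles.std()
    return [r for r in by_area if abs(r["theta"] - mu) < sigma]

rects = (
    [{"w": 20, "h": 30, "theta": -0.10 + 0.01 * i} for i in range(8)]  # letters
    + [{"w": 5, "h": 5, "theta": -0.10}]     # speck: area below t_min
    + [{"w": 60, "h": 60, "theta": -0.10}]   # illustration: area above t_max
    + [{"w": 20, "h": 30, "theta": -1.40}]   # stray stroke: outlier angle
)
kept = filter_regions(rects)
```

On this sample the speck and the illustration fail the area constraint, the stray stroke fails the angle constraint, and the eight letter-like rectangles survive both passes.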

3) Filtering based on the relations between text regions. In the scenes of interest, text regions do not appear in isolation but form text lines, so the rotated-rectangle centers of all text regions belonging to the same text line have a linear relationship. A reference line can therefore be fitted through the rotated-rectangle centers of the text regions of one line, and text regions can finally be distinguished from non-text regions by the difference in their rotated-rectangle centers' distances to the reference line. The key question thus becomes how to determine which text regions in the image belong to the same text line; once this is determined, the reference line is obtained by fitting their rotated-rectangle centers. The implementation comprises the following steps:

3.1) Take the rotated-rectangle center of each connected region R in the set CR2 as a point of interest. For each line in the image passing through a point of interest, rewrite its equation y = kx + b in the form:

ρSL = x·cosθSL + y·sinθSL;

where θSL is the angle between the line and the x-axis, and ρSL is the distance from the origin O to the line (O is the upper-left vertex of the image, the horizontal axis is the y-axis, positive to the right, and the vertical axis is the x-axis, positive downward). If the line parameters (ρSL, θSL) are treated as unknowns, the line corresponds to a sinusoid in the (ρSL, θSL) parameter space. Here θSL ranges over (−π/2, π/2) (counterclockwise rotation positive, clockwise negative) and ρSL over (−D, D), where D is the diagonal length of the original camera image;

3.2) Subdivide the (ρ_SL, θ_SL) parameter space into multiple accumulator cells, and denote the value of the accumulator cell at coordinates (ρ_k, θ_k) by A(ρ_k, θ_k); in this embodiment, θ_k takes integer degrees in (-90, 90). First set all accumulator cells to zero, then compute the distance d from each point of interest (x_i, y_i) to the line xcosθ_k + ysinθ_k = ρ_k:

d = |x_i cosθ_k + y_i sinθ_k − ρ_k|;

The distances from all points of interest to the line xcosθ_k + ysinθ_k = ρ_k are examined in turn; each time a point's distance is below the threshold, A(ρ_k, θ_k) is incremented by 1. After all points have been examined, the final A(ρ_k, θ_k) values are obtained; the final result of this experiment is shown in Figure 6(c). The final A(ρ_k, θ_k) value equals the number of points of interest contained in the strip-shaped region centered on the line xcosθ_k + ysinθ_k = ρ_k, so the higher A(ρ_k, θ_k) is, the more likely that line is a reference line. A threshold is therefore set (7 in this embodiment); whenever A(ρ_k, θ_k) exceeds it, the corresponding line is taken as a reference line. Let N denote the number of reference lines obtained;
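The voting of steps 3.1)–3.2) can be sketched as follows. This is an illustrative Python sketch, not the embodiment's code; the distance threshold and the grid step are placeholders, and only the 7-vote threshold comes from the embodiment:

```python
import math

def vote_accumulator(points, rho_max, d_thresh=3.0, a_thresh=7):
    """Vote points of interest into the (rho, theta) accumulator.

    points   : list of (x, y) rotated-rectangle centers
    rho_max  : diagonal length D of the original image
    d_thresh : max point-to-line distance to count as a vote
    a_thresh : A(rho_k, theta_k) must exceed this to be a reference line
    Returns the reference lines as (rho_k, theta_k_degrees) pairs.
    """
    ref_lines = []
    for theta_deg in range(-89, 90):           # integer degrees in (-90, 90)
        theta = math.radians(theta_deg)
        for rho in range(-int(rho_max), int(rho_max) + 1):
            # d = |x*cos(theta_k) + y*sin(theta_k) - rho_k|
            votes = sum(
                1 for x, y in points
                if abs(x * math.cos(theta) + y * math.sin(theta) - rho) < d_thresh
            )
            if votes > a_thresh:               # A(rho_k, theta_k) above threshold
                ref_lines.append((rho, theta_deg))
    return ref_lines
```

Eight collinear centers, for example, are enough to promote their supporting line to a reference line under the 7-vote threshold.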

3.3) Find the most likely text regions by unsupervised line clustering. The specific procedure is:

3.31) Input the set of points of interest and initialize the reference-line set C_L, which contains N reference lines, namely the N reference lines obtained in step 3.2);

3.32) Compute the distance from every point of interest to each reference line in C_L; for each point, take the minimum of these distances; keep only the points whose minimum distance is below the set threshold; label the kept points so that points whose minimum distance corresponds to the same reference line are assigned to the same class;

3.33) Fit a straight line to the points of each class, and judge whether the differences in slope and intercept between the newly fitted line and the class's current reference line are below the set threshold. If so, the class's reference line in C_L is left unchanged; otherwise it is replaced by the newly fitted line. If in this step all reference lines in C_L remain unchanged, output the labeled point set C_S; the regions covered by the rotated rectangles of these points are the most likely text regions. Otherwise return to step 3.31). The final reference lines obtained by clustering are shown in Figure 6(d), and the resulting point set in Figure 6(e).
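The iteration of steps 3.31)–3.33) can be sketched as below, a minimal Python illustration that keeps lines in slope–intercept form and assigns each point to its nearest reference line; helper names such as `fit_line` are ours, not the patent's, and the thresholds are placeholders:

```python
import math

def point_line_dist(p, line):
    # distance from point p to line y = k*x + b
    k, b = line
    return abs(k * p[0] - p[1] + b) / math.sqrt(k * k + 1)

def fit_line(pts):
    # ordinary least squares for y = k*x + b
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    k = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return k, (sy - k * sx) / n

def cluster_lines(points, ref_lines, dist_thresh=5.0, tol=1e-3, max_iter=50):
    """Unsupervised line clustering: assign points to nearest reference line,
    refit each class, stop when no reference line changes (step 3.33)."""
    lines = list(ref_lines)
    for _ in range(max_iter):
        clusters = {i: [] for i in range(len(lines))}
        labels = {}
        for p in points:                       # step 3.32)
            dists = [point_line_dist(p, ln) for ln in lines]
            i = min(range(len(lines)), key=lambda j: dists[j])
            if dists[i] < dist_thresh:         # drop far-away points
                clusters[i].append(p)
                labels[p] = i
        changed = False
        for i, pts in clusters.items():        # step 3.33)
            if len(pts) < 2:
                continue
            k, b = fit_line(pts)
            if abs(k - lines[i][0]) > tol or abs(b - lines[i][1]) > tol:
                lines[i] = (k, b)
                changed = True
        if not changed:
            return labels, lines
    return labels, lines
```

Outliers whose minimum distance exceeds the threshold receive no label, which is how non-text regions drop out of the fit.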

Step 32: determine the text line;

Based on the fingertip position obtained in step 2, determine a rectangular region of interest: its length equals the length of the original image, its width is set to a fixed value (one sixth of the width of the original image), and the fingertip lies on the bottom edge of the region of interest;

For the text regions obtained in step 31, as shown in Figure 8(a), extract the outermost contour of each character, with adjacent contour points taken as 8-connected. For most English letters, the bottommost points of the letters lie almost on one straight line, even when the text is rotated. The bottommost point of each contour is therefore chosen as a base point; base points that do not lie inside the region of interest are filtered out, as shown in Figure 7(b), and only the base points inside the region of interest are used subsequently. Note that some letters, such as g, q, y, j and p, have base points below the ideal text line; these are treated as abnormal base points. Three adjacent base points are selected at a time and a straight line is fitted by minimizing the sum of the distances from these three points to the line, yielding multiple lines. To eliminate the lines fitted from abnormal base points, every fitted line is scored with the following formula:

μ_score = (Σ_{i=1}^{n} d(i)_distance) / n

where d(i)_distance is the distance from the i-th filtered base point to the line, n is the total number of filtered base points, and μ_score is the score; the lower the score, the better the result. The line with the lowest score is selected as the decision line; the abnormal base points are then filtered out according to whether each filtered base point's distance to the decision line is below the set threshold. Finally, a straight line is fitted to all remaining base points by minimizing the sum of their distances to the line; the fitted line is the text line indicated by the user, as shown in Figure 7(c).
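The triple-point fitting and scoring above can be sketched as follows. This is an illustrative Python version in which `fit_line` uses ordinary least squares on vertical offsets as a stand-in for minimizing the distance sum, and the outlier distance threshold is a placeholder:

```python
import math

def fit_line(pts):
    # least-squares line y = k*x + b
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    k = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return k, (sy - k * sx) / n

def score(line, pts):
    # mu_score = (sum of d(i)_distance) / n : lower is better
    k, b = line
    return sum(abs(k * x - y + b) / math.sqrt(k * k + 1) for x, y in pts) / len(pts)

def decision_line(base_points, dist_thresh=4.0):
    """Fit a line to every three adjacent base points, keep the lowest-scoring
    one as the decision line, drop outlier base points, then refit."""
    candidates = [fit_line(base_points[i:i + 3])
                  for i in range(len(base_points) - 2)]
    k, b = min(candidates, key=lambda ln: score(ln, base_points))
    inliers = [(x, y) for x, y in base_points
               if abs(k * x - y + b) / math.sqrt(k * k + 1) < dist_thresh]
    return fit_line(inliers)
```

With a descender (a "g"-like base point) among otherwise collinear base points, the refit recovers the clean baseline.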

Step 4: extract the words on the text line indicated by the user and convert them into speech output. The specific processing is as follows:

Step 41: recognize the first word;

For the text regions extracted in step 31, take the part lying inside the region of interest as the target text region. Compute the minimum bounding rectangle (rotated rectangle) of each character in the target text region as a letter box. Based on the difference between the distance of the centers of two adjacent letter boxes within a word and the distance of the centers of two adjacent letter boxes between words, cluster the letter boxes into words: moving along the text line, if the distance between the current letter box and the next is below a threshold (set to 20 pixels in this embodiment), the two boxes are taken to belong to the same word. This is repeated until every letter box on the text line has been examined, and the minimum bounding rectangle of all letter boxes belonging to one word is taken as the word box;
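The letter-box clustering can be sketched as follows, assuming axis-aligned boxes already sorted left to right along the text line (a simplification of the rotated rectangles in the text); only the 20-pixel gap threshold is the embodiment's value:

```python
def group_words(letter_boxes, gap_thresh=20):
    """Cluster letter boxes (x, y, w, h) into word boxes along the line."""
    def center(b):
        x, y, w, h = b
        return (x + w / 2.0, y + h / 2.0)

    def bound(boxes):
        # minimum bounding rectangle of all letter boxes of one word
        x0 = min(b[0] for b in boxes); y0 = min(b[1] for b in boxes)
        x1 = max(b[0] + b[2] for b in boxes); y1 = max(b[1] + b[3] for b in boxes)
        return (x0, y0, x1 - x0, y1 - y0)

    words, current = [], [letter_boxes[0]]
    for prev, nxt in zip(letter_boxes, letter_boxes[1:]):
        (cx0, cy0), (cx1, cy1) = center(prev), center(nxt)
        if ((cx1 - cx0) ** 2 + (cy1 - cy0) ** 2) ** 0.5 < gap_thresh:
            current.append(nxt)           # intra-word gap: same word
        else:
            words.append(bound(current))  # inter-word gap: close the word
            current = [nxt]
    words.append(bound(current))
    return words
```

Two letters 12 pixels apart merge into one word box, while a 38-pixel gap starts a new word.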

Since the input image is often not aligned with the horizontal direction, and letter rotation has a considerable effect on the final recognition accuracy, the image is angle-compensated according to the angle between the text line and the horizontal direction (determined from the text-line slope), so that the text line is rotated to horizontal before subsequent processing, as shown in Figures 8(b) and 8(c).

Select the first word box along the text line and recognize it with the Tesseract OCR recognition engine; the returned result includes the word, the word confidence and the word box. When the returned word confidence is greater than a threshold (set to 80 in the experiment), the word is considered correctly recognized and the correctly recognized word is output as speech (the word is read aloud).

Note that the region of interest of the above recognition operation is the middle of the image. The homography between the image and the paper caused by the shooting angle has essentially no effect on recognition: since the camera does not shoot perpendicularly but along the direction of the fingertip, the text in the captured image is deformed relative to the real text, yet the region of interest lies in the middle of the image where the deformation is small, so the accuracy is barely affected and this disturbance can be ignored.

Step 42: use template matching to track the position of the correctly recognized word's word box in subsequent image frames to determine the new-word region in those frames, and recognize new words in the new-word region;

i. In practice, motion blur and similar factors reduce image sharpness, which disturbs correct word tracking; each frame is therefore binarized first;

Initialize s = 1 and initialize the fingertip speed V_fingertip; take the word box of the j-th word recognized in frame l, together with all the letter boxes it contains, as the word box and letter boxes to be tracked; take the region of interest in frame m as the search region; initialize m = l + 1;

ii. For each word box/letter box to be tracked, use template matching to locate its position in the search region;

The present invention uses the standard squared-difference matching algorithm, i.e. minimizes the following function:

R_sq_diff(x, y) = Σ_{x',y'} [T(x', y') − I(x + x', y + y')]²

In the formula above, T(x', y') is the pixel value at coordinates (x', y') of the word box/letter box to be tracked in frame l, I(x + x', y + y') is the pixel value at coordinates (x + x', y + y') in the search region, and (x, y) denotes the coordinates of the top-left vertex within the search region. The smaller R_sq_diff is, the better the match; in practice, when this value falls below a given threshold the match is considered successful, i.e. the word box/letter box is tracked successfully;
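A pure-Python sketch of the squared-difference criterion; in practice this corresponds to OpenCV's `cv2.matchTemplate` with `TM_SQDIFF`, and the exhaustive scan below is only for illustration:

```python
def sqdiff(template, image, x, y):
    """R_sq_diff at offset (x, y): sum of squared differences between the
    tracked box T and the patch of the search region I at that offset."""
    return sum(
        (template[yy][xx] - image[y + yy][x + xx]) ** 2
        for yy in range(len(template))
        for xx in range(len(template[0]))
    )

def track(template, image, thresh):
    """Slide the template over the search region; the offset with the
    smallest R_sq_diff is the match, accepted only below `thresh`."""
    h, w = len(template), len(template[0])
    H, W = len(image), len(image[0])
    best_val, best_pos = min(
        ((sqdiff(template, image, x, y), (x, y))
         for y in range(H - h + 1) for x in range(W - w + 1)),
        key=lambda t: t[0],
    )
    return best_pos if best_val < thresh else None
```

A template embedded verbatim in the search region is found with R_sq_diff = 0; if nothing scores below the threshold, the tracker reports failure (`None`).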

If the word box currently being tracked is tracked successfully, go to step iii. Otherwise, check whether any word box has been tracked successfully before. If so, first determine a cut-off line in frame m at the right of the matched position of the most recently successfully tracked word box, the cut-off line coinciding with the right edge of that matched position; then check whether the distance from the cut-off line to the left edge of the image is below the set threshold (0.6 times the horizontal width of the image). If it is, take the target text region to the right of the cut-off line as the new-word region and continue recognizing new words there, so that word reading stays synchronized with the finger; otherwise end new-word recognition, thereby completing word tracking and recognition. If no word box has been tracked successfully before, take frame m as the current image and perform steps 1 to 41 on it to recognize the first word;

If no word box is successfully tracked over several consecutive frames, tracking is considered to have failed; this is usually caused by moving the finger too fast;

iii. Update the fingertip speed V_fingertip:

V_fingertip = (Σ_{j=1}^{N1} V_word,j + Σ_{k=1}^{N2} V_letter,k) / (N1 + N2)

where V_word,j is the horizontal distance difference between the positions of the successfully tracked j-th word box in frame l and frame m, V_letter,k is the horizontal distance difference between the positions of the successfully tracked k-th letter box in frame l and frame m, N1 is the number of successfully matched word boxes, and N2 is the number of successfully matched letter boxes;

Computing the fingertip speed serves two purposes. First, it allows predicting after which frame a word box will move out of the image: if the abscissa of the top-left vertex of a word box/letter box in frame l minus V_fingertip × (m − l) is less than zero, the box is judged to move out of frame m. When a word box is judged to move out of frame m, the whole word is not discarded immediately; the remaining letters of that word still inside frame m, and their letter boxes, are kept, because by the definition of the fingertip speed they still contribute to its computation. When a letter box is judged to move out of frame m, the corresponding letter is discarded and its box no longer participates in the fingertip-speed computation. Second, it improves tracking efficiency: instead of searching the whole image at the next word-tracking step, a rectangular region determined from the fingertip speed is delimited as the new search region, and tracking is performed only inside it;
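The fingertip-speed update and its two uses can be sketched as follows; the box layout `(x, y, w, h)` and the way the offsets enter the new search region are our reading of the text, with the embodiment's 15- and 30-pixel offsets as defaults:

```python
def fingertip_speed(word_shifts, letter_shifts):
    """V_fingertip: mean horizontal shift of the N1 word boxes and N2
    letter boxes successfully matched between frame l and frame m."""
    n1, n2 = len(word_shifts), len(letter_shifts)
    return (sum(word_shifts) + sum(letter_shifts)) / (n1 + n2)

def will_exit(box_left_x, v_fingertip, l, m):
    """A box is judged to leave frame m when the abscissa of its top-left
    vertex in frame l minus V_fingertip*(m - l) drops below zero."""
    return box_left_x - v_fingertip * (m - l) < 0

def next_search_region(prev_box, v_fingertip, dx_off=15, dw_off=30):
    """New rectangular search region derived from the fingertip speed:
    shift left by the speed plus an offset, grow the length by the speed,
    and pad the width for vertical finger drift (offsets are empirical)."""
    x, y, w, h = prev_box
    return (x - v_fingertip - dx_off, y, w + v_fingertip, h + dw_off)
```

The averaged shift acts as a per-frame speed, so multiplying by (m − l) predicts where a box lands several frames later.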

iv. Let s = s + 1. For the word box of the s-th word recognized in frame l and every letter box it contains, first judge from the fingertip speed whether the word box and each letter box will move out of frame m; the word boxes/letter boxes judged to remain in the current frame become the boxes to be tracked. Delimit a rectangular region as the new search region: the abscissa of its top-left corner = the abscissa of the top-left corner of the previously tracked word box − the fingertip speed − a set offset (15 pixels in this embodiment of the invention); the length of the rectangular region = the length of the previously tracked word box + the fingertip speed; the width of the rectangular region = the width of the previously tracked word box + a set offset (30 pixels in this embodiment). Because the finger does not move strictly in parallel but may also drift downward or upward, a certain offset, obtained experimentally, is added. Compute the ratio of black pixels to white pixels in the new search region; if this ratio is below the set threshold (20% in this embodiment), discard the frame, as shown in Figure 9(b), i.e. do not continue tracking and recognizing word boxes in it; let m = m + 1 and repeat the above judgment and processing until the black-to-white pixel ratio in the new search region is not below the set threshold, then go to step v);

v. Return to step ii.

It should be noted that the above disclosure gives only specific examples of the present invention; variations conceivable by those skilled in the art in light of the ideas provided by the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A text detection and recognition method for assisted reading for the blind, characterized by comprising the following steps:

Step 1: for the image sequence captured by the camera, judge whether the scene in the current image is a finger placed on reading text; if so, go to step 2; otherwise skip this frame, take the next frame as the current image, and repeat the above judgment and processing;

Step 2: locate the user's fingertip in the current image;

Step 3: determine the text line indicated by the user according to the position of the user's fingertip;

Step 4: extract the words on the text line indicated by the user and convert them into speech output;

said step 2 comprising the following steps:

Step 21: use K-means to find the candidate region of the user's fingertip;

first, filter the current image with a Gaussian filter;

then generate three two-dimensional matrices from the images of the three channels of the filtered image, the element values of each matrix being the pixel values of the corresponding points of the corresponding channel image;

for each two-dimensional matrix, sum all its columns and take the average to obtain a row×1 column vector m_c_ave, and sum all its rows and take the average to obtain a 1×col row vector m_r_ave; the current image is thereby converted into three column vectors and three row vectors, where col is the total number of columns of the two-dimensional matrix and row is its total number of rows;

take each dimension of the column vectors as a vertical data point, the components of the three column vectors in that dimension forming the three features of its feature vector; the number of vertical data points equals the dimension of the column vectors, i.e. row; take each dimension of the row vectors as a horizontal data point, the components of the three row vectors in that dimension forming the three features of its feature vector; the number of horizontal data points equals the dimension of the row vectors, i.e. col;

secondly, cluster the vertical data points and the horizontal data points separately with K-means, the number of clusters being 2 in both cases;

then express the clustering result of the vertical data points as a vertical label vector, a row×1 column vector whose component in each dimension is the label (0 or 1) of the vertical data point of that dimension; express the clustering result of the horizontal data points as a horizontal label vector, a 1×col row vector whose component in each dimension is the label (0 or 1) of the horizontal data point of that dimension;

apply mean filtering and then thresholding to the vertical label vector and the horizontal label vector: if a component is greater than or equal to the set threshold, set it to 1, otherwise to 0, yielding the final vertical and horizontal label vectors;

take as top-left vertex the intersection of the horizontal line in the current image corresponding to the boundary between values 0 and 1 in the vertical label vector and the vertical line corresponding to the left boundary between values 0 and 1 in the horizontal label vector, and delimit a rectangular region as the candidate region of the user's fingertip;

Step 22: locate the fingertip by computing curvature;

first, extract the edges in the candidate region of the user's fingertip with the Canny operator and connect them into contours; if several contours are obtained, keep only those whose number of pixels is not less than the set threshold;

then smooth the retained contours;

finally, compute the curvature at every pixel of the smoothed contours; the point of zero curvature is the position of the user's fingertip.

2. The text detection and recognition method for assisted reading for the blind according to claim 1, characterized in that the method of judging in step 1 whether the scene in the current image is a finger placed on reading text is as follows:

Step 11: capture in advance with the camera some typical images containing the user's finger and the text region where it lies, and save them in a database;

Step 12: take the current image and several images captured before it as sample images;

Step 13: normalize the RGB color spaces of the database images and of the sample images respectively;

Step 14: for each sample, compute the Euclidean distances between its normalized red-channel image and the normalized red-channel images of the database images, taking the minimum as the sample's matching score; compute the mean μ_Im and variance σ_Im of all sample matching scores; if μ_Im + σ_Im < Th, the scene in the current image is judged to be a finger placed on reading text, the threshold Th being an empirical parameter.

3. The text detection and recognition method for assisted reading for the blind according to claim 2, characterized in that in step 14 all images are first reduced to a set size before the Euclidean distances are computed.

4. The text detection and recognition method for assisted reading for the blind according to claim 1, characterized in that, when clustering the vertical/horizontal data points with K-means, two sets of cluster centers are randomly initialized, clustering is performed twice, the compactness of the two clustering results is evaluated, and the more compact result is taken as the final clustering result.

5. The text detection and recognition method for assisted reading for the blind according to claim 1, characterized in that said step 3 comprises the following steps:

Step 31: extraction of text regions;

first, convert the current image into a grayscale image;

then binarize the grayscale image to obtain the foreground region and the background region of the image;

then exclude the non-text regions in the background region as follows:

first extract all connected regions in the background region to form a set C_R; then compute for each connected region in C_R its rotated rectangle, denoted ζ(o, θ, w, h), where o is the center of the rotated rectangle, θ is the deflection angle of the rotated rectangle, i.e. the angle between the horizontal axis, rotated counterclockwise, and the first side of the rotated rectangle it meets, and w and h are the two adjacent sides of the rotated rectangle; then filter out the connected regions of C_R whose rotated-rectangle area and deflection angle do not satisfy the constraints; for the remaining connected regions, filter further based on the relationships between text regions, the specific implementation comprising the following steps:

3.1) take the top-left vertex of the current image as the origin O, the length direction of the current image as the y-axis (positive to the right) and the width direction as the x-axis (positive downward); take the rotated-rectangle center of each connected region R as a point of interest, and express every line in the current image passing through a point of interest in the following form:

xcosθ_SL + ysinθ_SL = ρ_SL;

where θ_SL is the angle between the line and the x-axis, ρ_SL is the distance from the origin O to the line, θ_SL takes values in (-π/2, π/2), ρ_SL takes values in (-D, D), and D is the diagonal length of the original image captured by the camera;

3.2) subdivide the (ρ_SL, θ_SL) parameter space into multiple accumulator cells, denoting the value of the cell at coordinates (ρ_k, θ_k) by A(ρ_k, θ_k); first set all accumulator cells to zero, then compute the distance d of each point of interest (x_i, y_i) to the line xcosθ_k + ysinθ_k = ρ_k:

d = |x_i cosθ_k + y_i sinθ_k − ρ_k|;

examine the distances of all points of interest to the line xcosθ_k + ysinθ_k = ρ_k in turn, incrementing A(ρ_k, θ_k) by 1 each time a point's distance is below the threshold; after all points are examined, the final A(ρ_k, θ_k) values are obtained; if A(ρ_k, θ_k) is above the threshold, the corresponding line is taken as a reference line; let N be the number of reference lines obtained;

3.3) find the text regions by unsupervised line clustering, the specific procedure being:

3.31) input the set of points of interest and initialize the reference-line set C_L with the N reference lines obtained in step 3.2);

3.32) compute the distance from every point of interest to each reference line in C_L; for each point, take the minimum distance; keep only the points whose minimum distance is below the set threshold; label the kept points so that points whose minimum distance corresponds to the same reference line belong to the same class;

3.33) fit a straight line to the points of each class and judge whether the differences in slope and intercept between the newly fitted line and the class's reference line are below the set threshold; if so, leave the class's reference line in C_L unchanged, otherwise replace it with the newly fitted line; if in this step all reference lines in C_L remain unchanged, output the labeled point set C_S, the regions covered by the rotated rectangles of these points being the extracted text regions; otherwise return to step 3.31);

Step 32: determine the text line;

according to the fingertip position obtained in step 2, determine a rectangular region of interest, the fingertip lying on the bottom edge of the region of interest;

for the text regions extracted in step 31, compute the outermost contour of each character; take the bottommost point of each contour as a base point and keep the base points lying inside the region of interest;

for the kept base points, select three adjacent base points at a time and fit straight lines, obtaining multiple lines;

score all fitted lines with the following formula:
μ_score = (1/n) · Σ_{i=1}^{n} d(i)_distance
where d(i)_distance is the distance from the i-th kept base point to the line, n is the total number of kept base points, and μ_score is the score.

Select the line with the lowest score as the judgment line; from the kept base points, filter out those whose distance to the judgment line is smaller than the set threshold, then fit a straight line through all remaining base points; the fitted line is the text line indicated by the user.
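The text-line selection of step 32 (candidate lines fitted through adjacent base-point triples, scoring, judgment line, refit) can be sketched in Python. The function name `indicated_text_line` and the `inlier_thresh` parameter are illustrative assumptions; the score is taken here as the mean point-to-line distance (inferred from the symbols in the claim), and the final refit keeps the base points lying close to the judgment line:

```python
import numpy as np

def indicated_text_line(base_points, inlier_thresh=5.0):
    """Sketch of the text-line selection in step 32.

    base_points: (n, 2) sequence of the bottommost points of the
    character contours inside the region of interest, left to right.
    Returns (slope, intercept) of the fitted text line.
    """
    pts = np.asarray(base_points, dtype=float)
    n = len(pts)

    # Fit a line through each triple of adjacent base points.
    candidates = []
    for i in range(n - 2):
        x, y = pts[i:i + 3, 0], pts[i:i + 3, 1]
        candidates.append(tuple(np.polyfit(x, y, 1)))  # (k, b)

    # Score each candidate by the mean point-to-line distance over
    # all base points (mu_score); lower is better.
    def score(k, b):
        d = np.abs(k * pts[:, 0] - pts[:, 1] + b) / np.hypot(k, 1.0)
        return d.mean()

    k, b = min(candidates, key=lambda kb: score(*kb))  # judgment line

    # Refit on the base points near the judgment line (assumed
    # inlier selection; the threshold is a tunable parameter).
    d = np.abs(k * pts[:, 0] - pts[:, 1] + b) / np.hypot(k, 1.0)
    inliers = pts[d < inlier_thresh]
    return tuple(np.polyfit(inliers[:, 0], inliers[:, 1], 1))
```

For perfectly aligned base points the judgment line and the refit coincide; outlier points from ascenders or noise are what the scoring and filtering stages are there to suppress.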
6. The text detection and recognition method for assisting blind reading according to claim 5, wherein the specific processing of step 4 is as follows:

Step 41. Recognize the first word.

For the text area extracted in step 31, take the part lying inside the region of interest as the target text area; compute the minimum bounding rectangle of each character in the target text area as a letter box; cluster the letter boxes into words according to the difference between the distance of two adjacent letter-box centers within a word and the distance of two adjacent letter-box centers between words; take the minimum bounding rectangle of all letter boxes belonging to one word as the word box.

According to the angle between the user-indicated text line and the horizontal direction in the image, apply angle compensation to the image so that the text line is rotated to horizontal.

Select the first word box along the text line and recognize it with OCR; the recognition result includes the word, a word confidence, and the word box. When the returned word confidence exceeds the threshold, the word is considered correctly recognized, and the correctly recognized word is output as speech.

Step 42. Use template matching to track the position of the correctly recognized word's word box in subsequent image frames, so as to determine the new word area in those frames and recognize new words in the new word area.

i. Binarize each frame.

Initialize s = 1 and initialize the fingertip speed V_fingertip; take the word box corresponding to the j-th word recognized in frame l, together with all the letter boxes it contains, as the boxes to be tracked; take the region of interest in frame m as the search area; initialize m = l + 1.

ii. For each word box/letter box to be tracked, use template matching to track its position in the search area.

If the current word box is tracked successfully, go to step iii; otherwise, check whether any word box has been tracked successfully before. If so, first determine a cut-off line in frame m to the right of the matching position of the most recently tracked word box, the cut-off line coinciding with the right edge of that matching position; then judge whether the distance from the cut-off line to the left edge of the image is smaller than the set threshold; if so, take the target text area to the right of the cut-off line as the new word area and continue recognizing new words in it; otherwise, end new-word recognition. If no word box has been tracked successfully before, take frame m as the current image and perform steps 1 through 41 on it to recognize the first word.

iii. Update the fingertip speed V_fingertip:
V_fingertip = ( Σ_{j=1}^{N1} V_word,j + Σ_{k=1}^{N2} V_letter,k ) / (N1 + N2)
where V_word,j is the horizontal distance difference between the positions of the successfully tracked j-th word box in frame l and in frame m, V_letter,k is the horizontal distance difference between the positions of the successfully tracked k-th letter box in frame l and in frame m, N1 is the number of successfully matched word boxes, and N2 is the number of successfully matched letter boxes.

iv. Let s = s + 1.

For the word box corresponding to the s-th word recognized in frame l and each letter box it contains, use the fingertip speed to judge whether the box will move out of frame m: if the abscissa of the box's top-left vertex in frame l minus V_fingertip × (m − l) is smaller than zero, the box is judged to move out of frame m.

Take the boxes judged not to move out of the current frame as the boxes to be tracked, and delimit a rectangular area as the new search area, where: abscissa of the rectangle's top-left corner = abscissa of the top-left corner of the previously tracked word box − fingertip speed − set offset; length of the rectangle = length of the previously tracked word box + fingertip speed; width of the rectangle = width of the previously tracked word box + set offset.

Compute the ratio of black pixels to white pixels in the new search area; if this ratio is smaller than the set threshold, discard the frame, let m = m + 1, and repeat the above judgment and processing until the black-to-white pixel ratio in the new search area is not smaller than the set threshold, then go to step v; otherwise, go directly to step v.

v. Return to step ii.
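The fingertip-speed update of step iii and the move-out prediction of step iv reduce to simple arithmetic; the helper names below are illustrative sketches, not taken from the patent:

```python
def update_fingertip_speed(word_shifts, letter_shifts):
    """Step iii: V_fingertip is the mean horizontal shift, between
    frame l and frame m, over the N1 matched word boxes and the
    N2 matched letter boxes."""
    shifts = list(word_shifts) + list(letter_shifts)
    # (sum of V_word,j + sum of V_letter,k) / (N1 + N2)
    return sum(shifts) / len(shifts)

def will_leave_frame(box_left_x, v_fingertip, l, m):
    """Step iv: a box tracked in frame l is predicted to move out of
    frame m when its top-left abscissa minus V_fingertip * (m - l)
    falls below zero."""
    return box_left_x - v_fingertip * (m - l) < 0
```

Averaging over both word and letter boxes makes the speed estimate robust when only some templates match in a given frame, and the per-frame prediction lets the tracker drop boxes before template matching is attempted on them.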
CN201910501311.9A 2019-06-11 2019-06-11 Text detection and identification method for assisting reading of blind people Expired - Fee Related CN110458158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910501311.9A CN110458158B (en) 2019-06-11 2019-06-11 Text detection and identification method for assisting reading of blind people


Publications (2)

Publication Number Publication Date
CN110458158A CN110458158A (en) 2019-11-15
CN110458158B true CN110458158B (en) 2022-02-11

Family

ID=68480723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910501311.9A Expired - Fee Related CN110458158B (en) 2019-06-11 2019-06-11 Text detection and identification method for assisting reading of blind people

Country Status (1)

Country Link
CN (1) CN110458158B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597956B (en) * 2020-05-12 2023-06-02 四川久远银海软件股份有限公司 Picture and text recognition method based on deep learning model and relative azimuth calibration
CN112200738A (en) * 2020-09-29 2021-01-08 平安科技(深圳)有限公司 Method and device for identifying protrusion of shape and computer equipment
CN112836587B (en) * 2021-01-08 2024-06-04 中国商用飞机有限责任公司北京民用飞机技术研究中心 Runway identification method, device, computer equipment and storage medium
CN114419144B (en) * 2022-01-20 2024-06-14 珠海市一杯米科技有限公司 Card positioning method based on external contour shape analysis
CN114821782B (en) * 2022-04-26 2025-01-17 辽宁科技大学 Method and device for inputting text written in air
CN115019181B (en) * 2022-07-28 2023-02-07 北京卫星信息工程研究所 Remote sensing image rotating target detection method, electronic equipment and storage medium
CN115909342B (en) * 2023-01-03 2023-05-23 湖北瑞云智联科技有限公司 Image mark recognition system and method based on contact movement track
CN116740721B (en) * 2023-08-15 2023-11-17 深圳市玩瞳科技有限公司 Finger sentence searching method, device, electronic equipment and computer storage medium
CN120129935A (en) * 2023-09-28 2025-06-10 京东方科技集团股份有限公司 Character recognition method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646194A (en) * 2012-02-22 2012-08-22 大连理工大学 A Method of Printer Type Forensics Using Character Edge Features
CN106650628A (en) * 2016-11-21 2017-05-10 南京邮电大学 Fingertip detection method based on three-dimensional K curvature
CN107209563A (en) * 2014-12-02 2017-09-26 Siemens AG User interface and method for operating a system
CN107949851A (en) * 2015-09-03 2018-04-20 Gestigon GmbH Fast and robust identification of the extremities of an object within a scene
CN109377834A (en) * 2018-09-27 2019-02-22 成都快眼科技有限公司 A kind of text conversion method and system of helping blind people read

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9075441B2 (en) * 2006-02-08 2015-07-07 Oblong Industries, Inc. Gesture based control using three-dimensional information extracted over an extended depth of field
US20140176689A1 (en) * 2012-12-21 2014-06-26 Samsung Electronics Co. Ltd. Apparatus and method for assisting the visually impaired in object recognition
US20160328604A1 (en) * 2014-01-07 2016-11-10 Arb Labs Inc. Systems and methods of monitoring activities at a gaming venue


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FingerReader: A Wearable Device to Explore Printed Text on the Go; Roy Shilkrot et al.; CHI '15: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems; 2015-04-18; pp. 2363-2372 *
Parameter Selection of Image Fog Removal Using Artificial Fish Swarm Algorithm; Fan Guo et al.; International Conference on Intelligent Computing ICIC 2018: Intelligent Computing Theories and Application; 2018-07-06; pp. 167-173 *
Real-time gesture recognition and virtual writing system based on depth information; Huang Xiaolin et al.; Computer Engineering and Applications; 2015-09-07; vol. 51, no. 15, pp. 25-37 *

Also Published As

Publication number Publication date
CN110458158A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110458158B (en) Text detection and identification method for assisting reading of blind people
CN112686812B (en) Bank card tilt correction detection method, device, readable storage medium and terminal
Shahab et al. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images
CN111160352B (en) Workpiece metal surface character recognition method and system based on image segmentation
Yi et al. Assistive text reading from complex background for blind persons
Plamondon et al. Online and off-line handwriting recognition: a comprehensive survey
Lu et al. Scene text extraction based on edges and support vector regression
CN115331245B (en) Table structure identification method based on image instance segmentation
CN110287952B (en) Method and system for recognizing characters of dimension picture
CN107491730A (en) A kind of laboratory test report recognition methods based on image procossing
US20110063468A1 (en) Method and apparatus for retrieving label
CN111091124B (en) Spine character recognition method
CN101266654A (en) Image text localization method and device based on connected components and support vector machines
CN105426890B (en) A kind of graphical verification code recognition methods of character distortion adhesion
CN112329779A (en) Method and related device for improving certificate identification accuracy based on mask
CN112364862B (en) Histogram similarity-based disturbance deformation Chinese character picture matching method
Park et al. Automatic detection and recognition of Korean text in outdoor signboard images
CN107480585A (en) Object detection method based on DPM algorithms
CN107122775A (en) A kind of Android mobile phone identity card character identifying method of feature based matching
Rath et al. Indexing for a digital library of George Washington’s manuscripts: a study of word matching techniques
Vo et al. Deep learning for vietnamese sign language recognition in video sequence
CN112651323B (en) Chinese handwriting recognition method and system based on text line detection
CN115995080A (en) Archive intelligent management system based on OCR (optical character recognition)
CN114332865A (en) Certificate OCR recognition method and system
CN111339932B (en) A kind of palmprint image preprocessing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220211