CN102084378B - Camera-based document imaging

Info

Publication number: CN102084378B
Application number: CN200980125859.2A
Authority: CN
Inventors: M·亨特; M·帕夫罗斯卡亚; L·戈登; W·蒂普顿; T·普哈姆; D·永; 顾卫青; J·埃根; 吴梁楠; K-C·旺
Original assignee: Compulink Management Center Inc
Current assignee: Compulink Management Center Inc
Priority date: 2008-05-06
Filing date: 2009-05-06
Publication date: 2014-08-27
Anticipated expiration: 2029-05-06
Also published as: WO2009137073A1; WO2009137634A1; US20140247470A1; CN102084378A; US20100073735A1; GB2472179B; GB2472179A; GB201020669D0

Abstract

Processes and systems for converting digital photographs of text documents into scan quality images are disclosed. A grid representing deformations in the image is constructed on the image by extracting the document text from the image and analyzing visual cues from the text. The image is transformed to straighten such a grid, thereby removing the distortions introduced by the camera image capture process. Variations in illumination, extraction of text line information, and modeling of curved lines in the image may all be corrected.

Description

Camera-based document imaging

Cross-Reference to Related Applications

This patent application claims priority under 35 U.S.C. 119(e) to U.S. Provisional Application No. 61/126,781, filed May 6, 2008, and U.S. Provisional Application No. 61/126,779, filed May 6, 2008, both of which are hereby incorporated by reference.

Technical Field

This application relates generally to digital image processing, and more particularly to processing images captured by cameras.

Background

Document management systems are becoming increasingly popular. Such systems ease the burden of storing and handling large databases of documents. Many organizations store large amounts of information in physical documents that they wish to convert to digital form for ease of management. At present, a combination of optical scanning and optical character recognition (OCR) technology, as embodied for example in ABBYY-FineReaderPro 8.0, converts these documents into electronic form. This process can be inconvenient, however, especially for media such as bound books or posters, which are difficult to scan quickly and accurately. In addition, preparing documents and then scanning them can be slow.

It is preferable to store images that are aesthetically pleasing and contain only minor distortions. Severe distortions make images harder to read. Moreover, optical character recognition assumes that the input image is free of distortion. For purposes of this application, a document image without significant distortion is referred to herein as an "ideal image".

In many situations, modern digital cameras have the potential to improve the digitization process. Cameras are generally smaller and easier to operate than scanners. Also, a document needs little preparation before being captured by a camera. For example, a poster or sign can be left on the wall. The drawback of this flexibility is that it introduces defects into the image. A photograph captured by a camera may be distorted in ways that do not occur in a scanned image. The most obvious effects are distortions due to perspective, the camera lens, uneven lighting conditions, and physically warped documents. Current OCR technology expects its input to come from a scanner and therefore does not perform the preprocessing needed to handle these distortions in captured document images. Because OCR is a key part of image processing in document management software, the distortions introduced by cameras when capturing document images make current cameras an unsatisfactory replacement for scanners. Dewarping camera-captured document images and removing their distortions is therefore a necessary step in the transition from scanners to cameras.

Most research on image correction focuses on specific types of warping. One method of flattening an arbitrarily warped document is to project the photograph onto a 3D grid that approximates the surface of the original document. (See Michael S. Brown and W. Brent Seales, "Image restoration of arbitrarily warped documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), pp. 1295-1306, 2004.) The flattening algorithm models the grid as a collection of point masses connected by springs and subject to gravity. By letting the springs settle into a state of minimum potential energy, the algorithm attempts to minimize stretching of the surface. Although this approach has proven successful, it relies on time-stepped physical modeling. Experimental run times for this algorithm are on the order of minutes, which is too slow. Furthermore, the algorithm assumes that it has an accurate 3D surface representing the document, which would have to be reconstructed from information extracted from the 2D image.

One method of dewarping an image without prior knowledge of the document surface is to build a grid on the image based on information gathered from the lines of text within the document. (See Shijian Lu and Chew Lim Tan, "Document flattening through grid modeling and regularization," Proceedings of the 18th International Conference on Pattern Recognition, No. 1, pp. 971-974, 2006.) This method assumes that the text lines in the original document are straight and evenly spaced, and that the curvature within each grid cell is approximately constant. Each grid cell represents a square of the same size in the original document. In the warped image, the top and bottom sides of a grid cell should be parallel to the tangent vector, and the left and right sides should be parallel to the normal vector. Each quadrilateral cell is mapped to a square by a linear transformation, effectively dewarping the document. In some cases, this method lacks the information needed to determine the alignment and spacing of the vertical cell boundaries. Some have attempted to obtain this information by "vertical stroke analysis," which uses the straight-line segments of individual characters as indicators of the vertical direction of the text. (See Shijian Lu, Ben M. Chen, and C. C. Ko, "Perspective rectification of document images using fuzzy set and morphological operations," Image and Vision Computing, No. 24, pp. 541-553, 2005.)

To create a continuous, smooth transformation without an intermediate grid structure, another method models the page as a developable surface. (See Jian Liang, Daniel DeMenthon, and David Doermann, "Unwarping Images of Curved Documents Using Global Shape Optimization," Proc. First International Workshop on Camera-based Document Analysis and Recognition, pp. 25-29, 2005.) A developable surface is the result of bending a plane without stretching it. This method attempts to find the rulings of the surface by analyzing the text. Rulings are lines along the surface that were straight before the plane was bent. The inverse transformation dewarps the surface by straightening the rulings.

However, none of these methods has been found to be entirely satisfactory for dewarping documents captured with a digital camera.

Summary of the Invention

It is an object of the present invention to solve, or at least ameliorate, one or more of the above-mentioned problems associated with digital images. Accordingly, there is provided a method for processing a photographed image of a document containing lines of text, where the lines of text comprise text characters having vertical strokes. The method includes analyzing the position and shape of the text lines and straightening them into a regular grid in order to dewarp the image of the document. In one embodiment, the method includes three main steps: (1) text detection, (2) shape and orientation detection, and (3) image transformation.

The text detection step finds the pixels in the image that correspond to text and creates a binary image containing only those pixels. This process handles unpredictable lighting conditions by identifying the local background light intensity. Text pixels are grouped into character regions, and characters are grouped into text lines.

The shape and orientation detection step identifies typographic features and determines the orientation of the text. The extracted features are points in the text corresponding to the tops and bottoms of text characters (endpoints) and the angles of vertical lines in the text (vertical strokes). In addition, curves are fitted to the tops and bottoms of the text lines to approximate the shape of the original document.

The image transformation step relies on a grid-building process in which the extracted features serve as the basis for identifying the warping of the document. A vector field is generated to represent the horizontal and vertical stretch of the document at each point. Alternatively, an approach based on an optimization problem may be used.

Further aspects, objects, and desirable features and advantages of the present invention will be better understood from the following description considered in connection with the accompanying drawings, in which various embodiments of the disclosed invention are illustrated by way of example. It is to be expressly understood, however, that the drawings are for the purpose of illustration only and are not intended as a definition of the limits of the invention.

Brief Description of the Drawings

FIG. 1 is a flowchart illustrating the steps of a camera-based document image dewarping process.

FIG. 2 illustrates a photograph including an example image of a document containing lines of text.

FIG. 3 illustrates the output image of the photograph of FIG. 2 after the image of FIG. 2 is binarized using simple thresholding.

FIG. 4 illustrates the output image of the photograph of FIG. 2 after binarization using Retinex-type normalization followed by thresholding.

FIG. 5 illustrates a grayscale image, created from a photograph, of an extremely warped document containing lines of text, along with other documents.

FIG. 6 illustrates the output image after filtering is performed on the image of FIG. 5.

FIG. 7 illustrates the output image after rough thresholding is performed on the output image of FIG. 6.

FIG. 8 illustrates the output image after a process is performed on the output image of FIG. 6 in which the foreground (the regions initially identified as text) is removed and the empty pixels are interpolated.

FIG. 9 illustrates the output image after the complete binarization process is performed on the image of FIG. 5.

FIG. 10 is a diagram illustrating various features of English typography.

FIG. 11 illustrates a photographed image of a document with lines of text, in which control points have been marked as dark and light dots.

FIG. 12 illustrates the output image after an optimization-based dewarping process is performed on the image of FIG. 11.

FIG. 13 depicts one embodiment of a system for processing captured images.

FIG. 14 is a flowchart illustrating the steps of an alternative embodiment of the camera-based document image dewarping process.

FIG. 15 is a flowchart illustrating the steps of another embodiment of the camera-based document image dewarping process.

Detailed Description

Embodiments of the present invention will now be described with reference to the accompanying drawings. For ease of description, any reference numeral designating an element in one figure designates the same element in any other figure. FIG. 1 is a flowchart illustrating the steps of a camera-based document image dewarping process according to one embodiment of the present invention.

Referring to FIG. 1, a method 100 for dewarping a document image captured by a camera is provided. Method 100 involves analyzing the position and shape of the lines of text included in the imaged document and then straightening them into a regular grid. In the illustrated embodiment, method 100 includes three main steps: (1) a text detection step 102, (2) a shape and orientation detection step 104, and (3) an image transformation step 106. As described below, each main step may further include several sub-steps.

1. Text detection

The text detection step 102 finds the pixels in the image that correspond to text and creates a binary image containing only those pixels. In this embodiment, the text detection step 102 handles unpredictable lighting conditions by identifying the local background light intensity. In this embodiment, five sub-steps are performed in the text detection step 102 in order to identify text properly. These sub-steps are a binarization step 110, a text region detection step 112, a text line grouping step 114, a centroid spline computation step 116, and a noise removal step 118. In other embodiments, different sub-steps may be used, or their order may vary.

1.1 Binarization

Binarization 110 is the process of identifying the pixels in the image that make up text, thereby dividing the image into text and non-text pixels. The purpose of binarization is to locate the text and eliminate irrelevant information, so that useful information about the shape of the document can be extracted from the image. The process takes the original color image as input. Its output is a binary matrix with the same dimensions as the original image, in which zeros mark the locations of text in the input image and ones mark everything else. In other implementations, this may be reversed. The binarization process preferably involves (a) pixel normalization, (b) thresholding, and (c) artifact removal, each of which is described in more detail below.

a. Pixel normalization

In general, text pixels are darker than their surroundings. Simple or rough binarization techniques typically apply a specific threshold and assume that every pixel in an image brighter than the threshold is white and every pixel darker than the threshold is black. Although this technique works well for scanned documents, a single global threshold does not work well across the variety of images captured by photographing documents, because of differences in lighting and font weight. FIG. 2 illustrates a photograph including an example image 202 of a document, where the document contains lines of text and has poor imaging quality. Note that, because the original document is warped, the lighting on the upper right region 204 of the image 202 is darker than on the rest of the image 202. FIG. 3 illustrates the output image 206 of the photograph of FIG. 2 after the image 202 of FIG. 2 is binarized using simple thresholding. Note that the entire upper right region 208 of the image 202 is treated as a text region.

To handle this variation in intensity, in one embodiment, a normalization operation may be performed on each pixel based on its intensity relative to its surroundings. In this respect, an approach from Retinex may be used. (See Glenn Woodell, "Retinex image processing," http://dragon.larc.nasa.gov/, 2007.) Following Retinex, the original image is divided into blocks that are large enough to contain several text characters but small enough to have more consistent lighting than the page as a whole. Because a typical document has fewer text pixels than background pixels, the median value in a block will be approximately the intensity of the background page in that block. Each pixel value can then be divided by the median of its block to obtain a normalized value.
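
The block-median normalization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `normalize_by_block_median`, the fixed square block size, and the list-of-lists grayscale representation are all assumptions made for the example.

```python
from statistics import median


def normalize_by_block_median(gray, block):
    """Divide each pixel by the median intensity of its block.

    gray  : list of rows of grayscale intensities (hypothetical input format)
    block : side length of the square blocks, in pixels (assumed fixed here)
    """
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, block):
        for left in range(0, w, block):
            rows = range(top, min(top + block, h))
            cols = range(left, min(left + block, w))
            # The median approximates the background paper intensity,
            # because text pixels are the minority in a typical block.
            med = median(gray[r][c] for r in rows for c in cols)
            med = max(med, 1)  # guard against an all-black block
            for r in rows:
                for c in cols:
                    out[r][c] = gray[r][c] / med
    return out
```

With this sketch, a text pixel in a brightly lit block and a text pixel in a dimly lit block receive similar normalized values, provided each is equally dark relative to its own background.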

It should be understood that the block size can be adjusted and that a variety of block sizes may be used. For example, if the block size is too large, the median of the block may not accurately represent the background because of uneven lighting across the page. On the other hand, if the block size is too small relative to the size of the text characters, the median may represent the text intensity rather than the background intensity. Furthermore, because conditions vary across a document page, a single block size may not suit the entire image. For example, text characters in headings are often larger and therefore require a larger block size.

One process for determining an appropriate block size starts by taking the entire image and dividing it into many very small blocks. The blocks are then recombined step by step. At each level of recombination, the current block is evaluated to determine whether it is large enough to use. The recombination process may stop at different levels at different points on the page. Whether a block is "large enough" may be decided by an additional heuristic. For example, a discrete second derivative, or Laplacian, may be applied to the input image, because a nonzero Laplacian correlates very strongly with the location of text in a document. Sizing a block so that it contains a specific amount of summed Laplacian can therefore ensure that the block is large enough to contain several text characters.
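
The Laplacian heuristic can be illustrated with a short sketch. It simplifies the recombination described above to repeatedly doubling one block anchored at a given corner. The names `abs_laplacian` and `grow_block`, the 4-neighbour Laplacian, and the `target` parameter are assumptions for illustration only.

```python
def abs_laplacian(gray):
    """Magnitude of the 4-neighbour discrete Laplacian (borders left at 0)."""
    h, w = len(gray), len(gray[0])
    lap = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            lap[r][c] = abs(gray[r - 1][c] + gray[r + 1][c]
                            + gray[r][c - 1] + gray[r][c + 1]
                            - 4 * gray[r][c])
    return lap


def grow_block(lap, top, left, start, target):
    """Double the block side until it contains `target` summed Laplacian.

    Stops early if the block already reaches the bottom-right of the image.
    """
    h, w = len(lap), len(lap[0])
    size = start
    while True:
        total = sum(lap[r][c]
                    for r in range(top, min(top + size, h))
                    for c in range(left, min(left + size, w)))
        if total >= target or (top + size >= h and left + size >= w):
            return size
        size *= 2
```

A block over blank paper keeps growing because its summed Laplacian stays near zero, while a block that already covers text stops at a small size.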

It should be understood that the method described above for determining whether a block is large enough for normalization can be fine-tuned for a particular application (for example, camera type, document type, lighting, and so on).

b. Thresholding

As noted above, once pixels have been normalized relative to the background paper color, pixels on the background have a normalized value of approximately one, while pixels on text have a much lower normalized value. The comparison is therefore unaffected by the absolute brightness or darkness of the image. Because the normalization of a pixel uses only its local surroundings, it is also independent of local variations in lighting across the page.

A threshold is chosen to distinguish white values from black values. Because the intensity characteristics of the individual image have already been filtered out by the normalization described above, a single threshold works consistently for most images. Moreover, because the normalized background has pixel values of approximately one, in one embodiment a threshold slightly below one is chosen, for example 0.90 or 0.95. In other embodiments, it is contemplated that other suitable thresholds may be used, and that different blocks may use different values.
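
As a small illustration of thresholding normalized values, the sketch below follows the convention stated earlier that zero marks text and one marks everything else. The function name `threshold_normalized` is hypothetical, and the default of 0.90 is one of the example values given above.

```python
def threshold_normalized(norm, t=0.90):
    """Map normalized intensities to 0 (text) or 1 (background)."""
    return [[0 if v < t else 1 for v in row] for row in norm]
```

A per-block threshold, as contemplated above, would replace the single `t` with a value looked up for each pixel's block.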

FIG. 4 illustrates the output image obtained when thresholding according to the present invention is performed on the non-ideal image illustrated in FIG. 2 after local normalization. A significant improvement can be observed when compared with the result of simple binarization illustrated in FIG. 3. In FIG. 4, the text lines 212 in the upper right region can now be distinguished from the background 214.

c. Artifact removal

As shown in FIG. 4, in many cases artifacts or noise will be present in the image after thresholding. The purpose of this stage is to identify and remove false positives or noise. For example, the edges of the paper tend to be thin and dark relative to their surroundings. Noise may also appear in the background when a particular block contains no text. Such noise (including, for example, noise due to lighting aberrations) may be identified as text. Additional post-processing is therefore preferably used to remove the noise.

One process for removing noise separates the black, or text, pixels in the binarized image into connected components. Three criteria are used to discard connected regions that are not text. The first two criteria check whether a region is "too large" or "too small" based on its number of pixels. The third criterion is based on the observation that a region is likely to be noise if it consists entirely of pixels close to the first threshold. Actual text characters, or glyphs, may have some borderline pixels, but most of their pixels should be much darker. The average normalized value of the entire region can therefore be checked, and regions whose average normalized value is too high should be removed. These criteria introduce three parameters: a minimum region area, a maximum region area, and a threshold for the region-wise average pixel value. The region-wise threshold should be lower (stricter) than the pixel-wise threshold in order to have the desired effect on noise removal.
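
The three criteria can be sketched with a plain breadth-first connected-component search. This is an illustrative sketch only: the function name `remove_artifacts`, the use of 4-connectivity, and the parameter names are assumptions, and text is marked with zero as in the convention stated earlier.

```python
from collections import deque


def remove_artifacts(binary, norm, min_area, max_area, region_thresh):
    """Whiten connected text regions that fail the size or darkness tests.

    binary        : 0 = text, 1 = background
    norm          : normalized intensities, same shape as `binary`
    region_thresh : region-wise mean threshold (stricter than pixel-wise)
    """
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for r0 in range(h):
        for c0 in range(w):
            if binary[r0][c0] != 0 or seen[r0][c0]:
                continue
            # Breadth-first search collects one 4-connected text region.
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and binary[nr][nc] == 0 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            mean = sum(norm[r][c] for r, c in region) / len(region)
            wrong_size = not (min_area <= len(region) <= max_area)
            if wrong_size or mean > region_thresh:
                for r, c in region:
                    out[r][c] = 1  # reclassify the region as background
    return out
```

A thin, borderline-dark strip such as a paper edge would typically be removed by the mean test even when its area falls within the allowed range.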

In the pixel normalization step of the binarization process described above, the background paper color is estimated and a pixel is identified as text if it is much darker than that color; the image is divided into blocks, and the median color of each block is assumed to be its background paper color. This method works well provided that the parameters mentioned above are well chosen. However, what constitutes a good choice of parameters can vary drastically from one image to another, or even from one part of an image to another. To avoid these problems, the alternative binarization process described below may be used.

Alternatively, in this embodiment, the binarization step 110 may be carried out by performing the following preferred steps. First, the foreground is roughly estimated by a rough thresholding method. The parameters for this rough thresholding are chosen to err on the side of identifying too many pixels as text. Next, these foreground pixels are removed from the original image according to the chosen threshold. The holes left by removing the foreground pixels are then filled by interpolating from the remaining values. Removing the initially thresholded pixels and interpolating over the holes provides a new estimate of the background. Finally, thresholding can be performed based on the improved estimate of the background. This process works well even under uneven lighting conditions on the photographed document. A more detailed description of how this preferred binarization step 110 is performed is provided below.

First, a photograph of a document including lines of text is converted into a grayscale image 216, as shown in FIG. 5. The grayscale image 216 includes an example image of a document containing lines of text, in which an extremely warped main document 218 is shown together with other documents 220. In one embodiment, the conversion to grayscale may be performed using Matlab's rgb2gray function.

Second, the image is preprocessed to reduce noise, thereby smoothing the captured image. In one embodiment, smoothing may be performed using a Wiener filter, which is a low-pass filter. The image 222 shown in FIG. 6 illustrates the output image after filtering is performed on the image of FIG. 5. Although the image 222 shown in FIG. 6 looks much like its input image 216 shown in FIG. 5, the filter removes salt-and-pepper noise well. The Wiener filter may be applied, for example, using Matlab's wiener2 function with a 3×3 neighborhood.

Third, the foreground is estimated using simple or rough thresholding. In this embodiment, the method used is due to Sauvola; it computes the mean and standard deviation of the pixel values in a neighborhood around each pixel and uses these data to decide whether each pixel is dark enough to be likely text. (See J. Sauvola and M. Pietikainen, "Adaptive Document Image Binarization," Pattern Recognition, vol. 33, pp. 225-236, 2000, which is hereby incorporated by reference.) FIG. 7 illustrates the output image 224 after rough thresholding is performed on the output image 222 of FIG. 6. In other embodiments, a method such as Niblack's may also be used. (See Wayne Niblack, "An Introduction to Digital Image Processing," Section 5.1, pp. 113-117, Prentice Hall International, 1985, which is hereby incorporated by reference.)
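
For illustration, Sauvola's local threshold is commonly written as T = m · (1 + k · (s / R − 1)), where m and s are the mean and standard deviation in a window around the pixel. The sketch below applies that formula directly. The function name `sauvola_text_mask` and the defaults `window=15`, `k=0.5`, and `R=128` (a typical choice for 8-bit images) are assumptions, not values given in this document, and the brute-force window scan is chosen for clarity rather than speed.

```python
from statistics import mean, pstdev


def sauvola_text_mask(gray, window=15, k=0.5, R=128.0):
    """Return a boolean mask that is True where a pixel is likely text."""
    h, w = len(gray), len(gray[0])
    half = window // 2
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Collect the neighborhood, clipped at the image borders.
            patch = [gray[i][j]
                     for i in range(max(0, r - half), min(h, r + half + 1))
                     for j in range(max(0, c - half), min(w, c + half + 1))]
            m, s = mean(patch), pstdev(patch)
            threshold = m * (1 + k * (s / R - 1))
            mask[r][c] = gray[r][c] < threshold
    return mask
```

Raising the threshold, for example by lowering `k`, marks more pixels as text, which matches the stated preference for over-identifying text at this stage.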

In regions such as the top of page 226, where the standard deviation is very small, the output is mostly noise. This is one reason the window size matters. Noise also appears where the contrast is strong, for example around the edge 228 of the paper. The presence of these noise artifacts is unimportant, however, because they can be removed at a later stage. In this embodiment, the parameters are chosen to favor a large number of false positives over false negatives, because the subsequent steps work best when there are no false negatives.

Fourth, the background can be found by first removing the foreground (the regions initially identified as text) using the initial thresholding and then interpolating over the holes left by the foreground removal. For the pixels identified as text by the initial thresholding, the color values are replaced by values interpolated from neighboring pixels in order to approximate the background, as shown in image 230 of FIG. 8. FIG. 8 illustrates the output image 230 after a process is performed on the output image 224 of FIG. 7 in which the foreground is removed and the empty pixels are interpolated. This image 230 may contain noise from text artifacts, because some darker pixels surrounding the text may not have been identified as text in the initial thresholding step. This effect is another reason to use a larger superset of the foreground in the initial thresholding step when estimating the background.
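
The hole-filling idea in the fourth step can be sketched as below. The document does not specify the interpolation scheme, so this example assumes a simple one: holes are filled from the outside in, each taking the average of its already-known 4-neighbours. The function name `estimate_background` is hypothetical.

```python
def estimate_background(gray, text_mask):
    """Estimate the page background by filling text pixels from neighbours.

    gray      : list of rows of grayscale intensities
    text_mask : True where the rough thresholding marked a pixel as text
    """
    h, w = len(gray), len(gray[0])
    bg = [[float(v) for v in row] for row in gray]
    known = [[not t for t in row] for row in text_mask]
    holes = {(r, c) for r in range(h) for c in range(w) if text_mask[r][c]}
    while holes:
        filled = {}
        for r, c in holes:
            vals = [bg[nr][nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= nr < h and 0 <= nc < w and known[nr][nc]]
            if vals:
                filled[(r, c)] = sum(vals) / len(vals)
        if not filled:  # no known pixel anywhere; nothing to interpolate from
            break
        # Commit this layer only after the scan, so each layer uses
        # values from strictly earlier layers.
        for (r, c), v in filled.items():
            bg[r][c] = v
            known[r][c] = True
        holes.difference_update(filled)
    return bg
```

The final thresholding can then compare each original pixel against the corresponding value in the estimated background.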

Finally, thresholding is performed based on the estimated background image 230 of FIG. 8. In one embodiment, the comparison between the preprocessed output image 224 of FIG. 7 and the background image 230 of FIG. 8 is performed by the method of Gatos. (See B. Gatos, I. Pratikakis, and S.J. Perantonis, "Adaptive Degraded Document Image Binarization," Pattern Recognition, Vol. 39, pp. 317-327, 2006, which is hereby incorporated by reference.) FIG. 9 illustrates the output image 240 after the complete binarization process has been performed on the image 216 of FIG. 5. In FIG. 9, the text regions 242 are well separated from the background 244, even in the extremely curled region near the edge 246 of the main document 248.

Post-processing can be performed at a later stage. Thresholds can be applied to the largest and smallest regions, and common instances of noise (e.g., the large dark lines 250 around the edge of the main document 248) can be removed.

Thus, the binarization step 110 described above with respect to FIGS. 5-9 can take as input a photographic image of an extremely curled document 218 captured under poor lighting conditions and successfully convert it into a binarized image 240 of the document, in which the text regions can be distinguished from their background.

1.2 Text region detection

After the positions of the text pixels in the image have been extracted, useful features of the original document can be identified, in particular the local horizontal and vertical text orientations. A vector field can then be built to model the text flow of the document. Note that the horizontal and vertical data are separate in the image: although these directions are orthogonal in the source document, the perspective transformation decouples them. The orientations can be identified at locations with clear text features and then interpolated across the page to describe the surface of the entire document.

Referring to FIG. 10, languages that use the Latin character set have a large number of characters containing one or more long, straight, vertical lines, called vertical strokes 260. There are relatively few diagonal lines of similar length, and they are usually at a significant angle to neighboring vertical strokes. This regularity makes vertical strokes an ideal text feature for obtaining information about the vertical direction of the page.

To find the horizontal direction of the page, the set of parallel horizontal lines in a single text line, called rulings, can be used. Unlike the vertical strokes 260, the rulings themselves are not visible in the source document. In general, the tops and bottoms of characters fall on two main rulings, called the x-height 262 and the baseline 264. The x-height 262 and baseline 264 rulings define the top and bottom, respectively, of the text character x. In some text characters, such as d and h, a part of the character, called an ascender 266, extends above the x-height. Conversely, a descender 268 is a part of a text character that extends below the baseline, as in y or q. In this embodiment, the x-height 262 and baseline 264 are located from the local maxima and minima (endpoints) of the character regions. These endpoints are the "highest" and "lowest" pixels in a character region, where the directions of high and low are determined from a rough spline through the centroid of each character region in the text line. The endpoints are subsequently used in the curve-fitting process, which is described in a separate section.

Two pixels are connected if they have the same color, are adjacent to each other, and share a common side. A pixel region is a set of connected black pixels. In this patent document, the terms "connected component," "connected region," and simply "character region" are used interchangeably.

A correctly binarized image should consist of a set of connected regions, each of which is assumed to correspond to a single text character that may be rotated or skewed but has no significant local curvature. The text region detection step 112 organizes all pixels identified as text pixels in the preceding binarization step into connected pixel regions. When the binarization step succeeds (that is, the binarized image has low noise and the text characters are well resolved), each text character should be identified as one connected region. There are cases, however, in which a group of text characters is marked as a single contiguous region.

In this embodiment, Matlab's built-in region-finding algorithm, which is a standard breadth-first search algorithm, can be used to implement the text region detection step 112 and identify the character regions.
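For illustration only, a breadth-first search over side-sharing (4-connected) neighbors can be sketched as follows. This is not Matlab's routine; the function name `label_regions` and the list-of-lists mask representation are assumptions made for the example.

```python
from collections import deque

def label_regions(mask):
    """Group truthy pixels of a 2-D mask into 4-connected regions of (x, y) tuples."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search from an unvisited text pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            region = []
            while queue:
                x, y = queue.popleft()
                region.append((x, y))
                # Only neighbours that share a side count as connected.
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            regions.append(region)
    return regions
```

Pixels that touch only at a corner end up in different regions, which matches the shared-side definition of connectivity given above.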

1.3 Text line grouping

The text line grouping step 114 groups the character regions in the image into text lines. The text direction is estimated from local projection profiles of the binary image and from the text directions already produced during grouping. Priority is given to groups whose characters are collinear, and groups are allowed to re-form when a better possibility is found. In other words, characters can be grouped into text lines by a guess-and-check algorithm that groups regions based on proximity and overrides earlier groups based on linearity. For each text line, an initial estimate of the local orientation is found by fitting a rough polynomial through the character centroids. The polynomial fit preferably favors speed over accuracy, because subsequent steps need this estimate but do not need it to be very accurate. The tangent of the polynomial fit is used as the initial horizontal orientation estimate, and the initial vertical orientation is preferably assumed to be orthogonal to it.

1.4 Centroid spline calculation

In the centroid spline calculation step 116, the position of the "centroid" of each character region in a text line is calculated. In this embodiment, the centroid is the mean of the coordinates of all pixels in the character region. A spline through these centroid coordinates is then computed.
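The centroid definition above amounts to a coordinate mean. A small sketch, with the hypothetical helper names `centroid` and `line_centroids` and with regions represented as lists of (x, y) tuples:

```python
def centroid(region):
    """Mean of the pixel coordinates of one character region."""
    n = len(region)
    return (sum(x for x, _ in region) / n, sum(y for _, y in region) / n)

def line_centroids(regions):
    """Centroids of all regions in a text line, ordered by x for spline fitting."""
    return sorted(centroid(r) for r in regions)
```

Sorting by x is an assumption that suits a roughly horizontal text line; the spline itself would be fitted to the returned points by a separate routine.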

1.5 Noise removal

After the character regions have been grouped into text lines, the positions of the computed splines can be used to determine which text lines do not correspond to real text. Such text lines are groups of character regions made up of extraneous pixels, for example from background noise outside the page boundary. In this embodiment, the noise removal step 118 removes noise on a photo/column basis.

Because text is grouped into paragraphs, the regions corresponding to paragraphs can be identified. A spline representing a text line that does not intersect any paragraph region can then be treated as noise rather than a real text line, and should be removed.

To identify the regions corresponding to paragraphs, it can be assumed that, within a paragraph, each text line is parallel to the text line immediately above or below it, and that these text lines have roughly the same shape and size. It can additionally be assumed that the vertical distance between text lines is constant.

Polygonal regions containing paragraphs can therefore be identified using dilation and erosion filters. A dilation filter expands the boundary of a pixel region, while an erosion filter shrinks it. These filters use structuring elements to define precisely how the filter affects the boundary of a region. A circle can be used as the structuring element, in which case regions are expanded or shrunk by the radius of the circle.
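As a sketch of what the two filters do with a circular structuring element, the example below implements binary dilation and erosion directly. The function names and the convention that pixels outside the image count as background are assumptions for illustration.

```python
def disk_offsets(radius):
    """Offsets (dx, dy) covered by a disk-shaped structuring element."""
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dx * dx + dy * dy <= radius * radius]

def dilate(mask, radius):
    """Expand every region of a boolean mask by the radius of the disk."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dx, dy in disk_offsets(radius):
                    if 0 <= x + dx < w and 0 <= y + dy < h:
                        out[y + dy][x + dx] = True
    return out

def erode(mask, radius):
    """Shrink every region: a pixel survives only if the whole disk fits inside."""
    h, w = len(mask), len(mask[0])
    offsets = disk_offsets(radius)
    return [[all(0 <= x + dx < w and 0 <= y + dy < h and mask[y + dy][x + dx]
                 for dx, dy in offsets)
             for x in range(w)]
            for y in range(h)]
```

Dilating by a radius tied to the line spacing merges neighboring text lines into one blob, and a later erosion removes blobs too thin to be paragraphs.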

In this embodiment, the noise removal step 118 is preferably performed in the following order. First, the size of the structuring element is determined from the distance between text lines. Dilating by the text line distance forms regions such that each pair of adjacent text lines is contained in a single region, which effectively places each paragraph in a region. Next, an erosion filter of twice the text line distance can be applied to eliminate sparse or distant regions away from the main paragraphs. A dilation filter can then be used to ensure that the remaining regions enclose the corresponding paragraphs. Next, all regions whose area is smaller than a predetermined fraction of the area of the largest region can be discarded, to remove the remaining noise regions. In one embodiment, the predetermined fraction is one quarter. Once the regions containing paragraphs have been identified, all splines that do not intersect these regions can be removed, leaving only the splines that correspond to real text lines.

Although the removal process described above may occasionally remove valid text lines (e.g., headers and footers), the paragraphs should contain enough information about the shape of the page for further processing.

2. Shape and orientation detection

The shape and orientation detection step 104 identifies typographic features and determines the orientation of the text. The features identified are the points in the text that correspond to the tops and bottoms of text characters (endpoints) and the angles of the vertical lines in the text (vertical strokes). These features may not be present in every character; a capital O, for example, has neither a vertical stroke nor an x-height endpoint. In addition, curves are fitted to the tops and bottoms of the text lines in order to approximate the original shape of the document.

In this embodiment, five sub-steps are performed in the shape and orientation detection step 104: an endpoint detection step 120, a spline fitting step 122, a page orientation detection step 124, an outlier removal and vertical paragraph boundary determination step 126, and a vertical stroke detection step 128.

2.1 Endpoint detection

As mentioned above, the endpoints of a character are its top and bottom features, that is, the local minima or maxima of the identified character region. They tend to fall on the horizontal rulings of the text line. In this embodiment, the endpoint detection step 120 is used to find the horizontal orientation in the text document, because endpoints are well-defined features of character regions. Endpoints can be identified for each character from the thresholded character region and the centroid spline of the text line.

To find the local maxima and minima of an identified character region, an orientation must be defined for the region, with respect to which the maxima and minima are found. This orientation can be approximated by the angle of the centroid spline where it passes through the character. A large error in this approximation is tolerable, because the endpoints of a character region are robust to the orientation chosen. For endpoints at the top and bottom of a vertical stroke, a character orientation error of as much as 90° is required before the endpoint is misidentified. An endpoint at the top of a diagonal stroke can still be identified accurately if the character orientation is in error by as much as 40°. Endpoints at the top of a curved character (e.g., the text character "o") are more sensitive to orientation errors, because even a small error of a few degrees moves the endpoint to a different position on the curve. Such an error, however, does not change the height of the identified endpoint by more than a few pixels.

The approximate orientation should be known before the endpoints are found. A change of coordinates can be applied to the pixels of each region, in which the new y coordinate, y', lies along the orientation and the new x coordinate, x', is orthogonal to the y' direction. This can be done by applying a rotation matrix to the list of pixel coordinates. As a result, the new pixel coordinates are floating-point numbers rather than the original integer coordinates. The x' coordinates can be rounded to the nearest integer in order to group the pixels into columns in the rotated space.
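The change of coordinates can be sketched as below. The example assumes that `theta` is the angle, in radians, of the centroid spline's tangent measured from the image x axis, so that x' runs along the text line and y' is perpendicular to it; the function name `rotate_into_columns` is hypothetical.

```python
import math
from collections import defaultdict

def rotate_into_columns(pixels, theta):
    """Rotate (x, y) pixels by -theta and group the y' values by rounded x'."""
    c, s = math.cos(theta), math.sin(theta)
    columns = defaultdict(list)
    for x, y in pixels:
        x_rot = c * x + s * y    # coordinate along the text line
        y_rot = -s * x + c * y   # coordinate along the character orientation
        columns[round(x_rot)].append(y_rot)
    return dict(columns)
```

Within each column the extreme y' values are then candidates for the endpoints of the region.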

To find the global extrema of a character region, the pixels with the largest or smallest y' coordinate are identified. A significant proportion of the global extrema fall on the cap-height line 270 shown in FIG. 10, so that it is difficult to distinguish any single ruling accurately if only global extrema are considered. Finding the local extrema of the character regions generally gives better results: most of the local maxima lie on the x-height ruling, which makes that ruling easy to find.

To separate top endpoints from bottom endpoints, the character region can first be split in two along the centroid spline. Only points above the centroid spline can be local maxima lying on the x-height ruling, and only points below it can be local minima lying on the baseline ruling. In each half, the local extrema are identified by an iterative process that selects the current global extremum and removes nearby pixels, as described in more detail in the next paragraph.

Starting from an identified endpoint, the iterative process finds the highest pixel in the two adjacent pixel columns that is no higher than the endpoint itself, and then deletes everything else in the endpoint's column. It then iterates over the pixels in the adjacent column, using the top of that column as the next endpoint for removal. In this way, the pixels of the character that lie along the character orientation direction can be removed, while other local extrema are preserved. The process is then repeated on the smaller set of pixels, using the new global extremum as the new endpoint.

2.2 Spline fitting

In the spline fitting step 122, splines are fitted to the tops and bottoms of the text lines. Once the endpoints described in the previous section have been obtained, they can be filtered and splines can be fitted to them. The splines model the baseline 264 and x-height 262 rulings of each text line and indicate the local curl of the document.

Splines can be used to approximate data smoothly, in a manner similar to high-degree polynomials, while avoiding the problems associated with polynomials, such as Runge's phenomenon. (See Chris Maes, "Runge's Phenomenon," http://demonstrations.wolfram.com/RungesPhenomenon, 2007, which is hereby incorporated by reference.) In this embodiment, a spline is a piecewise cubic polynomial with continuous derivatives at the coordinates where the polynomial segments meet. In this embodiment, if a smaller fitting error is desired, the number of polynomial segments is increased rather than the degree of the polynomials.

In this embodiment, approximating splines are used, which pass near the endpoints rather than through them.

An example of a spline is the linear spline (order two, i.e., degree one). In a linear spline, straight line segments are used to approximate the data. Such a linear spline lacks smoothness, however, because the slope is discontinuous where the segments join. Splines of higher degree fix this problem by enforcing continuous derivatives. A cubic spline S(x) of degree 3 with n segments can be represented by a set of polynomials {S_j(x)} defined on n consecutive intervals I_j:

$$S_j(x) = a_{0,j} + a_{1,j}\,x + a_{2,j}\,x^2 + a_{3,j}\,x^3, \qquad x \in I_j,\quad j = 1, \ldots, n,$$

where the a_{i,j} are coefficients chosen to ensure that the spline has continuous derivatives across the intervals.

In this embodiment, spline fitting addresses both speed and accuracy by performing the processing described below. First, the orientation of the document is identified, using the knowledge that, when the text uses the Latin character set, outliers occur mostly in the upper half of a text line. Knowing the orientation makes it possible to fit splines to the bottom and the top of a text line using different algorithms.

In this embodiment, a median filter is applied to the bottom endpoints to reduce the influence of outliers. A small window is used for the filter, because there are fewer outliers in the lower half of a text line and, in English text, those outliers do not tend to cluster together. The spline fitted to this filtered data set is called the bottom spline. Next, the top endpoints are filtered using their distance from the bottom spline and a median filter with a large window size. This reduces the influence of the large number of outliers at the top of the text line and ensures that the top and bottom splines are locally parallel.

As described above, the top and bottom endpoints are filtered with median filters before the splines are fitted.

Regarding the filtering of the bottom endpoints: in this embodiment, the bottom endpoints are filtered using a median filter with a small window size w, here set to 3. The points are sorted by their x coordinates. The y coordinate of each bottom endpoint is then replaced by the median of the y coordinates of its neighbors. For most points there are 2w+1 neighbors, including the point itself, found by taking w points to the left of the endpoint and w points to the right of it in the sorted list. The first and last endpoints are discarded, because they have no neighbors on one side. Any other endpoint whose distance from either end of the list is less than the window size has its window size reduced to that distance. This ensures that, at any given endpoint, the median is always computed from the same number of points on the left and on the right. Choosing 2w+1 points (an odd number) has the further benefit that the median of the y coordinate values is always an integer.
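The bottom-endpoint filter described above can be sketched as follows. The function name `filter_bottom_endpoints` and the representation of endpoints as (x, y) tuples are assumptions; the medians are taken over the original, unfiltered y values.

```python
from statistics import median

def filter_bottom_endpoints(points, w=3):
    """Median-filter the y values of endpoints sorted by x, shrinking the window near the ends."""
    pts = sorted(points)
    n = len(pts)
    filtered = []
    # The first and last endpoints are dropped: they lack a neighbour on one side.
    for i in range(1, n - 1):
        half = min(w, i, n - 1 - i)  # same number of points to the left and right
        window = [y for _, y in pts[i - half:i + half + 1]]
        filtered.append((pts[i][0], median(window)))
    return filtered
```

Because each window holds an odd number of values, the median is always one of the original y coordinates.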

Regarding the filtering of the top endpoints: in this embodiment, a different method is used from that for the bottom endpoints, because English text contains more outliers in the top endpoint data. Consider the distance between the y coordinate of each top endpoint and the bottom spline at the corresponding x coordinate. Because the bottom spline is usually reliable, these distances should be locally constant for non-outlier data in a large neighborhood. To remove outliers, therefore, a median filter with a large window size is applied to these distances. The y coordinate of each top endpoint is replaced by the sum of the median distance at that point and the y value of the bottom spline at the corresponding x coordinate.
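A sketch of the top-endpoint filter is given below. It assumes the bottom spline is available as a callable `bottom_spline(x)`, that `w = 10` gives the 21-point window, and that the window shrinks symmetrically near the ends of the list; the text does not state how the ends are handled for the top endpoints, so that part is a guess.

```python
from statistics import median

def filter_top_endpoints(top_points, bottom_spline, w=10):
    """Replace each top y by bottom_spline(x) plus the local median distance to it."""
    pts = sorted(top_points)
    n = len(pts)
    # Signed distance from the bottom spline to each top endpoint.
    dist = [y - bottom_spline(x) for x, y in pts]
    filtered = []
    for i, (x, _) in enumerate(pts):
        half = min(w, i, n - 1 - i)  # assumed border handling
        local = median(dist[i - half:i + half + 1])
        filtered.append((x, bottom_spline(x) + local))
    return filtered
```

Filtering the distances rather than the raw y values keeps the filtered top points locally parallel to the bottom spline.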

Once both the top and the bottom endpoints have been filtered, two splines can be fitted to each text line. In this embodiment, the bottom spline is fitted to the filtered bottom endpoint data set and the top spline is fitted to the filtered top endpoint data set. The same approximating spline is used for both. All points can be weighted equally, the splines can be cubic (order 4), and the number of spline segments is determined by the number of character regions in the text line. In general, each character region corresponds to one text character, although in some cases several text characters or a whole word may blur together into one region. In one embodiment, the number of spline segments is set to the ceiling of the number of character regions divided by 5, with a required minimum of two segments.
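The segment-count rule is simple arithmetic; a one-function sketch, with the hypothetical name `spline_segment_count`, makes the ceiling and the minimum explicit:

```python
import math

def spline_segment_count(num_regions, regions_per_segment=5, minimum=2):
    """Ceiling of num_regions / regions_per_segment, but never fewer than `minimum`."""
    return max(minimum, math.ceil(num_regions / regions_per_segment))
```

For example, a text line with 11 character regions would be given 3 segments, and a line with 5 regions would be given the minimum of 2.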

The splines for each text line are found independently of the other text lines. Information from neighboring text lines can, however, be used to make the splines more consistent with one another. This information can also be used to find errors in the text lines, such as a found line that spans several text lines.

The top spline can be ignored when determining the local curl of the document, because the data from the bottom spline are usually sufficient to dewarp the document accurately. The top spline is less reliable because a text line may begin or end with several consecutive capital text characters, which contribute a large number of endpoints above the x-height line 262 that the median filter does not remove as outliers. The spline then bends upward incorrectly to fit the tops of the capital text characters. Computing the top spline is still preferred, however, because it gives other useful information about the height of the text line.

2.3 Page orientation determination

A document has four possible orientations: east (0°), north (90°), west (180°), or south (270°). This is the general direction in which an arrow drawn pointing up in the original document points in the image. The number of horizontal splines is compared with the number of vertical splines to determine whether the orientation is in the north/south or the east/west class. Because the top and bottom splines are treated differently, it is necessary to distinguish north from south, or east from west, in order to know which half of a text line is the upper half. This can be done using the observation that, in English and other languages that use the Latin character set, the upper half of a text line has more outliers than the lower half, because of capital text characters, digits, and punctuation, and because more characters have ascenders than descenders.

To distinguish the top of the document from the bottom, therefore, in this embodiment a representative sample of text lines whose lengths are close to the median length of all text lines is selected. For each text line in the sample, the top is found by checking which side has more outliers. This can be done by applying the bottom spline fitting algorithm to both the top and the bottom endpoint sets and measuring the errors of the two fits. In one embodiment, the orientation is decided once the number of text lines that yield the same orientation is at least 5% of all text lines in the document and exceeds the number of text lines that yield the alternative orientation by at least two. This ensures that the orientation detection is accurate 99% of the time.
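The stopping rule can be sketched as a running vote. The labels "up" and "down", the function name `vote_orientation`, and the convention that any other value marks an inconclusive line are assumptions for illustration; integer arithmetic is used for the 5% test to avoid rounding issues.

```python
def vote_orientation(votes, total_lines, min_percent=5, lead=2):
    """Return 'up' or 'down' once one side has enough votes and leads by `lead`."""
    counts = {"up": 0, "down": 0}
    for vote in votes:
        if vote not in counts:
            continue  # this text line was inconclusive
        counts[vote] += 1
        other = "down" if vote == "up" else "up"
        enough = counts[vote] * 100 >= min_percent * total_lines
        if enough and counts[vote] - counts[other] >= lead:
            return vote
    return None  # no decision from the sampled lines
```

Lines are consumed one at a time, so the expensive per-line test stops as soon as the criterion is met.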

Regarding text line selection: a typical document contains 100 to 200 text lines. Ideally, therefore, only a small sample of them is used in the orientation computation step, which is significantly slower than regular spline fitting. Typically 5 to 10 text lines are needed to determine the orientation conclusively, but this number can vary because of the "win by two" criterion. In this embodiment, to reduce the number of errors caused by noise, the text lines are first ranked by length. Text lines that are too short or too long are more likely to be noise, and long text lines tend to give more accurate results than short ones. The mean and median lengths of all text lines are computed, and the larger of the two is taken as the optimal line length. All text lines are then sorted by the difference between their length and the optimal line length, so that text lines of reasonable length are considered before outliers.
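The ranking of text lines can be sketched as below, assuming the line lengths are available as a list of numbers; `order_lines_for_sampling` is a hypothetical name, and the function returns indices into that list.

```python
from statistics import mean, median

def order_lines_for_sampling(lengths):
    """Indices of text lines, ordered by closeness to the optimal line length."""
    optimal = max(mean(lengths), median(lengths))
    return sorted(range(len(lengths)), key=lambda i: abs(lengths[i] - optimal))
```

Taking the larger of the mean and the median biases the choice toward longer lines, in keeping with the remark that long lines tend to give more accurate results.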

Regarding the error measure: after splines have been fitted to the top and bottom of each text line, the errors of the two fits can be compared. The error of a fit is computed from the error at each endpoint, which is the difference between the y coordinate of the endpoint and the value of the spline function at the corresponding x coordinate. These point-wise errors can be summed and scaled by the number of endpoints to give the fit error.
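A minimal sketch of the error measure follows. It assumes the spline is a callable and that the point-wise difference is taken in absolute value, which the text does not state explicitly; `fit_error` is a hypothetical name.

```python
def fit_error(points, spline):
    """Mean absolute vertical distance between endpoints and the fitted spline."""
    return sum(abs(y - spline(x)) for x, y in points) / len(points)
```

The side of the text line whose fit has the larger error would then be taken as the top, provided the difference is large enough to be trusted.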

Because the assumption that the top spline has more outliers rests on the assumption that the characters are Latin letters, the method may need to be modified for other character sets. A threshold is therefore set on how large the difference between the fit errors must be before the orientation of a text line is inferred. This threshold ensures that no incorrect assumption is made about the text orientation when the orientation cannot be determined reliably. If the threshold is not met, the text is assumed to be right side up or rotated 90° clockwise. Once the orientation has been determined, the dewarping step can rotate the image correctly.

The parameters chosen for the implementation of this embodiment are listed below: (1) The window size of the median filter for the bottom spline is set to 7 (that is, 2w+1 with w = 3). This value is chosen because roughly two endpoints can be found in each text character, so the window covers one text character to the right of the endpoint and one to its left. (2) The window size of the median filter for the top spline is set to 21. This value is chosen to be much larger than the window size for the bottom spline, so that the top endpoints are filtered more strictly. (3) The number of spline segments per line is set to the ceiling of the number of character regions divided by 5, with at least two spline segments required per line. (4) The minimum number of regions in a valid text line is set to 5, to ensure that there are enough data points to define the splines.

2.4 Outlier removal and vertical paragraph boundary determination

The outlier removal and vertical paragraph boundary determination step 126 will now be described. At this point, the connected text regions have been identified and grouped into candidate text lines. For each candidate text line, the centroid of each connected pixel region is computed, and an approximate orientation of the text line is then calculated. Text lines whose orientation differs greatly from that of most other text lines are discarded, as are text lines that are much shorter than the others. In one embodiment, Matlab's "clustercentroids" function is used to implement the outlier removal process.

After erroneous text lines have been eliminated, the starting point and ending point of each text line can be collected. A Hough transform can be used to determine whether the starting points of the text lines are aligned. If they are, the line describing the left edge of the paragraph has been found. Similarly, if the ending points of the text lines are aligned, the paragraph is right-justified and the right edge of the paragraph has been found. If these paragraph boundaries are found, they can be used in the final grid building step 132 to supplement the vertical stroke information (which is collected later in the algorithm). In the final grid building step 132, this paragraph boundary information is given more weight than the vertical stroke information.

2.5 Vertical stroke detection

In this embodiment, the vertical stroke detection step 128 is performed by first intersecting the text pixels with the centroid spline of each text line. At each intersection, a roughly vertical block of pixels is obtained by scanning along the local vertical direction. The local vertical direction of each block can be estimated using a least-squares linear fit. The resulting set of pixels is then filtered using a fitted quadratic polynomial, which favors linearity and consistency of orientation among the detected strokes. Outliers with respect to the fitted polynomial can be removed from consideration. In one embodiment, outliers are removed using a hand-tuned threshold of 10°. The result can then be smoothed using an averaging filter.
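The least-squares estimate of a block's local vertical direction can be sketched as follows. Because the strokes are nearly vertical, this sketch fits x as a linear function of y, which keeps the slope finite; that parameterization and the name `stroke_angle` are assumptions made for illustration, not details from the embodiment.

```python
import math

def stroke_angle(pixels):
    """Least-squares fit x = m*y + c over the (x, y) pixels of one block;
    returns the tilt of the stroke away from vertical, in degrees."""
    n = len(pixels)
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    s_yy = sum((y - mean_y) ** 2 for _, y in pixels)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pixels)
    if s_yy == 0:
        raise ValueError("pixels have no vertical extent")
    return math.degrees(math.atan(s_xy / s_yy))

upright = stroke_angle([(10, 0), (10, 1), (10, 2), (10, 3)])
slanted = stroke_angle([(0, 0), (1, 1), (2, 2), (3, 3)])
```

A stroke whose angle departs from the fitted trend by more than the 10° threshold mentioned above would then be a candidate for removal.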

Optionally, outliers can also be used to find vertical strokes, especially as camera resolution increases. It has been shown that, for larger pixel sets, it is easier to analyze the boundary than the interior. This is because larger pixel sets have more clearly defined boundaries, and the size of the interior grows faster than the size of the boundary.

3. Image transformation

In this embodiment, two sub-steps are performed in the image transformation step 106. These sub-steps are the interpolator creation step 130 and the grid building and dewarping step 132.

In the grid building and dewarping step 132, the extracted features are used as the basis for identifying the warping of the document. A vector field is generated that represents the horizontal and vertical stretch required at each point of the document image. Alternatively, the grid building and dewarping step 132 may be replaced by an optimization-based dewarping step 134.

3.1 Interpolator creation

In this interpolator creation step 130, interpolators are created for the vertical information, from the vertical strokes, and for the horizontal information, from the top and bottom splines. In this embodiment, dewarping of the imaged document is performed by applying a two-dimensional warp to the imaged document. The warp is a local stretching of the imaged document whose purpose is to produce an image that looks like a flat document. How much the imaged document should be stretched can be determined locally from locally extracted features. These features may be 2D vectors in the imaged document, each belonging to one of two sets. The vectors of the first set are parallel to the direction of the text in the document, and the vectors of the second set are parallel to the direction of the vertical strokes in the document text. In the original image of a warped document, the vectors in these sets may point in any direction. The goal is to stretch the image so that the two sets become orthogonal, with all vectors in each set pointing in the same direction. Vectors parallel to the text lines should all point horizontally, and vectors parallel to the vertical strokes should all point vertically.

The parallel vectors can be extracted by computing the unit tangent vectors of the text line splines at regularly spaced intervals. The vertical strokes of each text line can be extracted by finding a set of parallel lines that correspond to dark lines in the text and are roughly orthogonal to the centroid spline of that text line. Each vertical stroke can be represented as a unit vector at the position, and in the direction, of the stroke. The angle of each vertical stroke can be estimated using least-squares linear regression. Here, the parallel vectors are called tangent vectors and the vertical stroke vectors are called normal vectors. Note that in the dewarped document the normal vectors are orthogonal to the tangent vectors. In the original image of the document, however, perspective distortion and page curvature make the angle between these vectors larger or smaller than 90°.

The basic interpolation process is described below. The first step is to interpolate the tangent and normal vectors across the entire document. This is essential for determining how to dewarp parts of the image that contain no text or whose text provides no useful information. A Java class can be used to store the known unit vectors (x, y, θ). Once an object of this class has collected all of the known vectors, the angle θ of the unknown vector at a given position (x, y) can be obtained by taking a weighted average of the nearby known vectors in the local neighborhood of (x, y). This can be complicated because angles wrap around: an angle of π − ε is very close to an angle of −π + ε (where ε is some very small number), so ordinary interpolation techniques may not work well. The angle is computed as a weighted average of the known vectors, where the weight of each known vector v is computed using the following function.

$w(d) = \frac{1}{1 + e^{10 d / r - 5}}$

where r is the radius of the neighborhood and d is the distance between v and (x, y).

Note that d < r, so w(d) becomes very small as d approaches r and very close to 1 as d approaches 0. In this embodiment, the constants in the equation (10 and 5) serve to normalize the weight values between 0 and 1 in a smooth manner. These values can be changed in order to change the result. The parameter r determines the radius of influence of a vector. The parameter r can be set arbitrarily to 100 pixels. Other values can also be used, however, because if there are no vectors in the neighborhood the search continues beyond it and assigns a very low weight to any vector it finds. The parameter r can be chosen arbitrarily because the underlying data structure is a kd-tree, which supports fast nearest-neighbor searches. For more information on kd-trees, see Jon Louis Bentley, "K-d Trees for Semidynamic Point Sets", Proceedings of the Sixth Annual Symposium on Computational Geometry, 1990, pp. 187-197.
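The weight function, with the constants 10 and 5 and the default radius of 100 pixels given above, can be sketched as follows. The way the weighted average is taken here, by summing weighted unit vectors and reading the angle back with `atan2`, is one common way to sidestep the wrap-around problem and is an assumption of this sketch, not the embodiment's stated method. The sketch also scans all known vectors directly instead of using a kd-tree, and the overflow guard for very large distances is an implementation detail added for the example.

```python
import math

def weight(d, r=100.0):
    """w(d) = 1 / (1 + exp(10*d/r - 5)), with r the neighborhood radius."""
    z = 10.0 * d / r - 5.0
    if z > 700.0:          # exp() would overflow; the weight is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def interpolate_angle(query, known, r=100.0):
    """Weighted average angle at query = (x, y) from known = [(x, y, theta), ...].

    Averages weighted unit vectors (an assumed technique) so that angles near
    +pi and -pi reinforce each other instead of cancelling.
    """
    qx, qy = query
    sum_cos = sum_sin = 0.0
    for x, y, theta in known:
        w = weight(math.hypot(x - qx, y - qy), r)
        sum_cos += w * math.cos(theta)
        sum_sin += w * math.sin(theta)
    return math.atan2(sum_sin, sum_cos)

# Two neighbors on either side of the +pi / -pi seam.
seam = interpolate_angle((0.0, 0.0),
                         [(10.0, 0.0, math.pi - 0.01), (-10.0, 0.0, -math.pi + 0.01)])
```

A plain arithmetic mean of the two angles in the example would be 0, pointing in the opposite direction; the vector average stays at the seam.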

The basic interpolation process described above works reasonably well in regions of the document where the extracted features are dense. However, when two dense regions are separated by a sparse region, an abrupt change rather than a smooth interpolation can appear across the sparse region. Fully smooth interpolation is not desirable, because it gives incorrect results when one document partially occludes another. On the other hand, discontinuities are also undesirable when all of the regions in question are part of the same document.

Using an exponential function as the basis of the weight function therefore allows this behavior to be partially achieved. Under normal conditions, it limits the influence of a vector to the default radius of the search neighborhood.

The interpolation process also implements basic outlier removal. Once the interpolation object has stored all of the known vectors, each vector in turn is removed from the interpolation object and the object is queried for the interpolated value at that point. If the angle between the actual vector and the interpolated vector exceeds a threshold, the vector is not added back to the interpolation object. The threshold can be 1°, which ensures that every vector used for dewarping is consistent with the vectors around it. Most of the errors in the vectors caused by incorrect feature extraction are removed in this way. This method may smooth too much, because it suppresses abrupt changes in the vectors.

A preferred embodiment of the interpolation is described below. This interpolator creation step 130 is based on fitting a two-dimensional surface to a vector field. Starting with a polynomial function of degree n, the least-squares error method is used to fit a surface to the horizontal and vertical vector fields. Because of Runge's phenomenon, these functions may oscillate at the edges of the image. This problem can be solved by replacing the high-degree polynomial with a two-dimensional cubic polynomial spline.

With regard to vertical interpolation, once some vertical strokes representing tangents to the vertical curvature of the document have been found, this information can be interpolated across the image. In this embodiment, vertical interpolation is performed by constructing a smooth continuous function that best approximates the vertical data.

With regard to angles, the vertical stroke data can be represented as the angle of each vertical stroke paired with its coordinates. This representation can be awkward because basic operations on angles (for example, finding the mean) involve modular arithmetic. This problem can be solved by assuming that all angles lie within plus or minus 90° of the mean horizontal angle and the mean vertical angle of the document (for the tangent and vertical vector fields, respectively). All angles are shifted into these ranges, and the surface is assumed not to contain any angle outside them. This assumption holds for any document that is not bent by more than 90° in any direction.

Once the angles have been constrained to the appropriate range, they can be treated as regular data, without any concern for modular arithmetic.
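The shift into a range of plus or minus 90° around a mean angle can be sketched as follows. The sketch assumes that a stroke direction is undirected, so that θ and θ + 180° describe the same stroke and the shift is by whole multiples of π; the function name `shift_into_range` is hypothetical.

```python
import math

def shift_into_range(theta, center):
    """Add a multiple of pi so that theta lies in [center - pi/2, center + pi/2).

    Assumes theta and theta + pi describe the same undirected stroke.
    """
    return center + (theta - center + math.pi / 2) % math.pi - math.pi / 2

mean_vertical = math.radians(90)
raw = [math.radians(a) for a in (85, 265, 100, -80)]
shifted = [shift_into_range(t, mean_vertical) for t in raw]

# After shifting, an ordinary arithmetic mean is meaningful.
average = sum(shifted) / len(shifted)
```

In the example, 265° becomes 85° and −80° becomes 100°, so the four strokes average to 92.5° instead of the meaningless 92.5° − 45° that a naive mean of the raw values would give.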

With regard to horizontal interpolation, the splines fitted to the tops and bottoms of the text lines follow the horizontal curvature of the document. The tangent angle can be extracted at each pixel of a spline, and a smooth continuous function that best approximates this horizontal tangent data can be constructed. As with vertical interpolation, the angles are first shifted into the appropriate range and then treated as regular data. This range is obtained by adding 90° to the range of vertical angles.

The next step is to find the interpolating function that best approximates this data. A notable property of the data in this embodiment is that it is not defined on a grid but is scattered across the image. First, a two-dimensional high-degree polynomial can be used as the interpolating function. Thin plate splines can then be considered as an alternative interpolation technique that handles non-gridded data better.

With regard to 2D polynomials, the aim is to fit a polynomial of degree n to the data using the least-squares method. An over-determined linear system of equations is set up in order to find the coefficients of the polynomial. The polynomial p(x, y) is a linear combination of the monomials x^k y^l with 0 ≤ k, l ≤ n, with unknown coefficients a_j. At each data point with coordinates (x_i, y_i) and angle θ_i, the equation p(x_i, y_i) = θ_i is obtained, in which the coefficients a_j are unknown. Repeating this for each of the N data points gives a linear system of N equations in (n+1)² unknowns. Values of n = 10 and n = 30 were found to be sufficient for the vertical and horizontal data, respectively. Approximately N = 10000 data points can be expected, so the system is over-determined. In this embodiment, the backslash operator in Matlab is used to solve the over-determined system, because the least-squares error method has numerical instability problems for n > 20.

The aim here is to find the constants of a polynomial of degree n that minimize the sum of the squared errors over all data points. The error function can be written as E = ∑_i (θ_i − p(x_i, y_i))², where the sum runs over all data points (x_i, y_i), each of which has an associated angle θ_i, and p is the unknown polynomial function of degree n. If the function has constants a_1, ..., a_m with m = (n+1)², the error is to be minimized with respect to those constants. Setting ∂E/∂a_i = 0 for all a_i therefore gives a system of m equations in m unknowns, which is also a linear system. What has to be solved is thus Mx = b for the unknown vector x containing the coefficients a_j, where M is an m × m matrix and b is a vector of length m. The matrix M is symmetric positive definite, so the system can be solved by Cholesky factorization, which yields the coefficients of the polynomial.
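The normal-equation route described above, forming M and b and then solving by Cholesky factorization, can be sketched in plain Python for a small degree. The function name `fit_polynomial_surface` and the monomial ordering are assumptions made for illustration. As the text notes, this approach becomes numerically unstable for large n, so the sketch is meant for small degrees only.

```python
import math

def fit_polynomial_surface(points, n):
    """Least-squares fit of p(x, y) = sum a[k, l] * x**k * y**l, 0 <= k, l <= n.

    points: list of (x, y, theta). Returns {(k, l): coefficient}.
    """
    terms = [(k, l) for k in range(n + 1) for l in range(n + 1)]
    m = len(terms)                       # (n + 1)**2 unknowns
    # Normal equations M a = b, with M = A^T A and b = A^T theta.
    M = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for x, y, theta in points:
        row = [x ** k * y ** l for k, l in terms]
        for i in range(m):
            b[i] += row[i] * theta
            for j in range(m):
                M[i][j] += row[i] * row[j]
    # Cholesky factorization M = L L^T (L lower triangular).
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    # Forward substitution L z = b, then back substitution L^T a = z.
    z = [0.0] * m
    for i in range(m):
        z[i] = (b[i] - sum(L[i][p] * z[p] for p in range(i))) / L[i][i]
    a = [0.0] * m
    for i in reversed(range(m)):
        a[i] = (z[i] - sum(L[p][i] * a[p] for p in range(i + 1, m))) / L[i][i]
    return {term: a[i] for i, term in enumerate(terms)}

# Synthetic angles sampled from a known surface on a 4 x 4 set of points.
samples = [(x, y, 0.5 + 0.1 * x - 0.2 * y + 0.05 * x * y)
           for x in range(4) for y in range(4)]
coefficients = fit_polynomial_surface(samples, 1)
```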

If the polynomial exhibits Runge's phenomenon and starts to oscillate wildly around the edges of the image, especially when the image has sparse data away from the center, this can be addressed by dividing the document into a grid and adding, in each grid cell that has no data, a data point containing the angle of the document.

Alternatively, two-dimensional cubic spline interpolation can be used in place of high-degree polynomial interpolation, because it avoids Runge's phenomenon. Matlab's 2D cubic spline functions can only be used on gridded data. Values on a grid should therefore be found such that the cubic spline generated on that grid best approximates the data.

In this embodiment, a 10×10 grid is used for vertical interpolation and a 30×30 grid for horizontal interpolation, to obtain finer resolution. A set of n² spline basis functions e_i must be generated; each e_i is the spline on the n×n grid that contains 1 in the i-th cell and 0 in all other cells. The spline on the n×n grid that contains the value a_i in the i-th cell then equals ∑_i a_i e_i. The error function for this spline is

$E = \sum_{\vec{x}} \left( \sum_{i} a_i e_i(\vec{x}) - \theta(\vec{x}) \right)^2$

where $\theta(\vec{x})$ is the angle at $\vec{x}$.

It is desired to find the coefficients a_i that minimize the error function. However, if some grid cells contain no data, the behavior of the spline in those cells may be unconstrained. Therefore, in this embodiment, a small constraint term ε(a_i − a_j)² for adjacent cells i and j is added to the error function. This makes the coefficient a_i of a grid cell i with no data points equal to the average of the coefficients a_j of the four grid cells adjacent to i. In one embodiment, ε is set slightly higher so that cells containing few data points are also constrained. The new error function can be written as:

$E = \sum_{\vec{x}} \left( \sum_{i} a_i e_i(\vec{x}) - \theta(\vec{x}) \right)^2 + \sum_{i, j\ \text{adjacent cells}} \epsilon \, (a_i - a_j)^2$

This yields an over-determined system of linear equations. In one embodiment, this system is solved using Matlab. Finally, the spline on the grid with the value a_i in the i-th cell is generated and can be used to interpolate the original data.

3.2. Grid building and dewarping

In this embodiment, the grid building and dewarping step 132 involves building a grid with the following properties. (1) All grid cells are quadrilaterals. (2) The four corners of a grid cell must be shared with all of its immediate neighbors. (3) Each grid cell is small enough that the local curvature of the document within the cell is approximately constant. (4) The sides of a grid cell must be parallel to the tangent or normal vectors. (5) Each grid cell across the warped image corresponds to a square of fixed size in the original document.

The process begins by placing an arbitrary grid cell at the center of the image. The grid cell is rotated until it satisfies the fourth criterion above. The grid can then be built outward, using known grid cells to fix two or three corner points of each grid cell to be built. The last point can be computed by querying the interpolation object for the tangent and normal vectors at that position and then stepping in those directions.

In most cases, three corner points of the grid cell to be built are already known. The two remaining sides of the grid cell therefore intersect at exactly one point, which determines the fourth corner point. When the grid cell to be built is added directly horizontally or vertically from the center cell, only two corner points are known. In that case, the process is somewhat arbitrary.
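Finding the fourth corner from three known corners amounts to intersecting two lines, one leaving each of the two adjacent known corners along the locally interpolated direction. A minimal sketch, in which the function name `intersect_lines` and the way the directions are supplied are assumptions for illustration:

```python
def intersect_lines(p, d, q, e):
    """Intersection of the line through p with direction d and the line
    through q with direction e; points and directions are (x, y) tuples."""
    det = d[0] * e[1] - d[1] * e[0]          # 2D cross product d x e
    if abs(det) < 1e-12:
        raise ValueError("directions are parallel")
    t = ((q[0] - p[0]) * e[1] - (q[1] - p[1]) * e[0]) / det
    return (p[0] + t * d[0], p[1] + t * d[1])

# Known corners of a cell: (0, 0), (10, 0) and (0, 10). The fourth corner lies
# where the side leaving (10, 0) along the normal direction meets the side
# leaving (0, 10) along the tangent direction (both directions hypothetical).
normal_at_b = (0.1, 1.0)
tangent_at_c = (1.0, 0.05)
fourth = intersect_lines((10.0, 0.0), normal_at_b, (0.0, 10.0), tangent_at_c)
```

In the two-corner case mentioned above no such intersection exists, which is why that case needs a separate, more arbitrary rule such as stepping a fixed distance.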

The grid building and dewarping step 132 can perform better if two problems associated with the grid building process are handled well. The first problem arises when it must be decided how much, and where, to stretch the text horizontally. Once the tangent vectors and vertical strokes have been correctly identified, the document can be dewarped so that the text lines are straight. However, unless the text characters are stretched horizontally by different amounts along each text line, the text may not look right. Text characters on parts of the page that are curved relative to the camera appear horizontally distorted, with a reduced width, whereas text characters on relatively flat parts of the paper appear normal. In one embodiment, when very accurate tangent and normal vectors are available, additional code that tests for and corrects this horizontal stretching can be used to address the problem.

The second problem is that the grid building process builds the grid outward from a central cell. This means that any small error in the tangents or vertical strokes propagates outward through the entire grid. Small errors early in the grid building process can cause large grid building errors, which expand or shrink grid cells abnormally. In one embodiment, building multiple grid cells can be used to address this problem.

3.3. Optimization-based dewarping

Alternatively, the optimization-based dewarping step 134 may be performed as the final dewarping transformation step 106. The optimization-based dewarping step 134 finds a mapping that determines where in the original image each pixel of the output image should be sampled from. The dewarping function computes this mapping globally, which distinguishes it from grid building.

In this embodiment, the optimization-based dewarping step 134 is performed in two stages. First, a subset of the pixels in the input image is considered, and it is determined where these pixels should map to in the output image. These pixels are called control points. The problem is formulated as an optimization problem, which specifies the properties of an ideal solution and searches the solution space for the optimal solution.

Second, once a set of control points has been obtained in the input image, smooth interpolation can be performed across them to determine where each point in the original image should map to. This determines the natural stretching of the original image from the text features. The interpolation can be implemented using thin plate splines.

To construct the optimization function, a set of points in the original image that can easily be mapped to the output image is found first. It is preferable for this set of points to be well distributed throughout the input image. In this embodiment, a fixed number of points evenly spaced along each text line is chosen.

The optimization problem can be set up to find where these points should map to in the output image. The optimization problem includes an error function that estimates the error of a candidate point mapping. This error function is also called the objective function. In one embodiment, Matlab's implementation of standard methods for minimizing the error in an optimization problem can be used to find the optimal solution.

The objective function considers several properties of the text lines in order to compute the error of a candidate point mapping. For example, in a good mapping all points on the same text line lie along a straight line, adjacent text lines are evenly spaced, and the text lines are left-justified.
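The three properties named above can be turned into a simple error function, sketched below. The embodiment does not give the actual objective, so the individual terms (variance of y within a line, variance of the gaps between consecutive lines, variance of the first x-coordinate of each line), their equal weighting, and the assumption that straightened lines are horizontal are all illustrative choices.

```python
from statistics import mean, pvariance

def mapping_error(lines):
    """Error of a candidate mapping.

    lines: candidate output positions, one non-empty list of (x, y) points per
    text line, ordered top to bottom. Returns 0 for an ideal layout.
    """
    # Straightness: points of one line should share the same y.
    straightness = sum(pvariance([y for _, y in line]) for line in lines)
    # Even spacing: gaps between consecutive line heights should be equal.
    line_y = [mean(y for _, y in line) for line in lines]
    gaps = [b - a for a, b in zip(line_y, line_y[1:])]
    spacing = pvariance(gaps) if len(gaps) > 1 else 0.0
    # Left justification: first points of all lines should share the same x.
    alignment = pvariance([line[0][0] for line in lines])
    return straightness + spacing + alignment

ideal = [[(0.0, y), (50.0, y), (100.0, y)] for y in (0.0, 10.0, 20.0)]
bent = [list(line) for line in ideal]
bent[1][1] = (50.0, 13.0)     # middle line sags in the middle

ideal_error = mapping_error(ideal)
bent_error = mapping_error(bent)
```

A general-purpose minimizer would then search over the candidate output positions for the mapping with the smallest error.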

Once the objective function has been used to determine the mapping of the control points from the output image to the input image, thin plate splines can be used to interpolate the mapping for the other pixels.

In this embodiment, the mapping of these control points is used to generate a mapping for the entire image by modeling the image transformation as a thin plate spline. Thin plate splines are a family of parametric functions that interpolate scattered data occurring in two dimensions. They are commonly used in image processing to represent non-rigid deformations. Several properties of thin plate splines make them ideal for optimization-based dewarping. Most importantly, they smoothly interpolate scattered data. Most other two-dimensional data fitting methods either do not strictly interpolate or require the data to lie on a grid.

Splines in general are parametric families of functions designed to create a smooth function that matches the data values at discrete data points by minimizing a weighted average of an error measure and a roughness measure of the function. (See Carl de Boor, "Splines Toolbox User's Guide", MathWorks, 2006, which is hereby incorporated by reference.) The error measure is the least-squares error at the data points. For scalar data occurring in R², the function can be viewed as a three-dimensional shape. One possible measure of the roughness of the function is defined by physical analogy with the bending energy of a thin sheet of metal:

$R(f) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left[ |f_{xx}|^2 + 2 |f_{xy}|^2 + |f_{yy}|^2 \right] dx \, dy$

By minimizing the sum of the roughness and error measures, the spline fits the data with the least amount of curvature.

Thin plate splines are the family of functions that solve this minimization problem in a rotation-invariant way. This family can be expressed as a sum of radial basis functions centered at the data points plus a linear term defining a plane. A radial basis function φ is a function whose value in R² is radially symmetric around the origin, so φ(x) = φ(|x|). The radial basis function used for thin plate splines is φ(r) = r² log r. A thin plate spline f(x) fitted to n control points located at {x_i} has the form:

$f(x) = ax + by + c + \sum_{i} k_i \varphi(x - x_i)$

where a, b, c, and k_i are a set of n + 3 constants.
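Evaluating a thin plate spline of this form is straightforward once the n + 3 constants are known. The sketch below assumes the commonly used basis φ(r) = r² log r, with φ(0) = 0 by continuity; the function names and sample constants are hypothetical, and solving for the constants is not shown.

```python
import math

def tps_basis(r):
    """phi(r) = r^2 * log(r), taken as 0 at r = 0 (its limit)."""
    return 0.0 if r == 0.0 else r * r * math.log(r)

def tps_evaluate(point, a, b, c, weights, centers):
    """f(x, y) = a*x + b*y + c + sum_i k_i * phi(|(x, y) - center_i|)."""
    x, y = point
    value = a * x + b * y + c
    for k, (cx, cy) in zip(weights, centers):
        value += k * tps_basis(math.hypot(x - cx, y - cy))
    return value

# Hypothetical constants for a spline with two control points.
centers = [(0.0, 0.0), (4.0, 0.0)]
weights = [0.5, -0.5]
value = tps_evaluate((1.0, 0.0), a=1.0, b=2.0, c=3.0,
                     weights=weights, centers=centers)
```

At (1, 0) the first basis term vanishes because log 1 = 0, so the value is 1 + 3 − 0.5 · 9 · log 3.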

Thin plate splines are general smoothing functions that trade off error against roughness. Strict interpolation can be recovered by letting the weight on the error measure approach 1 and the weight on the roughness measure approach 0. This is equivalent to minimizing the roughness alone while requiring zero error. The general solution to this narrower problem is also a thin plate spline. (See Serge Belongie, "Thin Plate Splines", http://mathworld.wolfram.com/ThinPlateSpline.html, 2008, which is hereby incorporated by reference.) The specific problem of finding the constant weights for a given data set reduces to a determined system of linear equations. (See Carl de Boor, "Splines Toolbox User's Guide", MathWorks, 2006, which is hereby incorporated by reference.) The reasons for using strictly interpolating thin plate splines are discussed below.

Although thin plate splines were originally designed for scalar data, they can be generalized to vector data values. By assuming that the two dimensions of the data behave independently, each coordinate can be modeled with its own independent scalar thin plate spline function. This is the approach usually taken when thin plate splines are used in image processing applications. (See Cedric A. Zala and Ian Barrodale, "Warping Aerial Photographs to Orthomaps Using Thin Plate Splines", Advances in Computational Mathematics, Vol. 11, pp. 211-227, 1999, which is hereby incorporated by reference.) By using thin plate spline interpolation for the mapping of all other points, the mapping from one two-dimensional image to another can be uniquely defined by a set of control points whose positions in both images are known. These control points are found by the optimization problem. Two thin plate splines are generated, one for the x coordinate and one for the y coordinate in the input image, and are then evaluated at each point of the output image in order to find the corresponding pixel in the input image.

Because the control points in the input and output images are of the same data type, namely points in R², thin plate splines can be used to define the transformation in either direction. In a forward mapping process, the control points in the input image serve as the data sites and the control points in the output image as the data values. Evaluating the thin plate spline at a pixel of the input image gives the position to which that pixel maps in the output image. This transformation can be problematic when applied to a discrete image matrix. In general, the output positions will be real numbers rather than integers, so the exact pixel correspondence is unclear. More importantly, if the transformation squeezes or stretches the input image, several pixels may map to the same point, or regions of the output image may fall between the pixels mapped from the original.

In this embodiment, inverse mapping is used instead of forward mapping to avoid the problem of undefined pixels in the output image. In the inverse mapping process, the control points in the output image are the data sites and the control points in the input image are the data values. Evaluating the thin-plate spline at a pixel position in the output image returns the position in the input image from which that pixel is mapped. A non-integer answer can be interpreted as a distance-weighted average of the four surrounding integer points. Because every pixel in the output image matrix is unambiguously defined by one thin-plate spline evaluation, producing the output image is straightforward once the spline functions have been obtained.
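A minimal sketch of inverse mapping follows, under stated assumptions: the image is a grayscale list of rows, the names `bilinear_sample` and `inverse_warp` are hypothetical, and the "distance-weighted average of the four surrounding integer points" is read as ordinary bilinear interpolation. The `inverse_map` argument is any function taking an output pixel position and returning the (x, y) position to sample in the input image.

```python
import math

def bilinear_sample(img, x, y):
    """Distance-weighted average of the four pixels surrounding (x, y).

    img is a list of rows (img[row][col]); coordinates are clamped
    to the image so that every output pixel receives a value.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom

def inverse_warp(img, inverse_map, out_w, out_h):
    """Build the output by asking, for each output pixel, where it came from."""
    return [[bilinear_sample(img, *inverse_map(col, row))
             for col in range(out_w)]
            for row in range(out_h)]
```

Because the loop runs over output pixels, every output pixel is assigned exactly once, which is the property that forward mapping lacks.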

For a large number of control points, generating and evaluating thin-plate splines can be computationally expensive. Several methods can be used to speed up this process with minimal effect on the resulting image when applied to text documents. The first method reduces the number of control points per thin-plate spline by dividing the image into blocks and generating a separate thin-plate spline function for each block. The image can be divided recursively into blocks of varying size so as to limit the maximum number of control points in each spline. Run time is not very sensitive to this parameter. However, when the number of control points exceeds 728, Matlab uses a much slower iterative algorithm (see Carl de Boor, "Splines Toolbox User's Guide," The MathWorks, Inc., 2006, which is hereby incorporated by reference). In this embodiment, the maximum number of control points is limited to 500.
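One way to realize the recursive subdivision is a quadtree-style split, sketched here as an assumption rather than as the embodiment's actual scheme. The function name `split_blocks`, the half-open box convention, and the `min_size` guard (which stops the recursion when many control points coincide) are all hypothetical; only the limit of 500 control points comes from the text.

```python
def split_blocks(points, box, max_points=500, min_size=1.0):
    """Recursively quarter `box` until each block holds at most max_points.

    points: list of (x, y) control points.
    box: (x0, y0, x1, y1), half-open on the right and bottom edges.
    Returns a list of (box, points_in_box) leaves.
    """
    x0, y0, x1, y1 = box
    inside = [(x, y) for x, y in points if x0 <= x < x1 and y0 <= y < y1]
    too_small = (x1 - x0) <= min_size or (y1 - y0) <= min_size
    if len(inside) <= max_points or too_small:
        return [(box, inside)]
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    leaves = []
    for sub in ((x0, y0, xm, ym), (xm, y0, x1, ym),
                (x0, ym, xm, y1), (xm, ym, x1, y1)):
        leaves.extend(split_blocks(inside, sub, max_points, min_size))
    return leaves
```

Each leaf would then receive its own pair of thin-plate splines. The sketch assigns to a block only the control points that lie inside it; gathering points from a larger surrounding region is a separate step that it does not attempt.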

Each part of the image is dewarped, and the parts are joined to form the complete output image. In general, thin-plate splines used in this way are discontinuous at the block boundaries. However, the optimization model produces segments that tend to align neatly. The dewarping of each block uses control points from a region whose area is roughly twice that of the block's actual output image. Because the control points are spaced fairly evenly across a block of text, two adjacent segments share a large number of control points near their common boundary. Because the thin-plate splines are required to interpolate strictly, the two transformations agree very closely in the neighborhood of that boundary. The agreement is not exact, but the difference is usually much less than one pixel, so no visible artifacts appear in the output image.

If further testing shows that the segments do not align correctly, they can be forced to do so by using samples from one segment as control points for the other. Evaluating the thin-plate spline of one segment at regular intervals along its boundary with another segment, and using the results as control points for the second segment, makes the two functions agree exactly at the sampled points, and interpolation should make them match along the entire boundary. One potential drawback is that the result may depend on the order in which the segments are dewarped. The two segments have different dewarpings, but only one of them is changed to match the other, so the order affects the output image. Another option is to investigate standard image-mosaicking algorithms. Most of these algorithms also use thin-plate splines, so they could be implemented as part of the segment transformation rather than as a post-processing step.

The second improvement affects only the evaluation of the thin-plate splines, not their generation. Evaluating a thin-plate spline with n control points requires computing n Euclidean distances and n logarithms. Performing this calculation for every single pixel in the image is extremely slow, and it can be avoided. If the document deformation is not too severe, the thin-plate spline will have no sharp local variations. The result of evaluating the thin-plate spline is a grid of ordered pairs indicating where in the original image each pixel should be sampled from. An accurate approximation of this grid can be obtained by evaluating the thin-plate spline only every few pixels and filling in the rest of the grid by simple linear interpolation. In practice, the transformation is simple enough that a local linear approximation is accurate over a neighborhood of several pixels. Sampling the thin-plate spline every ten pixels reduces the number of spline evaluations by two orders of magnitude, with no noticeable artifacts on normal text documents. Because ten pixels is roughly the minimum size of a recognizable character, and the feature detection step assumes that curvature occurs on a scale larger than a single character, this approximation should not adversely affect the dewarping. With the two optimizations combined, the thin-plate spline transformation of a standard-sized image can be obtained in Matlab with a run time of about one to two minutes.
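The saving follows from sampling in both directions: evaluating at every tenth pixel horizontally and vertically needs roughly 1/(10 × 10) = 1/100 of the evaluations. A minimal sketch of the coarse evaluation is given here under stated assumptions: the name `coarse_map_grid` is hypothetical, `mapping` stands for any expensive function from an output pixel to an (x, y) input position, the image is at least two pixels wide and high, and bilinear interpolation between the sampled nodes is assumed as the "simple linear interpolation".

```python
def coarse_map_grid(mapping, width, height, step=10):
    """Evaluate `mapping` only at every `step`-th pixel (plus the last
    row and column) and fill the rest of the (x, y) lookup grid by
    bilinear interpolation between the sampled nodes."""
    xs = list(range(0, width - 1, step)) + [width - 1]
    ys = list(range(0, height - 1, step)) + [height - 1]
    nodes = {(x, y): mapping(x, y) for y in ys for x in xs}
    grid = [[None] * width for _ in range(height)]
    for j in range(len(ys) - 1):
        ya, yb = ys[j], ys[j + 1]
        for i in range(len(xs) - 1):
            xa, xb = xs[i], xs[i + 1]
            p00, p10 = nodes[(xa, ya)], nodes[(xb, ya)]
            p01, p11 = nodes[(xa, yb)], nodes[(xb, yb)]
            for y in range(ya, yb + 1):
                ty = (y - ya) / (yb - ya)
                for x in range(xa, xb + 1):
                    tx = (x - xa) / (xb - xa)
                    grid[y][x] = tuple(
                        (1 - ty) * ((1 - tx) * a + tx * b)
                        + ty * ((1 - tx) * c + tx * d)
                        for a, b, c, d in zip(p00, p10, p01, p11))
    return grid
```

The returned grid holds one sampling position per output pixel and can be used in place of a direct spline evaluation at every pixel. The approximation is exact for an affine mapping and degrades only where the mapping bends noticeably within one `step`-sized cell.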

A sample image 280 that is dewarped using the optimization method is shown in FIG. 11. The control points 286 are marked with dark dots, and the sets of points 282, 288 that are to be horizontally aligned are marked with light dots. The image 280 contains the kind of document that has a high density of left- and right-justified text.

FIG. 12 shows the output 214 of the optimization-based dewarping method applied to the sample image. The text lines have been largely straightened, and the columns are aligned on the left and right. The imperfections in the alignment arise because the points being aligned lie somewhere within the first and last characters of each text line, at positions that are not necessarily consistent from line to line. The splines fitted to the column boundaries can be used to obtain a better set of points to align.

There are several alternatives to the grid building and dewarping step 132. One alternative is to apply a series of basic transformations to the entire image in order to correct various types of warping. This approach allows control over which transformations are applied, and therefore over exactly which types of warping are corrected. It is also limiting, however, because an image can be corrected only if the original deformation can be expressed as some combination of these basic transformations. For a smoother dewarping, this method can also be applied iteratively.

Another alternative is to fit splines between the text line splines across the entire page and to use these splines to sample pixels for the output image. Each spline represents one horizontal row of pixels in the output image. This approach can benefit from a global optimization across the splines so that they are consistent with one another.

Another alternative is to reconstruct the surface in 3D and flatten it using ideas such as the mass-spring system discussed by Brown and Seales. (See Michael S. Brown and W. Brent Seales, "Image Restoration of Arbitrarily Warped Documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 10, pp. 1295-1306, October 2004, which is hereby incorporated by reference.)

The methods described herein for processing captured images may be applied to any type of processing application and, without limitation, are particularly well suited to computer-based applications for processing captured images. The methods described herein may be implemented in hardware circuitry, in computer software, or in a combination of hardware circuitry and computer software, and are not limited to a specific hardware or software implementation.

FIG. 13 is a block diagram illustrating a computer system 1300 on which the embodiments of the invention described above may be implemented. Computer system 1300 includes a bus 1345 or other communication mechanism for communicating information, and a processor 1335 coupled to bus 1345 for processing information. Computer system 1300 also includes a main memory 1320, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 1345 for storing information and instructions to be executed by processor 1335. Main memory 1320 may also be used to store temporary variables or other intermediate information during execution of instructions by processor 1335. Computer system 1300 further includes a read only memory (ROM) 1325 or other static storage device coupled to bus 1345 for storing static information and instructions for processor 1335. A storage device 1330, such as a magnetic disk or optical disk, is provided and coupled to bus 1345 for storing information and instructions.

Computer system 1300 may be coupled via bus 1345 to a display 1305, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 1310, including alphanumeric and other keys, is coupled to bus 1345 for communicating information and command selections to processor 1335. Another type of user input device is a cursor control 1315, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 1335 and for controlling cursor movement on display 1305. Such an input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify positions in a plane.

The methods described herein relate to the use of computer system 1300 for processing captured images. According to one embodiment, the processing of captured images is provided by computer system 1300 in response to processor 1335 executing one or more sequences of one or more instructions contained in main memory 1320. Such instructions may be read into main memory 1320 from another computer-readable medium, such as storage device 1330. Execution of the sequences of instructions contained in main memory 1320 causes processor 1335 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the sequences of instructions contained in main memory 1320. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the embodiments described herein. Thus, the embodiments described herein are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 1335 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 1330. Volatile media include dynamic memory, such as main memory 1320. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 1345. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 1335 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send them over a telephone line using a modem. A modem local to computer system 1300 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector coupled to bus 1345 can receive the data carried in the infrared signal and place the data on bus 1345. Bus 1345 carries the data to main memory 1320, from which processor 1335 retrieves and executes the instructions. The instructions received by main memory 1320 may optionally be stored on storage device 1330 either before or after execution by processor 1335.

Computer system 1300 also includes a communication interface 1340 coupled to bus 1345. Communication interface 1340 provides two-way data communication coupling to a network link 1375 that is connected to a local network 1355. For example, communication interface 1340 may be an integrated services digital network (ISDN) card or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 1340 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1340 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 1375 typically provides data communication through one or more networks to other data services. For example, network link 1375 may provide a connection through local network 1355 to a host computer 1350 or to data equipment operated by an Internet Service Provider (ISP) 1365. ISP 1365 in turn provides data communication services through the worldwide packet data communication network commonly referred to as the "Internet" 1360. Local network 1355 and Internet 1360 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 1375 and through communication interface 1340, which carry the digital data to and from computer system 1300, are exemplary forms of carrier waves transporting the information.

Computer system 1300 can send messages and receive data, including program code, through the network(s), network link 1375, and communication interface 1340. In the Internet example, a server 1370 might transmit requested code for an application program through Internet 1360, ISP 1365, local network 1355, and communication interface 1340. In accordance with the invention, one such downloaded application provides for processing captured images as described herein.

The received code may be executed by processor 1335 as it is received, and/or stored in storage device 1330 or other non-volatile storage for later execution. In this manner, computer system 1300 may obtain application code in the form of a carrier wave.

Although the invention has been disclosed using examples, including the best mode, and the examples enable any person skilled in the art to make and use the invention, the patentable scope of the invention is defined by the claims and may include other examples that occur to those skilled in the art. Accordingly, the examples disclosed herein are to be considered non-limiting. Indeed, it is contemplated that any combination of features disclosed herein may be combined, without limitation, with any other combination of features disclosed herein.

Furthermore, although specific terminology is used for the sake of clarity, the invention is not intended to be limited to the specific terms so selected, and it is to be understood that each specific term includes all equivalents.

It should also be understood that the image processing described herein may be embodied in software or hardware and may be implemented by a computer system capable of processing captured images as described herein.

Claims

1. A method for processing a digital image of a photographed document comprising text lines, wherein the text lines comprise text characters having vertical strokes, the method comprising:

(a) binarizing the digital image using pixel-normalized thresholding to identify the pixels that form the text of the document in the digital image;

(b) detecting typographical features indicative of the orientation of the text;

(c) fitting splines to the tops and bottoms of the text lines;

(d) building a quadrilateral grid using vectors parallel to the direction of the text lines and vectors parallel to the direction of the vertical strokes;

(e) dewarping the document by stretching the image so that the vectors parallel to the text lines become orthogonal to the vectors parallel to the direction of the vertical strokes; and

(f) processing the dewarped document using optical character recognition.

2. the method for claim 1, wherein dualization is processed and is comprised that illusion removes step, that is and, if the join domain of a black pixel surpasses maximum area parameter, this illusion is removed the join domain that step abandons whole black pixel.

3. the method for claim 1, wherein dualization is processed and is comprised that illusion removes step, that is and, if the join domain of a black pixel is less than minimum area parameter, this illusion is removed the join domain that step abandons whole black pixel.

4. A method for processing a digital image of a photographed document comprising text lines, wherein the text lines comprise text characters having vertical strokes, top end points, and bottom end points, the method comprising:

(a) detecting the top end points and bottom end points of the text lines;

(b) for each text line, fitting a spline to the top end points and fitting a spline to the bottom end points;

(c) determining the page orientation of the photographed image by distinguishing the top portions and bottom portions of the text lines;

(d) computing an approximate orientation for each text line and removing outliers in the text line;

(e) finding vertical paragraph boundaries by determining whether the start points or end points of the text lines align;

(f) detecting the vertical strokes in the text characters by scanning along the local vertical direction to obtain blocks of vertical pixels at each intersection of the centroid spline of a text line with the text pixels of the text characters;

(g) building a quadrilateral grid using vectors parallel to the direction of the text lines and vectors parallel to the direction of the vertical strokes; and

(h) dewarping the document by stretching the image so that the vectors parallel to the text lines become orthogonal to the vectors parallel to the direction of the vertical strokes.

5. The method of claim 4, wherein the step of determining the page orientation of the photographed image by distinguishing the top portions and bottom portions of the text lines further comprises: selecting a representative sample of the text lines, wherein the lengths of the text lines in the sample are close to the median length of all the text lines, and detecting, for each text line in the sample, which side has more outliers.

6. A method for processing a photographed image containing an imaged document, wherein the imaged document comprises text lines and the text lines comprise text characters having vertical strokes, the method comprising:

(a) detecting typographical features indicative of the orientation of the text in the imaged document;

(b) fitting splines to the tops and bottoms of one or more text lines in the imaged document;

(c) building a quadrilateral grid using vectors parallel to the direction of the text lines and vectors parallel to the direction of the vertical strokes; and

(d) dewarping the imaged document in the photographed image by, for each pixel location in the dewarped output image, computing its corresponding location within the imaged document in the photographed image, and computing its pixel color and/or intensity using one or more pixels near that corresponding location in the imaged document.

7. The method of claim 6, wherein the corresponding location within the imaged document in the photographed image in step (d) is computed by modeling its x coordinate with one mathematical function and its y coordinate with another mathematical function.

8. The method of claim 7, wherein the two mathematical functions are generated using thin-plate spline techniques.

9. The method of claim 6, wherein control points are also generated before the correspondence is computed for each pixel location, and wherein the correspondence is computed for a subset of the pixel locations.

10. The method of claim 9, wherein the subset of pixel locations comprises one or more points located on one or more text lines.

11. The method of claim 9, wherein the subset of pixel locations comprises the left end points and right end points of one or more text lines.

12. The method of claim 6, wherein the color and/or intensity of an output pixel is computed from the four nearest pixels in the input image.

13. A method for processing a digital image of a photographed document comprising text lines, wherein the text lines comprise text characters having end points and vertical strokes, the method comprising:

(a) detecting text regions by finding sets of pixels in the digital image that correspond to text characters and creating a binary image containing only those sets of pixels, wherein the sets of pixels are grouped into character regions and the character regions are in turn grouped into text lines;

(b) detecting the shape of the photographed document in the digital image by identifying the end points and vertical strokes of the text characters;

(c) detecting the orientation of the photographed document in the digital image by distinguishing the top portions and bottom portions of the text lines; and

(d) transforming the digital image into a new digital image of the photographed document based on a grid building process, in which the identified end points and vertical strokes serve as the basis for identifying the warping of the document.

14. The method of claim 13, wherein the shape detecting step fits splines to the tops and bottoms of the text lines in order to approximate the shape of the original document.

15. The method of claim 13, wherein the step of detecting text regions further comprises:

(a1) estimating the foreground text using a standard and/or simple thresholding method;

(a2) removing these foreground pixels from the original image;

(a3) filling the holes left by the removal by interpolating from the remaining values, whereby removing the initially thresholded pixels and interpolating over the holes provides a new estimate of the background; and

(a4) thresholding based on the improved estimate of the background.

16. The method of claim 13, wherein the transforming step relies on a grid building process in which extracted features serve as the basis for identifying the warping of the document.

17. The method of claim 13, wherein the transforming step relies on an optimization problem.

18. A computer system for processing a digital image of a photographed document comprising text lines, wherein the text lines comprise text characters having vertical strokes, the computer system comprising:

means for binarizing the image using pixel-normalized thresholding to identify the pixels that form the text of the document in the image;

means for detecting typographical features indicative of the orientation of the text;

means for fitting splines to the tops and bottoms of the text lines;

means for building a quadrilateral grid using vectors parallel to the direction of the text lines and vectors parallel to the direction of the vertical strokes;

means for dewarping the document by stretching the image so that the vectors parallel to the text lines become orthogonal to the vectors parallel to the direction of the vertical strokes; and

means for processing the dewarped document using optical character recognition.

19. A computer system for processing a digital image of a photographed document comprising text lines, wherein the text lines comprise text characters having vertical strokes, the computer system comprising:

means for detecting the top end points and bottom end points of the text lines;

means for fitting, for each text line, a spline to the top end points and a spline to the bottom end points;

means for determining the page orientation of the photographed image by distinguishing the top portions and bottom portions of the text lines;

means for computing an approximate orientation for each text line and removing outliers in the text line;

means for finding vertical paragraph boundaries by determining whether the start points or end points of the text lines align;

means for detecting the vertical strokes in the text characters by scanning along the local vertical direction to obtain blocks of vertical pixels at each intersection of the centroid spline of a text line with the text pixels of the text characters;

means for building a quadrilateral grid using vectors parallel to the direction of the text lines and vectors parallel to the direction of the vertical strokes; and

means for dewarping the document by stretching the image so that the vectors parallel to the text lines become orthogonal to the vectors parallel to the direction of the vertical strokes.

20. A computer system for processing a digital image of a photographed document comprising text lines, wherein the text lines comprise text characters having vertical strokes, the computer system comprising:

means for detecting text regions by finding sets of pixels in the digital image that correspond to text characters and creating a binary image containing only those sets of pixels, wherein the sets of pixels are grouped into character regions and the character regions are in turn grouped into text lines;

means for detecting shape by identifying the end points and vertical strokes of the text characters;

means for detecting the orientation of the photographed document by distinguishing the top portions and bottom portions of the text lines; and

means for transforming the digital image into a new digital image of the photographed document based on a grid building process, in which the identified end points and vertical strokes serve as the basis for identifying the warping of the document.