
CN109344297A - A method for offline acquisition of in-version cataloging data of books in a shared book system - Google Patents

A method for offline acquisition of in-version cataloging data of books in a shared book system Download PDF

Info

Publication number
CN109344297A
CN109344297A CN201811085837.5A CN201811085837A CN109344297A CN 109344297A CN 109344297 A CN109344297 A CN 109344297A
Authority
CN
China
Prior art keywords
gray
value
pixel
title page
books title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811085837.5A
Other languages
Chinese (zh)
Inventor
蔡安
王勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201811085837.5A priority Critical patent/CN109344297A/en
Publication of CN109344297A publication Critical patent/CN109344297A/en
Withdrawn legal-status Critical Current

Landscapes

  • Character Input (AREA)

Abstract

The present invention provides a method for offline acquisition of Cataloguing In Publication (CIP) data in a shared book system. The book title-page image is first pre-processed, so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed, improving the accuracy of data acquisition. Optical character recognition is then applied to the pre-processed title-page image to obtain its text content. Finally, the book's title, author, publisher, publication year, and ISBN are parsed from that text according to the format conventions of CIP data. CIP data can thus be acquired directly from a photograph of a physical book's title page or from an e-book title-page image while the shared book system is offline, without an Internet connection.

Description

A method for offline acquisition of CIP data in a shared book system
Technical field
The invention belongs to the field of data-acquisition technology, and in particular relates to a method for offline acquisition of Cataloguing In Publication (CIP) data of books in a shared book system; more specifically, to a method for obtaining CIP data from the text in a photograph of a physical book's title page or in an e-book title-page image (hereinafter collectively called the book title-page image).
Background technique
China's national reading rate is rising year by year, and books, as carriers of knowledge and information, meet both the demand and the objective conditions for being shared. Existing methods for acquiring CIP data in shared book systems are limited to manual entry or online retrieval over the Internet, which makes data acquisition inefficient and error-prone and incurs considerable labor cost.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for offline acquisition of Cataloguing In Publication (CIP) data in a shared book system, so that CIP data can be obtained directly from a photograph of a physical book's title page or from an e-book title-page image while the system is offline, without an Internet connection. The invention achieves this by applying image processing to the book title-page image, obtaining the text on the title page through optical character recognition, and parsing that text.
The method first pre-processes the book title-page image, so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed, improving the accuracy of data acquisition. Optical character recognition is then applied to the processed image to obtain its text content. Finally, the book's title, author, publisher, publication year, and ISBN are parsed from that text according to the format conventions of CIP data.
To achieve the above object, the invention adopts the following technical scheme:
Step 1: Pre-process the book title-page image by grayscale conversion, binarization, and noise reduction, to ensure the recognition accuracy of Step 2.
Step 2: Perform optical character recognition on the pre-processed title-page image using Tesseract-OCR to obtain the text on the title page.
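Step 2 can be sketched in Python, assuming the third-party pytesseract wrapper around the Tesseract-OCR engine is installed together with the simplified-Chinese language data (chi_sim); the function name and file path are illustrative, not part of the patent.

```python
def ocr_title_page(path: str) -> str:
    """Run Tesseract-OCR on a pre-processed book title-page image."""
    from PIL import Image   # third-party: Pillow
    import pytesseract      # third-party wrapper around the Tesseract-OCR engine
    image = Image.open(path)
    # A CIP title page mixes Chinese text with Latin characters (the ISBN line),
    # so both language models are requested.
    return pytesseract.image_to_string(image, lang="chi_sim+eng")
```

The recognized string is then handed to the Step-3 parser.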
Step 3: Extract the useful information from the text obtained in Step 2, match it against the attributes of the CIP data model, and extract the CIP data.
Preferably, Step 1 is specifically:
Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let a pixel have coordinates (x, y), with red component R(x, y), green component G(x, y), and blue component B(x, y). Convert each pixel's RGB value to a gray value according to formula (2-1):
F(x, y) = 0.30R(x, y) + 0.59G(x, y) + 0.11B(x, y)   (2-1)
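Formula (2-1) can be sketched in Python as follows; the nested list of (R, G, B) tuples stands in for a decoded image, and rounding to the nearest integer is an assumption (the patent does not specify how fractional gray values are handled).

```python
def to_gray(rgb_pixels):
    """Apply F(x, y) = 0.30*R + 0.59*G + 0.11*B to every pixel."""
    return [
        [round(0.30 * r + 0.59 * g + 0.11 * b) for (r, g, b) in row]
        for row in rgb_pixels
    ]

# White stays 255, black stays 0, and pure red maps to about 0.30 * 255 = 76.
gray = to_gray([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]])
```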
Step 1.2: From the gray values obtained in Step 1.1, find the maximum and minimum gray values over all pixels of the title-page image. Let the maximum gray value be Graymax, the minimum Graymin, and the binarization threshold T. Compute the initial threshold according to formula (2-2):
T(0) = (Graymin + Graymax)/2   (2-2)
Step 1.3: Compare every pixel of the title-page image against the threshold T(0) from Step 1.2: pixels with gray value less than T(0) are classified as foreground, and pixels with gray value greater than T(0) as background. Compute the average gray value Grayf of all foreground pixels and the average gray value Grayb of all background pixels, then compute the next threshold according to formula (2-3):
T(n) = (Grayf + Grayb)/2   (n = 1, 2, 3, …)   (2-3)
Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the title-page image against the threshold T(n) from the previous step: pixels with gray value less than T(n) are classified as foreground, and pixels with gray value greater than T(n) as background. Again compute the foreground average Grayf and the background average Grayb, and compute the next threshold according to formula (2-3).
Step 1.5: Repeat Step 1.4 until T(n-1) = T(n), at which point the threshold T has reached its optimum. Then compare the gray value of every pixel in the title-page image against T: set pixels with gray value greater than T to 255 and pixels with gray value less than T to 0.
Step 1.6: For each pixel with gray value 0 in the title-page image, compare the gray values of its eight neighboring pixels. If all eight neighbors have gray value 255, the pixel is an isolated point; set its gray value to 255.
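Steps 1.2 through 1.6 can be sketched together in Python; `gray` is a nested list of gray values as produced by the grayscale step, and the sketch assumes the image contains pixels both below and above the current threshold (otherwise an average in formula (2-3) would be undefined).

```python
def binarize(gray):
    """Iterative threshold of formulas (2-2)/(2-3), then 0/255 mapping."""
    flat = [v for row in gray for v in row]
    t = (min(flat) + max(flat)) / 2                       # T(0), formula (2-2)
    while True:
        fg = [v for v in flat if v < t]                   # foreground: darker
        bg = [v for v in flat if v > t]                   # background: lighter
        nt = (sum(fg) / len(fg) + sum(bg) / len(bg)) / 2  # T(n), formula (2-3)
        if nt == t:                                       # T(n-1) == T(n): done
            break
        t = nt
    return [[255 if v > t else 0 for v in row] for row in gray]

def remove_isolated(binary):
    """Step 1.6: flip black pixels whose eight neighbours are all white."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0:
                nbrs = [binary[j][i]
                        for j in range(y - 1, y + 2)
                        for i in range(x - 1, x + 2)
                        if (j, i) != (y, x) and 0 <= j < h and 0 <= i < w]
                if all(v == 255 for v in nbrs):
                    out[y][x] = 255                       # isolated point
    return out
```

Border pixels have fewer than eight neighbours; skipping the out-of-image positions (equivalent to treating them as white) is an assumption, since the patent does not describe border handling.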
Preferably, Step 3 is specifically:
In the ideal case, the text obtained from the title-page image in Step 2 matches regular expression (3-1):
([\s\S]*)[/](.*?)[编著写].*[-]*.*?[:：](.*?)[,，](\d{4})[.\-/]\d{1,2}[\s\S]*ISBN(.*)
Regular expression (3-1)
When the title-page image is of lower clarity, the text matches regular expression (3-2):
([\s\S]*)[/,，](.*?)[编著写].*[-一_]*.*?[:：](.*?)[,，](.*?)[\s\S]*[iI1|][Ss5][B8][NM](.*)
Regular expression (3-2)
From the matched regular expression, the content of each "()" capture group is extracted as the CIP data of the shared book; from left to right these are: book title, author/editor name, publisher, publication year, and ISBN.
In regular expressions (3-1) and (3-2), the symbols have the following meanings: \s matches any whitespace character, including spaces, line breaks, and tabs; \S matches any non-whitespace character; * matches the preceding subexpression any number of times; .*? non-greedily matches any run of characters other than newline and carriage return; .* greedily matches any run of characters other than newline and carriage return; [] denotes a character set, matching any single character listed between "[" and "]"; \d matches a single digit, equivalent to [0-9]; {n} matches the preceding subexpression exactly n times, where n is a non-negative integer; and () denotes a capture group whose matched string is the cataloguing-in-publication information to be extracted. Apart from the above, the remaining symbols in regular expressions (3-1) and (3-2), such as 编, 著, 写, I, S, B, and N, match themselves literally.
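A Python sketch of the Step-3 parse. The pattern below is written in the spirit of regular expression (3-1) but is a reconstruction, and the sample CIP line is invented for illustration; a production parser would use the patent's exact expressions.

```python
import re

# Capture groups, left to right: title, author, publisher, year, ISBN.
PATTERN = re.compile(
    r"([\s\S]*)[/](.*?)著.*?[:：](.*?)[,，](\d{4})[.\-/]\d{1,2}[\s\S]*ISBN(.*)"
)

sample = "数据结构/张三著.—北京:示例出版社,2018.9\nISBN 978-7-000-00000-0"
m = PATTERN.search(sample)
if m:
    title, author, publisher, year, isbn = (g.strip() for g in m.groups())
```

The non-greedy `.*?` groups keep each field from swallowing the separators, while the greedy `[\s\S]*` before ISBN skips any intervening lines of the CIP block.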
The present invention thus provides a method for obtaining CIP data directly from a book title-page image while the shared book system is offline, without an Internet connection. This reduces the manual steps in the book-information acquisition workflow of a shared book system and saves labor cost.
Detailed description of the invention
Fig. 1 is a flow chart of offline acquisition of CIP data in a shared book system.
Fig. 2 is an example of offline acquisition of CIP data in a shared book system.
Specific embodiment
The present invention provides a method for offline acquisition of Cataloguing In Publication (CIP) data in a shared book system. As shown in Fig. 1, a specific embodiment comprises the following steps:
Step 1: Pre-process the book title-page image by grayscale conversion, binarization, and noise reduction, to ensure the recognition accuracy of Step 2. Specifically:
Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let a pixel have coordinates (x, y), with red component R(x, y), green component G(x, y), and blue component B(x, y). Convert each pixel's RGB value to a gray value according to formula (2-1):
F(x, y) = 0.30R(x, y) + 0.59G(x, y) + 0.11B(x, y)   (2-1)
Step 1.2: From the gray values obtained in Step 1.1, find the maximum and minimum gray values over all pixels of the title-page image. Let the maximum gray value be Graymax, the minimum Graymin, and the binarization threshold T. Compute the initial threshold according to formula (2-2):
T(0) = (Graymin + Graymax)/2   (2-2)
Step 1.3: Compare every pixel of the title-page image against the threshold T(0) from Step 1.2: pixels with gray value less than T(0) are classified as foreground, and pixels with gray value greater than T(0) as background. Compute the average gray value Grayf of all foreground pixels and the average gray value Grayb of all background pixels, then compute the next threshold according to formula (2-3):
T(n) = (Grayf + Grayb)/2   (n = 1, 2, 3, …)   (2-3)
Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the title-page image against the threshold T(n) from the previous step: pixels with gray value less than T(n) are classified as foreground, and pixels with gray value greater than T(n) as background. Again compute the foreground average Grayf and the background average Grayb, and compute the next threshold according to formula (2-3).
Step 1.5: Repeat Step 1.4 until T(n-1) = T(n), at which point the threshold T has reached its optimum. Then compare the gray value of every pixel in the title-page image against T: set pixels with gray value greater than T to 255 and pixels with gray value less than T to 0.
Step 1.6: For each pixel with gray value 0 in the title-page image, compare the gray values of its eight neighboring pixels. If all eight neighbors have gray value 255, the pixel is an isolated point; set its gray value to 255.
Step 2: Perform optical character recognition on the pre-processed title-page image using Tesseract-OCR to obtain the text on the title page.
Step 3: Extract the useful information from the text obtained in Step 2, match it against the attributes of the CIP data model, and extract the CIP data. In the ideal case, the text obtained from the title-page image in Step 2 matches regular expression (3-1):
([\s\S]*)[/](.*?)[编著写].*[-]*.*?[:：](.*?)[,，](\d{4})[.\-/]\d{1,2}[\s\S]*ISBN(.*)
Regular expression (3-1)
When the title-page image is of lower clarity, the text matches regular expression (3-2):
([\s\S]*)[/,，](.*?)[编著写].*[-一_]*.*?[:：](.*?)[,，](.*?)[\s\S]*[iI1|][Ss5][B8][NM](.*)
Regular expression (3-2)
From the matched regular expression, the content of each "()" capture group is extracted as the CIP data of the shared book; from left to right these are: book title, author/editor name, publisher, publication year, and ISBN.
Embodiment 1
As shown in Fig. 2, the left side of the figure is the book title-page image uploaded by a user, and the right side is the CIP data collected by the method provided by the invention.

Claims (4)

1. A method for offline acquisition of CIP data in a shared book system, characterized by comprising the following steps:
Step 1: pre-process the book title-page image, so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed;
Step 2: perform optical character recognition on the pre-processed book title-page image to obtain the text in the image;
Step 3: extract the useful information from the text obtained in Step 2, match it against the attributes of the CIP data model, and extract the CIP data.
2. The method for obtaining CIP data in a shared book system according to claim 1, characterized in that pre-processing the book title-page image in Step 1 comprises: performing grayscale conversion, binarization, and noise reduction on the book title-page image.
3. The method for obtaining CIP data in a shared book system according to claim 1, characterized in that in Step 2, optical character recognition is performed on the processed book title-page image using Tesseract-OCR.
4. The method for obtaining CIP data in a shared book system according to claim 1, characterized in that Step 1 is specifically:
Step 1.1: obtain the RGB value of each pixel in the book title-page image; let a pixel have coordinates (x, y), with red component R(x, y), green component G(x, y), and blue component B(x, y); convert each pixel's RGB value to a gray value according to formula (2-1):
F(x, y) = 0.30R(x, y) + 0.59G(x, y) + 0.11B(x, y);   (2-1)
Step 1.2: from the gray values obtained in Step 1.1, find the maximum and minimum gray values over all pixels of the title-page image; let the maximum gray value be Graymax, the minimum Graymin, and the binarization threshold T; compute the initial threshold according to formula (2-2):
T(0) = (Graymin + Graymax)/2;   (2-2)
Step 1.3: compare every pixel of the title-page image against the threshold T(0) from Step 1.2: pixels with gray value less than T(0) are classified as foreground, and pixels with gray value greater than T(0) as background; compute the average gray value Grayf of all foreground pixels and the average gray value Grayb of all background pixels, then compute the next threshold according to formula (2-3):
T(n) = (Grayf + Grayb)/2   (n = 1, 2, 3, …)   (2-3)
Step 1.4: compare T(n-1) with T(n); if T(n-1) ≠ T(n), compare every pixel of the title-page image against the threshold T(n) from the previous step: pixels with gray value less than T(n) are classified as foreground, and pixels with gray value greater than T(n) as background; again compute the foreground average Grayf and the background average Grayb, and compute the next threshold according to formula (2-3);
Step 1.5: repeat Step 1.4 until T(n-1) = T(n), at which point the threshold T has reached its optimum; then compare the gray value of every pixel in the title-page image against T, set pixels with gray value greater than T to 255, and set pixels with gray value less than T to 0;
Step 1.6: for each pixel with gray value 0 in the title-page image, compare the gray values of its eight neighboring pixels; if all eight neighbors have gray value 255, the pixel is an isolated point, and its gray value is set to 255.
CN201811085837.5A 2018-09-18 2018-09-18 A method for offline acquisition of in-version cataloging data of books in a shared book system Withdrawn CN109344297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811085837.5A CN109344297A (en) 2018-09-18 2018-09-18 A method for offline acquisition of in-version cataloging data of books in a shared book system


Publications (1)

Publication Number Publication Date
CN109344297A true CN109344297A (en) 2019-02-15

Family

ID=65305408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811085837.5A Withdrawn CN109344297A (en) 2018-09-18 2018-09-18 A method for offline acquisition of in-version cataloging data of books in a shared book system

Country Status (1)

Country Link
CN (1) CN109344297A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254676A (en) * 2021-06-24 2021-08-13 广州卓一信息科技有限公司 Digital multimedia library management method and system
CN118609140A (en) * 2024-06-28 2024-09-06 上海阿法迪智能数字科技股份有限公司 Real-time collection and input method of book in-print cataloging data and book collection and editing equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060072830A1 (en) * 2004-02-26 2006-04-06 Xerox Corporation Method for automated image indexing and retrieval
CN102289668A (en) * 2011-09-07 2011-12-21 谭洪舟 Binaryzation processing method of self-adaption word image based on pixel neighborhood feature
CN103823454A (en) * 2014-03-12 2014-05-28 黄昱俊 System and method for inquiring and locating books based on machine vision
CN104598435A (en) * 2015-01-20 2015-05-06 青岛农业大学 Method for solving prepress catalog data localization
CN105354574A (en) * 2015-12-04 2016-02-24 山东博昂信息科技有限公司 Vehicle number recognition method and device
CN106156761A (en) * 2016-08-10 2016-11-23 北京交通大学 The image form detection of facing moving terminal shooting and recognition methods



Similar Documents

Publication Publication Date Title
Burie et al. ICDAR2015 competition on smartphone document capture and OCR (SmartDoc)
CN105678293A (en) Complex image and text sequence identification method based on CNN-RNN
Yousfi et al. ALIF: A dataset for Arabic embedded text recognition in TV broadcast
CN105184238A (en) Human face recognition method and system
CN105678292A (en) Complex optical text sequence identification system based on convolution and recurrent neural network
CN105678300A (en) Complex image and text sequence identification method
CN104268541A (en) Intelligent image identification method of device nameplate and energy efficiency label
CN102750379B (en) Fast character string matching method based on filtering type
Isheawy et al. Optical character recognition (OCR) system
Suryani et al. The handwritten sundanese palm leaf manuscript dataset from 15th century
CN103995904A (en) Recognition system for image file electronic data
CN113516041A (en) Tibetan ancient book document image layout segmentation and identification method and system
Sahu et al. An efficient handwritten Devnagari character recognition system using neural network
CN108108482B (en) Method for realizing scene reality enhancement in scene conversion
JP6882362B2 (en) Systems and methods for identifying images, including identification documents
CN109344297A (en) A method for offline acquisition of in-version cataloging data of books in a shared book system
Kesiman et al. ICFHR 2018 competition on document image analysis tasks for southeast asian palm leaf manuscripts
CN106776880A (en) A kind of paper based on picture and text identification reviews system and method
CN117763175A (en) Heterogeneous knowledge-fused multi-strategy image retrieval method and system
Alamsyah et al. Autoencoder image denoising to increase optical character recognition performance in text conversion
Antony et al. A framework for recognition of handwritten South Dravidian Tulu script
Thaker et al. Structural feature extraction to recognize some of the offline isolated handwritten Gujarati characters using decision tree classifier
Adyanthaya Text recognition from images: a study
CN110852359B (en) Genealogy recognition method and system based on deep learning
Ahmed et al. Enhancing the character segmentation accuracy of bangla ocr using bpnn

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190215