
CN109344297A - A method for offline acquisition of in-version cataloging data of books in a shared book system - Google Patents

A method for offline acquisition of in-version cataloging data of books in a shared book system Download PDF

Info

Publication number
CN109344297A
CN109344297A CN201811085837.5A CN201811085837A CN109344297A CN 109344297A CN 109344297 A CN109344297 A CN 109344297A
Authority
CN
China
Prior art keywords
gray
value
pixel
title page
books title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811085837.5A
Other languages
Chinese (zh)
Inventor
蔡安
王勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201811085837.5A priority Critical patent/CN109344297A/en
Publication of CN109344297A publication Critical patent/CN109344297A/en
Withdrawn legal-status Critical Current

Landscapes

  • Character Input (AREA)

Abstract

The present invention provides a method for offline acquisition of Cataloguing In Publication (CIP) data in a shared book system. The book title-page image is first pre-processed, so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed, improving the accuracy of data acquisition. Optical character recognition is then applied to the pre-processed title-page image to obtain its text content. Finally, the book's title, author, publisher, publication year, and ISBN are parsed from that text according to the format conventions of CIP data. CIP data can thus be acquired directly from a photograph of a physical book's title page or from an e-book title-page image while the shared book system is offline, without an Internet connection.

Description

A method for offline acquisition of CIP data in a shared book system
Technical field
The invention belongs to the field of data-acquisition technology, and in particular relates to a method for offline acquisition of Cataloguing In Publication (CIP) data of books in a shared book system; more specifically, to a method for obtaining CIP data from the text in a photograph of a physical book's title page or in an e-book title-page image (hereinafter collectively called the book title-page image).
Background technique
China's national reading rate is rising year by year, and books, as carriers of knowledge and information, meet both the demand and the objective conditions for being shared. Existing methods for acquiring CIP data in shared book systems are limited to manual entry or online retrieval over the Internet, which makes data acquisition inefficient and error-prone and incurs considerable labor cost.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for offline acquisition of Cataloguing In Publication (CIP) data in a shared book system, so that CIP data can be obtained directly from a photograph of a physical book's title page or from an e-book title-page image while the system is offline, without an Internet connection. The invention achieves this by applying image processing to the book title-page image, obtaining the text on the title page through optical character recognition, and parsing that text.
The method first pre-processes the book title-page image, so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed, improving the accuracy of data acquisition. Optical character recognition is then applied to the processed image to obtain its text content. Finally, the book's title, author, publisher, publication year, and ISBN are parsed from that text according to the format conventions of CIP data.
To achieve the above object, the invention adopts the following technical scheme:
Step 1: Pre-process the book title-page image by grayscale conversion, binarization, and noise reduction, to ensure the recognition accuracy of Step 2.
Step 2: Perform optical character recognition on the pre-processed title-page image using Tesseract-OCR to obtain the text on the title page.
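Step 2 can be sketched in Python, assuming the third-party pytesseract wrapper around the Tesseract-OCR engine is installed together with the simplified-Chinese language data (chi_sim); the function name and file path are illustrative, not part of the patent.

```python
def ocr_title_page(path: str) -> str:
    """Run Tesseract-OCR on a pre-processed book title-page image."""
    from PIL import Image   # third-party: Pillow
    import pytesseract      # third-party wrapper around the Tesseract-OCR engine
    image = Image.open(path)
    # A CIP title page mixes Chinese text with Latin characters (the ISBN line),
    # so both language models are requested.
    return pytesseract.image_to_string(image, lang="chi_sim+eng")
```

The recognized string is then handed to the Step-3 parser.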
Step 3: Extract the useful information from the text obtained in Step 2, match it against the attributes of the CIP data model, and extract the CIP data.
Preferably, Step 1 is specifically:
Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let a pixel have coordinates (x, y), with red component R(x, y), green component G(x, y), and blue component B(x, y). Convert each pixel's RGB value to a gray value according to formula (2-1):
F(x, y) = 0.30R(x, y) + 0.59G(x, y) + 0.11B(x, y)   (2-1)
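Formula (2-1) can be sketched in Python as follows; the nested list of (R, G, B) tuples stands in for a decoded image, and rounding to the nearest integer is an assumption (the patent does not specify how fractional gray values are handled).

```python
def to_gray(rgb_pixels):
    """Apply F(x, y) = 0.30*R + 0.59*G + 0.11*B to every pixel."""
    return [
        [round(0.30 * r + 0.59 * g + 0.11 * b) for (r, g, b) in row]
        for row in rgb_pixels
    ]

# White stays 255, black stays 0, and pure red maps to about 0.30 * 255 = 76.
gray = to_gray([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]])
```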
Step 1.2: From the gray values obtained in Step 1.1, find the maximum and minimum gray values over all pixels of the title-page image. Let the maximum gray value be Graymax, the minimum Graymin, and the binarization threshold T. Compute the initial threshold according to formula (2-2):
T(0) = (Graymin + Graymax)/2   (2-2)
Step 1.3: Compare every pixel of the title-page image against the threshold T(0) from Step 1.2: pixels with gray value less than T(0) are classified as foreground, and pixels with gray value greater than T(0) as background. Compute the average gray value Grayf of all foreground pixels and the average gray value Grayb of all background pixels, then compute the next threshold according to formula (2-3):
T(n) = (Grayf + Grayb)/2   (n = 1, 2, 3, …)   (2-3)
Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the title-page image against the threshold T(n) from the previous step: pixels with gray value less than T(n) are classified as foreground, and pixels with gray value greater than T(n) as background. Again compute the foreground average Grayf and the background average Grayb, and compute the next threshold according to formula (2-3).
Step 1.5: Repeat Step 1.4 until T(n-1) = T(n), at which point the threshold T has reached its optimum. Then compare the gray value of every pixel in the title-page image against T: set pixels with gray value greater than T to 255 and pixels with gray value less than T to 0.
Step 1.6: For each pixel with gray value 0 in the title-page image, compare the gray values of its eight neighboring pixels. If all eight neighbors have gray value 255, the pixel is an isolated point; set its gray value to 255.
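Steps 1.2 through 1.6 can be sketched together in Python; `gray` is a nested list of gray values as produced by the grayscale step, and the sketch assumes the image contains pixels both below and above the current threshold (otherwise an average in formula (2-3) would be undefined).

```python
def binarize(gray):
    """Iterative threshold of formulas (2-2)/(2-3), then 0/255 mapping."""
    flat = [v for row in gray for v in row]
    t = (min(flat) + max(flat)) / 2                       # T(0), formula (2-2)
    while True:
        fg = [v for v in flat if v < t]                   # foreground: darker
        bg = [v for v in flat if v > t]                   # background: lighter
        nt = (sum(fg) / len(fg) + sum(bg) / len(bg)) / 2  # T(n), formula (2-3)
        if nt == t:                                       # T(n-1) == T(n): done
            break
        t = nt
    return [[255 if v > t else 0 for v in row] for row in gray]

def remove_isolated(binary):
    """Step 1.6: flip black pixels whose eight neighbours are all white."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0:
                nbrs = [binary[j][i]
                        for j in range(y - 1, y + 2)
                        for i in range(x - 1, x + 2)
                        if (j, i) != (y, x) and 0 <= j < h and 0 <= i < w]
                if all(v == 255 for v in nbrs):
                    out[y][x] = 255                       # isolated point
    return out
```

Border pixels have fewer than eight neighbours; skipping the out-of-image positions (equivalent to treating them as white) is an assumption, since the patent does not describe border handling.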
Preferably, Step 3 is specifically:
In the ideal case, the text obtained from the title-page image in Step 2 matches regular expression (3-1):
([\s\S]*)[/](.*?)[编著写].*[-]*.*?[:：](.*?)[,，](\d{4})[.\-/]\d{1,2}[\s\S]*ISBN(.*)
Regular expression (3-1)
When the title-page image is of lower clarity, the text matches regular expression (3-2):
([\s\S]*)[/,，](.*?)[编著写].*[-一_]*.*?[:：](.*?)[,，](.*?)[\s\S]*[iI1|][Ss5][B8][NM](.*)
Regular expression (3-2)
From the matched regular expression, the content of each "()" capture group is extracted as the CIP data of the shared book; from left to right these are: book title, author/editor name, publisher, publication year, and ISBN.
In regular expressions (3-1) and (3-2), the symbols have the following meanings: \s matches any whitespace character, including spaces, line breaks, and tabs; \S matches any non-whitespace character; * matches the preceding subexpression any number of times; .*? non-greedily matches any run of characters other than newline and carriage return; .* greedily matches any run of characters other than newline and carriage return; [] denotes a character set, matching any single character listed between "[" and "]"; \d matches a single digit, equivalent to [0-9]; {n} matches the preceding subexpression exactly n times, where n is a non-negative integer; and () denotes a capture group whose matched string is the cataloguing-in-publication information to be extracted. Apart from the above, the remaining symbols in regular expressions (3-1) and (3-2), such as 编, 著, 写, I, S, B, and N, match themselves literally.
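A Python sketch of the Step-3 parse. The pattern below is written in the spirit of regular expression (3-1) but is a reconstruction, and the sample CIP line is invented for illustration; a production parser would use the patent's exact expressions.

```python
import re

# Capture groups, left to right: title, author, publisher, year, ISBN.
PATTERN = re.compile(
    r"([\s\S]*)[/](.*?)著.*?[:：](.*?)[,，](\d{4})[.\-/]\d{1,2}[\s\S]*ISBN(.*)"
)

sample = "数据结构/张三著.—北京:示例出版社,2018.9\nISBN 978-7-000-00000-0"
m = PATTERN.search(sample)
if m:
    title, author, publisher, year, isbn = (g.strip() for g in m.groups())
```

The non-greedy `.*?` groups keep each field from swallowing the separators, while the greedy `[\s\S]*` before ISBN skips any intervening lines of the CIP block.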
The present invention thus provides a method for obtaining CIP data directly from a book title-page image while the shared book system is offline, without an Internet connection. This reduces the manual steps in the book-information acquisition workflow of a shared book system and saves labor cost.
Detailed description of the invention
Fig. 1 is a flow chart of offline acquisition of CIP data in a shared book system.
Fig. 2 is an example of offline acquisition of CIP data in a shared book system.
Specific embodiment
The present invention provides a method for offline acquisition of Cataloguing In Publication (CIP) data in a shared book system. As shown in Fig. 1, a specific embodiment comprises the following steps:
Step 1: Pre-process the book title-page image by grayscale conversion, binarization, and noise reduction, to ensure the recognition accuracy of Step 2. Specifically:
Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let a pixel have coordinates (x, y), with red component R(x, y), green component G(x, y), and blue component B(x, y). Convert each pixel's RGB value to a gray value according to formula (2-1):
F(x, y) = 0.30R(x, y) + 0.59G(x, y) + 0.11B(x, y)   (2-1)
Step 1.2: From the gray values obtained in Step 1.1, find the maximum and minimum gray values over all pixels of the title-page image. Let the maximum gray value be Graymax, the minimum Graymin, and the binarization threshold T. Compute the initial threshold according to formula (2-2):
T(0) = (Graymin + Graymax)/2   (2-2)
Step 1.3: Compare every pixel of the title-page image against the threshold T(0) from Step 1.2: pixels with gray value less than T(0) are classified as foreground, and pixels with gray value greater than T(0) as background. Compute the average gray value Grayf of all foreground pixels and the average gray value Grayb of all background pixels, then compute the next threshold according to formula (2-3):
T(n) = (Grayf + Grayb)/2   (n = 1, 2, 3, …)   (2-3)
Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the title-page image against the threshold T(n) from the previous step: pixels with gray value less than T(n) are classified as foreground, and pixels with gray value greater than T(n) as background. Again compute the foreground average Grayf and the background average Grayb, and compute the next threshold according to formula (2-3).
Step 1.5: Repeat Step 1.4 until T(n-1) = T(n), at which point the threshold T has reached its optimum. Then compare the gray value of every pixel in the title-page image against T: set pixels with gray value greater than T to 255 and pixels with gray value less than T to 0.
Step 1.6: For each pixel with gray value 0 in the title-page image, compare the gray values of its eight neighboring pixels. If all eight neighbors have gray value 255, the pixel is an isolated point; set its gray value to 255.
Step 2: Perform optical character recognition on the pre-processed title-page image using Tesseract-OCR to obtain the text on the title page.
Step 3: Extract the useful information from the text obtained in Step 2, match it against the attributes of the CIP data model, and extract the CIP data. In the ideal case, the text obtained from the title-page image in Step 2 matches regular expression (3-1):
([\s\S]*)[/](.*?)[编著写].*[-]*.*?[:：](.*?)[,，](\d{4})[.\-/]\d{1,2}[\s\S]*ISBN(.*)
Regular expression (3-1)
When the title-page image is of lower clarity, the text matches regular expression (3-2):
([\s\S]*)[/,，](.*?)[编著写].*[-一_]*.*?[:：](.*?)[,，](.*?)[\s\S]*[iI1|][Ss5][B8][NM](.*)
Regular expression (3-2)
From the matched regular expression, the content of each "()" capture group is extracted as the CIP data of the shared book; from left to right these are: book title, author/editor name, publisher, publication year, and ISBN.
Embodiment 1
As shown in Fig. 2, the left side of the figure is the book title-page image uploaded by a user, and the right side is the CIP data collected by the method provided by the invention.

Claims (4)

1. A method for offline acquisition of CIP data in a shared book system, characterized by comprising the following steps:
Step 1: pre-process the book title-page image, so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed;
Step 2: perform optical character recognition on the pre-processed book title-page image to obtain the text in the image;
Step 3: extract the useful information from the text obtained in Step 2, match it against the attributes of the CIP data model, and extract the CIP data.
2. The method for obtaining CIP data in a shared book system according to claim 1, characterized in that pre-processing the book title-page image in Step 1 comprises: performing grayscale conversion, binarization, and noise reduction on the book title-page image.
3. The method for obtaining CIP data in a shared book system according to claim 1, characterized in that in Step 2, optical character recognition is performed on the processed book title-page image using Tesseract-OCR.
4. The method for obtaining CIP data in a shared book system according to claim 1, characterized in that Step 1 is specifically:
Step 1.1: obtain the RGB value of each pixel in the book title-page image; let a pixel have coordinates (x, y), with red component R(x, y), green component G(x, y), and blue component B(x, y); convert each pixel's RGB value to a gray value according to formula (2-1):
F(x, y) = 0.30R(x, y) + 0.59G(x, y) + 0.11B(x, y);   (2-1)
Step 1.2: from the gray values obtained in Step 1.1, find the maximum and minimum gray values over all pixels of the title-page image; let the maximum gray value be Graymax, the minimum Graymin, and the binarization threshold T; compute the initial threshold according to formula (2-2):
T(0) = (Graymin + Graymax)/2;   (2-2)
Step 1.3: compare every pixel of the title-page image against the threshold T(0) from Step 1.2: pixels with gray value less than T(0) are classified as foreground, and pixels with gray value greater than T(0) as background; compute the average gray value Grayf of all foreground pixels and the average gray value Grayb of all background pixels, then compute the next threshold according to formula (2-3):
T(n) = (Grayf + Grayb)/2   (n = 1, 2, 3, …)   (2-3)
Step 1.4: compare T(n-1) with T(n); if T(n-1) ≠ T(n), compare every pixel of the title-page image against the threshold T(n) from the previous step: pixels with gray value less than T(n) are classified as foreground, and pixels with gray value greater than T(n) as background; again compute the foreground average Grayf and the background average Grayb, and compute the next threshold according to formula (2-3);
Step 1.5: repeat Step 1.4 until T(n-1) = T(n), at which point the threshold T has reached its optimum; then compare the gray value of every pixel in the title-page image against T, set pixels with gray value greater than T to 255, and set pixels with gray value less than T to 0;
Step 1.6: for each pixel with gray value 0 in the title-page image, compare the gray values of its eight neighboring pixels; if all eight neighbors have gray value 255, the pixel is an isolated point, and its gray value is set to 255.
CN201811085837.5A 2018-09-18 2018-09-18 A method for offline acquisition of in-version cataloging data of books in a shared book system Withdrawn CN109344297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811085837.5A CN109344297A (en) 2018-09-18 2018-09-18 A method for offline acquisition of in-version cataloging data of books in a shared book system


Publications (1)

Publication Number Publication Date
CN109344297A true CN109344297A (en) 2019-02-15

Family

ID=65305408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811085837.5A Withdrawn CN109344297A (en) 2018-09-18 2018-09-18 A method for offline acquisition of in-version cataloging data of books in a shared book system

Country Status (1)

Country Link
CN (1) CN109344297A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254676A (en) * 2021-06-24 2021-08-13 广州卓一信息科技有限公司 Digital multimedia library management method and system
CN118609140A (en) * 2024-06-28 2024-09-06 上海阿法迪智能数字科技股份有限公司 Real-time collection and input method of book in-print cataloging data and book collection and editing equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060072830A1 (en) * 2004-02-26 2006-04-06 Xerox Corporation Method for automated image indexing and retrieval
CN102289668A (en) * 2011-09-07 2011-12-21 谭洪舟 Binaryzation processing method of self-adaption word image based on pixel neighborhood feature
CN103823454A (en) * 2014-03-12 2014-05-28 黄昱俊 System and method for inquiring and locating books based on machine vision
CN104598435A (en) * 2015-01-20 2015-05-06 青岛农业大学 Method for solving prepress catalog data localization
CN105354574A (en) * 2015-12-04 2016-02-24 山东博昂信息科技有限公司 Vehicle number recognition method and device
CN106156761A (en) * 2016-08-10 2016-11-23 北京交通大学 The image form detection of facing moving terminal shooting and recognition methods



Similar Documents

Publication Publication Date Title
Burie et al. ICDAR2015 competition on smartphone document capture and OCR (SmartDoc)
CN105678293A (en) Complex image and text sequence identification method based on CNN-RNN
Yousfi et al. ALIF: A dataset for Arabic embedded text recognition in TV broadcast
CN105184238A (en) Human face recognition method and system
CN105678292A (en) Complex optical text sequence identification system based on convolution and recurrent neural network
CN105678300A (en) Complex image and text sequence identification method
CN104268541A (en) Intelligent image identification method of device nameplate and energy efficiency label
CN102750379B (en) Fast character string matching method based on filtering type
Isheawy et al. Optical character recognition (OCR) system
Suryani et al. The handwritten sundanese palm leaf manuscript dataset from 15th century
CN103995904A (en) Recognition system for image file electronic data
CN113516041A (en) Tibetan ancient book document image layout segmentation and identification method and system
Sahu et al. An efficient handwritten Devnagari character recognition system using neural network
CN108108482B (en) Method for realizing scene reality enhancement in scene conversion
JP6882362B2 (en) Systems and methods for identifying images, including identification documents
CN109344297A (en) A method for offline acquisition of in-version cataloging data of books in a shared book system
Kesiman et al. ICFHR 2018 competition on document image analysis tasks for southeast asian palm leaf manuscripts
CN106776880A (en) A kind of paper based on picture and text identification reviews system and method
CN117763175A (en) Heterogeneous knowledge-fused multi-strategy image retrieval method and system
Alamsyah et al. Autoencoder image denoising to increase optical character recognition performance in text conversion
Antony et al. A framework for recognition of handwritten South Dravidian Tulu script
Thaker et al. Structural feature extraction to recognize some of the offline isolated handwritten Gujarati characters using decision tree classifier
Adyanthaya Text recognition from images: a study
CN110852359B (en) Genealogy recognition method and system based on deep learning
Ahmed et al. Enhancing the character segmentation accuracy of bangla ocr using bpnn

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190215