CN109344297A - A method for offline acquisition of in-version cataloging data of books in a shared book system - Google Patents
A method for offline acquisition of in-version cataloging data of books in a shared book system
- Publication number
- CN109344297A (application number CN201811085837.5A)
- Authority
- CN
- China
- Prior art keywords
- gray
- value
- pixel
- title page
- books title
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Character Input (AREA)
Abstract
The present invention provides a method for acquiring Cataloging in Publication (CIP) data offline in a shared book system. The book title-page image is first preprocessed so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed, which improves the accuracy of data acquisition. Optical character recognition is then performed on the processed title-page image to obtain the text it contains. Finally, the title, author, publisher, publication year and ISBN of the book are parsed from that text according to the format of CIP data. As a result, CIP data can be acquired directly from a photo of a physical book's title page or from an e-book title-page image while the shared book system is offline, that is, not connected to the Internet.
    Description
Technical field
      The invention belongs to the technical field of data acquisition, and in particular relates to a method for acquiring the Cataloging in Publication (CIP) data of books offline in a shared book system. More specifically, it relates to a method that obtains CIP data by processing the text in a photo of a physical book's title page or in an e-book title-page image (hereinafter collectively referred to as the book title-page image).
    Background art
      The national reading rate in China is rising year by year. Books, as carriers of knowledge and information, are both in demand for sharing and objectively suited to it. In existing shared book systems, CIP data is obtained only by manual entry or by online lookup over the Internet. As a result, data collection is inefficient and error-prone and requires considerable labor.
    Summary of the invention
      The technical problem to be solved by the present invention is to provide a method for acquiring Cataloging in Publication (CIP) data offline in a shared book system, so that CIP data can be acquired directly from a photo of a physical book's title page or from an e-book title-page image while the shared book system is offline, that is, not connected to the Internet. The invention achieves this by applying image processing to the book title-page image, obtaining the text on the title page through optical character recognition, and parsing that text.
      The method first preprocesses the book title-page image so that the pixels forming the text are separated from the pixels forming the background and noise that interferes with recognition is removed, which improves the accuracy of data acquisition. Optical character recognition is then performed on the processed title-page image to obtain the text it contains. Finally, the title, author, publisher, publication year and ISBN of the book are parsed from that text according to the format of CIP data.
      To achieve the above object, the invention adopts the following technical scheme:
      Step 1: Preprocess the book title-page image by grayscale conversion, binarization and noise reduction, to ensure the recognition accuracy of step 2.
      Step 2: Perform optical character recognition on the processed book title-page image using Tesseract-OCR to obtain the text on the title page.
      Step 3: Extract the useful information from the text obtained in step 2, match it against the attributes of the CIP data model, and extract the CIP data.
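      Step 2 names Tesseract-OCR but does not say how it is invoked. A minimal sketch, assuming the `tesseract` command-line tool is installed and on the PATH, could wrap it as shown here. The function names, the `chi_sim` (Simplified Chinese) language pack and page segmentation mode 6 are illustrative assumptions, not details taken from the patent.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "chi_sim", psm: int = 6) -> List[str]:
    """Build the argument list for the tesseract CLI.

    "stdout" as the output base makes tesseract print the recognized text
    instead of writing a file. lang and psm are assumed defaults.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path: str, lang: str = "chi_sim") -> str:
    """Run tesseract on a preprocessed title-page image and return its text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,   # collect stdout and stderr
        encoding="utf-8",      # decode the recognized text as UTF-8
        check=True,            # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

      Keeping the command construction separate from the subprocess call makes the invocation easy to inspect and adjust without running the OCR engine.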
      Preferably, step 1 is specifically:
      Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let the coordinates of a pixel be (x, y), and let its red, green and blue component values be R(x, y), G(x, y) and B(x, y). Convert the RGB value of each pixel to a gray value according to formula (2-1):
      $$F(x, y) = 0.30\,R(x, y) + 0.59\,G(x, y) + 0.11\,B(x, y) \tag{2-1}$$
      Step 1.2: From the gray values obtained in step 1.1, find the maximum and minimum gray values over all pixels of the book title-page image. Let the maximum gray value be $Gray_{max}$, the minimum gray value be $Gray_{min}$, and the binarization threshold be T. Compute the initial binarization threshold according to formula (2-2):
      $$T(0) = (Gray_{min} + Gray_{max}) / 2 \tag{2-2}$$
      Step 1.3: Compare the gray value of every pixel in the book title-page image with the threshold T(0) obtained in step 1.2. Classify pixels whose gray value is less than T(0) as foreground pixels and pixels whose gray value is greater than T(0) as background pixels. Compute the mean gray value of all foreground pixels, $Gray_f$, and the mean gray value of all background pixels, $Gray_b$, and compute the next threshold according to formula (2-3):
      $$T(n) = (Gray_f + Gray_b) / 2, \quad n = 1, 2, 3, \ldots \tag{2-3}$$
      Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the book title-page image with the threshold T(n) obtained in the previous step. Classify pixels whose gray value is less than T(n) as foreground pixels and pixels whose gray value is greater than T(n) as background pixels. Recompute $Gray_f$ and $Gray_b$, and compute the next threshold according to formula (2-3).
      Step 1.5: Repeat step 1.4 until T(n-1) = T(n); the threshold T has then reached its optimal value. Compare the gray value of every pixel in the book title-page image with T, setting the gray value of pixels greater than T to 255 and the gray value of pixels less than T to 0.
      Step 1.6: Compare each pixel whose gray value is 0 with its eight neighboring pixels. If all eight neighbors have a gray value of 255, the pixel is an isolated point and its gray value is set to 255.
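      Steps 1.1 to 1.5 can be sketched in plain Python as follows. This is an illustrative sketch rather than the patent's implementation: the function names, the nested-list image representation and the `max_iter` safety cap are assumptions. The patent only says "less than" and "greater than" the threshold, so sending pixels exactly equal to the threshold to the background is also an assumption.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[float]]


def to_gray(image: RGBImage) -> GrayImage:
    """Formula (2-1): weighted sum of the R, G and B components."""
    return [[0.30 * r + 0.59 * g + 0.11 * b for (r, g, b) in row] for row in image]


def iterative_threshold(gray: GrayImage, max_iter: int = 100) -> float:
    """Steps 1.2-1.5: iterate T(n) = (Gray_f + Gray_b) / 2 until T stops changing."""
    values = [v for row in gray for v in row]
    t = (min(values) + max(values)) / 2            # formula (2-2)
    for _ in range(max_iter):                      # max_iter is an assumed safety cap
        fg = [v for v in values if v < t]          # darker pixels: text (foreground)
        bg = [v for v in values if v >= t]         # ties go to background (assumption)
        if not fg or not bg:                       # uniform image: nothing to separate
            return t
        t_next = (sum(fg) / len(fg) + sum(bg) / len(bg)) / 2   # formula (2-3)
        if t_next == t:                            # T(n-1) = T(n): converged
            return t
        t = t_next
    return t


def binarize(gray: GrayImage, t: float) -> List[List[int]]:
    """Step 1.5: pixels below T become 0 (text), all others 255 (background)."""
    return [[0 if v < t else 255 for v in row] for row in gray]
```

      For a row of gray values 10, 20, 200, 220 the initial threshold is (10 + 220) / 2 = 115; the foreground mean is 15 and the background mean is 210, so the next threshold is 112.5, and a further pass leaves it unchanged.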
      Preferably, step 3 is specifically:
      In the ideal case, the text of the book title-page image obtained in step 2 matches regular expression (3-1):
      ([\s\S]*)[/](.*?)[write and write].*[-]*.*?[: :](.*?)[, ](\d{4}[/-|])\d{1,2}[\s\S]*ISBN(.*)
Regular expression (3-1)
      When the book title-page image is of lower clarity, the text matches regular expression (3-2):
      ([\s\S]*)[/, ](.*?)[write and write].*[- mono- _]*.*?[: :](.*?)[, ,] and (.*?)[\s\S]*[iI1 |][Ss5][B8][NM](.*)
Regular expression (3-2)
      According to the regular expressions, the content captured by each "()" group is extracted as the CIP data of the shared book. From left to right, the groups are: book title, author or editor name, publisher, publication year and ISBN.
      In regular expressions (3-1) and (3-2), the symbols have the following meanings: \s matches any whitespace character, including spaces, line breaks and tabs; \S matches any non-whitespace character; * matches the preceding subexpression any number of times; .*? performs a non-greedy match over any characters other than line feed and carriage return; .* performs a greedy match over any characters other than line feed and carriage return; [] denotes a character set and matches any single character listed between "[" and "]"; \d matches one digit and is equivalent to [0-9]; {n} matches the preceding subexpression exactly n times, where n is a fixed non-negative integer; () marks a group, and the string matched by the part between "(" and ")" is the CIP information to be extracted. All other symbols in regular expressions (3-1) and (3-2), such as the Chinese authorship markers (for example the characters meaning "compile" and "write") and the letters I, S, B and N, match themselves literally.
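      The parsing in step 3 can be illustrated with Python's `re` module. The pattern below is a simplified stand-in written for this example, not the patent's expression (3-1) or (3-2): the authorship characters 编 and 著, the separator characters and the sample text (a made-up title, author, publisher and ISBN) are all assumptions. It borrows the idea from expression (3-2) of accepting common OCR confusions in the letters of "ISBN".

```python
import re
from typing import Dict, Optional

# Simplified, hypothetical pattern; each capture group is one CIP field.
CIP_PATTERN = re.compile(
    r"([^/\n]+)/"                   # group 1: title, up to the first slash on the line
    r"(.*?)[编著]"                   # group 2: author, up to an authorship marker
    r".*?[::]"                     # skip the place of publication, up to a colon
    r"(.*?)[,,]"                   # group 3: publisher, up to a comma
    r"(\d{4})[./-]\d{1,2}"          # group 4: year, then a separator and the month
    r"[\s\S]*?[iI1|][Ss5][B8][NM]"  # skip ahead to "ISBN", tolerating OCR confusions
    r"\s*([\dXx-]+)"                # group 5: ISBN digits and hyphens
)


def parse_cip(text: str) -> Optional[Dict[str, str]]:
    """Return the five CIP fields found in OCR text, or None if nothing matches."""
    match = CIP_PATTERN.search(text)
    if match is None:
        return None
    keys = ("title", "author", "publisher", "year", "isbn")
    return dict(zip(keys, (group.strip() for group in match.groups())))


# Made-up sample in the usual CIP layout: title/author. -place:publisher,year.month
sample = "示例书名/张三编著.-北京:示例出版社,2018.9\nISBN 978-7-000-00000-0"
fields = parse_cip(sample)
```

      The non-greedy groups stop at the first delimiter they meet, which is why the field order in the pattern has to follow the order of the fields on the title page.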
      The present invention provides a method for acquiring CIP data directly from a book title-page image when the shared book system is offline, that is, not connected to the Internet. This reduces the manual steps involved in acquiring book information in a shared book system and saves labor costs.
    Brief description of the drawings
      Fig. 1 is a flow chart of acquiring CIP data offline in a shared book system.
      Fig. 2 is an example of acquiring CIP data offline in a shared book system.
    Detailed description of the embodiments
      The present invention provides a method for acquiring Cataloging in Publication (CIP) data offline in a shared book system. As shown in Fig. 1, a specific embodiment comprises the following steps:
      Step 1: Preprocess the book title-page image by grayscale conversion, binarization and noise reduction, to ensure the recognition accuracy of step 2. Specifically:
      Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let the coordinates of a pixel be (x, y), and let its red, green and blue component values be R(x, y), G(x, y) and B(x, y). Convert the RGB value of each pixel to a gray value according to formula (2-1):
      $$F(x, y) = 0.30\,R(x, y) + 0.59\,G(x, y) + 0.11\,B(x, y) \tag{2-1}$$
      Step 1.2: From the gray values obtained in step 1.1, find the maximum and minimum gray values over all pixels of the book title-page image. Let the maximum gray value be $Gray_{max}$, the minimum gray value be $Gray_{min}$, and the binarization threshold be T. Compute the initial binarization threshold according to formula (2-2):
      $$T(0) = (Gray_{min} + Gray_{max}) / 2 \tag{2-2}$$
      Step 1.3: Compare the gray value of every pixel in the book title-page image with the threshold T(0) obtained in step 1.2. Classify pixels whose gray value is less than T(0) as foreground pixels and pixels whose gray value is greater than T(0) as background pixels. Compute the mean gray value of all foreground pixels, $Gray_f$, and the mean gray value of all background pixels, $Gray_b$, and compute the next threshold according to formula (2-3):
      $$T(n) = (Gray_f + Gray_b) / 2, \quad n = 1, 2, 3, \ldots \tag{2-3}$$
      Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the book title-page image with the threshold T(n) obtained in the previous step. Classify pixels whose gray value is less than T(n) as foreground pixels and pixels whose gray value is greater than T(n) as background pixels. Recompute $Gray_f$ and $Gray_b$, and compute the next threshold according to formula (2-3).
      Step 1.5: Repeat step 1.4 until T(n-1) = T(n); the threshold T has then reached its optimal value. Compare the gray value of every pixel in the book title-page image with T, setting the gray value of pixels greater than T to 255 and the gray value of pixels less than T to 0.
      Step 1.6: Compare each pixel whose gray value is 0 with its eight neighboring pixels. If all eight neighbors have a gray value of 255, the pixel is an isolated point and its gray value is set to 255.
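      The isolated-point removal of step 1.6 can be sketched as follows. The function name and the nested-list image are assumptions made for this sketch. The patent does not say how pixels on the image border, which have fewer than eight neighbors, are handled; here the neighbors that exist are checked and the missing ones are ignored, which is an assumption.

```python
from typing import List


def remove_isolated_points(binary: List[List[int]]) -> List[List[int]]:
    """Step 1.6: turn a black pixel white when every neighbor is white (255)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]      # write to a copy so scan order does not matter
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue                      # only black (text-colored) pixels are candidates
            neighbors = [
                binary[ny][nx]
                for ny in range(y - 1, y + 2)
                for nx in range(x - 1, x + 2)
                if (ny, nx) != (y, x) and 0 <= ny < height and 0 <= nx < width
            ]
            # Border pixels have fewer than eight neighbors (handled by assumption).
            if neighbors and all(v == 255 for v in neighbors):
                cleaned[y][x] = 255           # isolated point: treat it as noise
    return cleaned
```

      A single black pixel surrounded by white is removed, while two adjacent black pixels are both kept, since each has a black neighbor.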
      Step 2: Perform optical character recognition on the processed book title-page image using Tesseract-OCR to obtain the text on the title page.
      Step 3: Extract the useful information from the text obtained in step 2, match it against the attributes of the CIP data model, and extract the CIP data. In the ideal case, the text of the book title-page image obtained in step 2 matches regular expression (3-1):
      ([\s\S])*[/](.*?)[write and write].*[-]*.*?[: :](.*?)[, ](\d{4}[- |/|])\d{1,2}[\s\S]*ISBN(.*)
Regular expression (3-1)
      When the book title-page image is of lower clarity, the text matches regular expression (3-2):
      ([\s\S]*)[/, ](.*?)[write and write].*[- mono- _]*.*?[: :](.*?)[, ,] and (.*?)[\s\S]*[iI1 |][Ss5][B8][NM](.*)
Regular expression (3-2)
      According to the regular expressions, the content captured by each "()" group is extracted as the CIP data of the shared book. From left to right, the groups are: book title, author or editor name, publisher, publication year and ISBN.
    Embodiment 1
      As shown in Fig. 2, the left side of the figure is the book title-page image uploaded by the user, and the right side is the CIP data collected by the method provided by the invention.
    Claims (4)
1. A method for acquiring Cataloging in Publication (CIP) data offline in a shared book system, characterized by comprising the following steps:
      Step 1: preprocess the book title-page image, separating the pixels forming the text from the pixels forming the background and removing noise that interferes with recognition;
      Step 2: perform optical character recognition on the preprocessed book title-page image to obtain the text in the image;
      Step 3: extract the useful information from the text obtained in step 2, match it against the attributes of the CIP data model, and extract the CIP data.
    2. The method for acquiring CIP data offline in a shared book system according to claim 1, characterized in that preprocessing the book title-page image in step 1 comprises: performing grayscale conversion, binarization and noise reduction on the book title-page image.
    3. The method for acquiring CIP data offline in a shared book system according to claim 1, characterized in that in step 2 optical character recognition is performed on the processed book title-page image using Tesseract-OCR.
    4. The method for acquiring CIP data offline in a shared book system according to claim 1, characterized in that step 1 is specifically:
      Step 1.1: Obtain the RGB value of each pixel in the book title-page image. Let the coordinates of a pixel be (x, y), and let its red, green and blue component values be R(x, y), G(x, y) and B(x, y). Convert the RGB value of each pixel to a gray value according to formula (2-1):
      $$F(x, y) = 0.30\,R(x, y) + 0.59\,G(x, y) + 0.11\,B(x, y) \tag{2-1}$$
      Step 1.2: From the gray values obtained in step 1.1, find the maximum and minimum gray values over all pixels of the book title-page image. Let the maximum gray value be $Gray_{max}$, the minimum gray value be $Gray_{min}$, and the binarization threshold be T. Compute the initial binarization threshold according to formula (2-2):
      $$T(0) = (Gray_{min} + Gray_{max}) / 2 \tag{2-2}$$
      Step 1.3: Compare the gray value of every pixel in the book title-page image with the threshold T(0) obtained in step 1.2. Classify pixels whose gray value is less than T(0) as foreground pixels and pixels whose gray value is greater than T(0) as background pixels. Compute the mean gray value of all foreground pixels, $Gray_f$, and the mean gray value of all background pixels, $Gray_b$, and compute the next threshold according to formula (2-3):
      $$T(n) = (Gray_f + Gray_b) / 2, \quad n = 1, 2, 3, \ldots \tag{2-3}$$
      Step 1.4: Compare T(n-1) with T(n). If T(n-1) ≠ T(n), compare the gray value of every pixel in the book title-page image with the threshold T(n) obtained in the previous step. Classify pixels whose gray value is less than T(n) as foreground pixels and pixels whose gray value is greater than T(n) as background pixels. Recompute $Gray_f$ and $Gray_b$, and compute the next threshold according to formula (2-3);
      Step 1.5: Repeat step 1.4 until T(n-1) = T(n); the threshold T has then reached its optimal value. Compare the gray value of every pixel in the book title-page image with T, setting the gray value of pixels greater than T to 255 and the gray value of pixels less than T to 0;
      Step 1.6: Compare each pixel whose gray value is 0 with its eight neighboring pixels. If all eight neighbors have a gray value of 255, the pixel is an isolated point and its gray value is set to 255.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201811085837.5A CN109344297A (en) | 2018-09-18 | 2018-09-18 | A method for offline acquisition of in-version cataloging data of books in a shared book system | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201811085837.5A CN109344297A (en) | 2018-09-18 | 2018-09-18 | A method for offline acquisition of in-version cataloging data of books in a shared book system | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN109344297A (en) | 2019-02-15 | 
Family
ID=65305408
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN201811085837.5A Withdrawn CN109344297A (en) | 2018-09-18 | 2018-09-18 | A method for offline acquisition of in-version cataloging data of books in a shared book system | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN109344297A (en) | 
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN113254676A (en) * | 2021-06-24 | 2021-08-13 | 广州卓一信息科技有限公司 | Digital multimedia library management method and system | 
| CN118609140A (en) * | 2024-06-28 | 2024-09-06 | 上海阿法迪智能数字科技股份有限公司 | Real-time collection and input method of book in-print cataloging data and book collection and editing equipment | 
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US20060072830A1 (en) * | 2004-02-26 | 2006-04-06 | Xerox Corporation | Method for automated image indexing and retrieval | 
| CN102289668A (en) * | 2011-09-07 | 2011-12-21 | 谭洪舟 | Binaryzation processing method of self-adaption word image based on pixel neighborhood feature | 
| CN103823454A (en) * | 2014-03-12 | 2014-05-28 | 黄昱俊 | System and method for inquiring and locating books based on machine vision | 
| CN104598435A (en) * | 2015-01-20 | 2015-05-06 | 青岛农业大学 | Method for solving prepress catalog data localization | 
| CN105354574A (en) * | 2015-12-04 | 2016-02-24 | 山东博昂信息科技有限公司 | Vehicle number recognition method and device | 
| CN106156761A (en) * | 2016-08-10 | 2016-11-23 | 北京交通大学 | The image form detection of facing moving terminal shooting and recognition methods | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20190215 |