
CN109726286A - Automatic book classification method based on LDA topic model

Info

Publication number
CN109726286A
Authority
CN
China
Prior art keywords
books
classification
book
book labels
sorted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811584226.5A
Other languages
Chinese (zh)
Other versions
CN109726286B (en)
Inventor
符俊涛
王超芸
李曲
应文佳
马堃
沈钦壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinxun Digital Technology Hangzhou Co ltd
Original Assignee
Hangzhou Dongxin Beiyou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dongxin Beiyou Information Technology Co Ltd
Priority to CN201811584226.5A
Publication of CN109726286A
Application granted
Publication of CN109726286B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic book classification method based on an LDA topic model, comprising: establishing a classification system; selecting books of known category as training books, where the labels of all training books form a total book label set and each label in that set is assigned a unique serial number; constructing and training a multinomial distribution model whose input is the book labels contained in each training book together with that book's category, and whose output is the probability of each label in the total book label set under each category; and selecting from a book to be classified the labels that appear in the total book label set to form that book's label set, then, based on the LDA topic model, using Gibbs sampling to assign a sampled category to each book label of the book to be classified and, after convergence, computing a score for each category to which the book may belong and determining the book's category from those scores. The invention belongs to the field of information technology and enables automatic book classification based on an LDA topic model.

Description

Automatic book classification method based on an LDA topic model
Technical field
The present invention relates to an automatic book classification method based on an LDA topic model and belongs to the field of information technology.
Background
Book classification has always mattered to online and offline organizations that hold large numbers of books. For the online literature platforms and internet bookstores favored by newer groups of readers, accurate book classification is the basis for accurate book recommendation; for libraries and physical bookstores that carry traditionally published literature, accurate book classification improves management efficiency and user experience. These organizations have many old books whose classification needs correcting and a steady stream of newly shelved books, and the current, mainly manual approach to book classification suffers from a heavy workload, low efficiency, subjectivity, and inaccuracy. An efficient and accurate automatic book classification method is therefore increasingly important.
Current automatic book classification algorithms mainly rely on machine learning algorithms such as naive Bayes, support vector machines, and neural networks. Because a book is essentially a collection of text, and the books to be classified may include both online literature and traditional literature, these methods do not achieve good results.
The traditional LDA topic model from NLP (natural language processing) is an unsupervised learning method. Applying it directly amounts to clustering the books, which runs counter to the goal of classifying them. How to modify the LDA topic model so that it can be applied to automatic book classification has therefore become a technical problem that practitioners urgently need to solve.
Summary of the invention
In view of this, the object of the present invention is to provide an automatic book classification method based on an LDA topic model, which can classify books automatically on the basis of the LDA topic model.
To achieve this object, the present invention provides an automatic book classification method based on an LDA topic model, comprising:
Step 1: Establish a classification system containing K categories.
Step 2: Select books of known category as training books and extract book labels from each training book. The book labels of all training books form the total book label set, and each book label in the total book label set is assigned a unique serial number.
Step 3: Using the training books as samples, construct and train a multinomial distribution model. The input of the model is all the book labels contained in each training book together with the category of that training book; the output is the probability of each book label in the total book label set under each category.
Step 4: From the book to be classified, select the book labels that appear in the total book label set to form the label set of the book to be classified, W = (w_1, w_2, …, w_d), where d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels. Then, based on the LDA topic model and the probability of each book label in the total book label set under each category, use Gibbs sampling to assign a sampled category to each book label contained in the book to be classified. After convergence, compute the probability distribution over categories for each book label of the book to be classified and tally the score of each category for the book, thereby obtaining the category of the book to be classified.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention transforms the traditional unsupervised LDA topic model into a supervised LDA topic model algorithm: by training on book labels of known category, it obtains the probability of each book label in the total book label set under each category, and then applies Gibbs sampling to compute the score of each category for the book to be classified, thereby classifying books automatically. When computing the score of each category, the invention does not directly count the categories of the sampled labels, as an LDA model would, but instead uses the probability distributions to compute the sum of probabilities for each category; it also uses IDF to adjust the weight of each book label. This identifies book categories more accurately and reduces the error caused by a single sample.
Brief description of the drawings
Fig. 1 is a flowchart of the automatic book classification method based on an LDA topic model according to the present invention.
Fig. 2 is a flowchart of the detailed steps of step 4 in Fig. 1.
Detailed description
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the automatic book classification method based on an LDA topic model according to the present invention comprises:
Step 1: Establish a classification system containing K categories.
Step 2: Select books of known category as training books and extract book labels from each training book. The book labels of all training books form the total book label set, and each book label in the total book label set is assigned a unique serial number.
Step 3: Using the training books as samples, construct and train a multinomial distribution model. The input of the model is all the book labels contained in each training book together with the category of that training book; the output is the probability of each book label in the total book label set under each category.
Here, every book label contained in a training book is assigned the same category as that training book, and under each category the probabilities of all book labels in the total book label set sum to 1.
Step 4: From the book to be classified, select the book labels that appear in the total book label set to form the label set of the book to be classified, W = (w_1, w_2, …, w_d), where d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels. Then, based on the LDA topic model and the probability of each book label in the total book label set under each category, use Gibbs sampling to assign a sampled category to each book label contained in the book to be classified. After convergence, compute the probability distribution over categories for each book label of the book to be classified and tally the score of each category for the book, thereby obtaining the category of the book to be classified.
When there are multiple books to be classified, repeating step 4 for each one yields the category of every book in turn.
In step 2, NLP techniques may be used to perform word segmentation and part-of-speech tagging on the body-text chapters of the training books, and valid nouns are extracted as book labels.
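The label-numbering part of step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the labels have already been extracted by some segmentation and part-of-speech tagging tool, it numbers labels from 0 in first-seen order (the text only requires that the numbers be unique), and the function name and toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# A training book is modelled as (extracted labels, known category index).
TrainingBook = Tuple[List[str], int]


def build_label_vocabulary(training_books: List[TrainingBook]) -> Dict[str, int]:
    """Assign each distinct label a unique serial number in first-seen order."""
    vocab: Dict[str, int] = {}
    for labels, _category in training_books:
        for label in labels:
            if label not in vocab:
                # Assumption: 0-based numbering; any unique numbering would do.
                vocab[label] = len(vocab)
    return vocab


# Hypothetical toy data, for illustration only.
training_books = [
    (["sword", "cultivation", "sect"], 0),
    (["detective", "murder", "sword"], 1),
]
vocab = build_label_vocabulary(training_books)
```

A label that appears in several books, such as "sword" above, receives a single serial number, so the total book label set contains each label once.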
In step 3, based on probability statistics, the probability of each book label in the total book label set under each category, as output by the multinomial distribution model, may be computed as follows: the probability p_kv of the v-th book label in the total book label set under the k-th category is the ratio of the number of occurrences of the v-th book label in all books of the k-th category to the total number of book labels in all books of the k-th category.
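The ratio described for p_kv can be sketched in a few lines. The sketch assumes labels are already mapped to serial numbers through a dictionary, that categories are indexed 0 to K-1, and that a category with no training labels gets an all-zero row; the function name and toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def estimate_label_probs(
    training_books: List[Tuple[List[str], int]],
    vocab: Dict[str, int],
    num_categories: int,
) -> List[List[float]]:
    """Return p[k][v]: share of label v among all labels of books in category k."""
    counts = [[0] * len(vocab) for _ in range(num_categories)]
    for labels, category in training_books:
        for label in labels:
            # Every label of a training book inherits the book's category.
            counts[category][vocab[label]] += 1
    probs: List[List[float]] = []
    for row in counts:
        total = sum(row)
        # Assumption: a category without any labels yields a row of zeros.
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Hypothetical toy data, for illustration only.
vocab = {"sword": 0, "sect": 1, "murder": 2}
training_books = [
    (["sword", "sect", "sword"], 0),
    (["murder", "sword"], 1),
]
p = estimate_label_probs(training_books, vocab, num_categories=2)
```

Each non-empty row of `p` sums to 1, which matches the requirement that the probabilities of all book labels under a category sum to 1.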
As shown in Fig. 2, step 4 may further comprise:
Step 41: Randomly initialize a category for each book label in the book to be classified, and initialize i to 1.
Step 42: Take the i-th book label from the label set W of the book to be classified.
Step 43: Compute the probability distribution over categories for the i-th book label, where p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, k = 1, 2, …, K; z_i is the category of w_i; v is the serial number of the i-th book label of the book to be classified in the total book label set; p_kv is the probability of the v-th book label in the total book label set under the k-th category, as computed in step 3; n_k(-i) and n_k'(-i) are the numbers of labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W after the i-th book label has been removed; and α_k and α_k' are adjustment parameters for the k-th and k'-th categories, whose values can be set according to business requirements, for example all set to 1.
Step 44: According to the probability distribution over categories for the i-th book label, randomly sample one category and update the category of the i-th book label to the sampled category.
Step 45: Update i to i+1 and check whether the updated i is greater than d. If so, all book labels in the label set W have been updated once; continue to the next step. If not, return to step 42.
Step 46: Check whether the agreement between the categories of the book labels in W after the current pass and their categories after the most recent previous pass reaches the convergence threshold. If so, convergence has been reached; continue to the next step. If not, reset i = 1 and return to step 42 to make another pass over the categories of the book labels in W.
Step 47: Compute the probability of each category for each book label in the label set of the book to be classified.
Step 48: From the probability of each category for each book label in the label set of the book to be classified, compute the score of each category for the book, then select the maximum score; the category corresponding to the maximum score is the category of the book to be classified.
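Steps 41 to 48 can be sketched as the loop below. The formula of step 43 is not reproduced in this text, so the sketch assumes a form suggested by the defined symbols: the weight of category k is p_kv · (n_k(-i) + α_k) / Σ_k' (n_k'(-i) + α_k'), normalised over k so that it can be sampled. That form, the normalisation, the uniform fallback, the default threshold, the sweep cap, and all function names are assumptions, not the patent's stated method.

```python
import random
from typing import List, Tuple


def category_distribution(i: int, label_ids: List[int], z: List[int],
                          p: List[List[float]], alpha: List[float]) -> List[float]:
    """Assumed step 43: distribution over categories for the i-th label."""
    K = len(p)
    counts = [0] * K
    for j, category in enumerate(z):
        if j != i:  # n_k(-i): counts with the i-th label removed
            counts[category] += 1
    denom = sum(counts[k] + alpha[k] for k in range(K))
    weights = [p[k][label_ids[i]] * (counts[k] + alpha[k]) / denom for k in range(K)]
    total = sum(weights)
    if total == 0:  # assumption: fall back to uniform if no category fits
        return [1.0 / K] * K
    return [w / total for w in weights]


def gibbs_classify(label_ids: List[int], p: List[List[float]], alpha: List[float],
                   threshold: float = 1.0, max_sweeps: int = 200,
                   seed: int = 0) -> Tuple[int, List[float]]:
    """Return (best category, per-category scores) for one book."""
    if not label_ids:
        raise ValueError("the book has no labels from the total book label set")
    rng = random.Random(seed)
    K, d = len(p), len(label_ids)
    z = [rng.randrange(K) for _ in range(d)]           # step 41
    for _ in range(max_sweeps):                        # cap is an assumption
        previous = list(z)
        for i in range(d):                             # steps 42 to 45
            dist = category_distribution(i, label_ids, z, p, alpha)
            z[i] = rng.choices(range(K), weights=dist)[0]  # step 44
        agreement = sum(a == b for a, b in zip(previous, z)) / d
        if agreement >= threshold:                     # step 46
            break
    dists = [category_distribution(i, label_ids, z, p, alpha) for i in range(d)]  # step 47
    scores = [sum(dist[k] for dist in dists) for k in range(K)]  # step 48: probability sums
    return max(range(K), key=scores.__getitem__), scores


# Hypothetical toy model: labels 0 and 1 occur only in category 0.
p = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
best, scores = gibbs_classify([0, 1, 0], p, alpha=[1.0, 1.0])
```

Because sampling is random, exact agreement between two passes may never occur on ambiguous books, which is why the sketch treats the threshold as a tunable fraction and stops after a fixed number of passes.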
It is worth noting that the present invention does not directly count the categories of the sampled labels, as an LDA model would, but uses the probability distributions to compute the sum of probabilities of each category for the book to be classified, which identifies book categories more accurately and reduces the error caused by a single sample. In addition, some book labels are so common that they appear in almost every category, so they discriminate poorly between categories. The present invention therefore also uses IDF as a weight to adjust the score of each category for the book to be classified, which further comprises:
From the probabilities of each book label in the total book label set under each category, as computed in step 3, compute the IDF of each book label in the total book label set, where idf_v is the IDF value of the v-th book label b_v in the total book label set and num-type(b_v) is the number of categories that contain the v-th book label b_v after the sample data have been input in step 3.
Accordingly, in step 4 the score of each category for the book to be classified is computed with the IDF values as weights, where score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, and v is the serial number of the i-th book label of the book to be classified in the total book label set.
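The IDF weighting can be sketched as below. The IDF and score formulas are not reproduced in this text, so the sketch assumes a common form, idf_v = log(K / num-type(b_v)), and a score that sums idf_v · p(z_i = k, w_i) over the labels of the book. Both formulas, the treatment of labels that occur in no category, and the function names are assumptions made for illustration.

```python
import math
from typing import List


def label_idf(p: List[List[float]]) -> List[float]:
    """Assumed IDF: log(K / number of categories in which the label occurs)."""
    K = len(p)
    idf: List[float] = []
    for v in range(len(p[0])):
        # num-type(b_v): categories where label v has non-zero probability
        num_type = sum(1 for k in range(K) if p[k][v] > 0)
        # Assumption: a label seen in no category gets weight 0.
        idf.append(math.log(K / num_type) if num_type else 0.0)
    return idf


def weighted_scores(label_ids: List[int], dists: List[List[float]],
                    idf: List[float]) -> List[float]:
    """Assumed score_k: sum over labels of idf_v * p(z_i = k, w_i)."""
    K = len(dists[0])
    return [
        sum(idf[v] * dist[k] for v, dist in zip(label_ids, dists))
        for k in range(K)
    ]


# Hypothetical toy values: label 0 occurs in both categories, labels 1 and 2 in one each.
p = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]]
idf = label_idf(p)
# Per-label category distributions, as step 47 would produce them.
dists = [[0.5, 0.5], [1.0, 0.0]]
scores = weighted_scores([0, 1], dists, idf)
```

Under this assumed form, a label that occurs in every category has IDF 0 and contributes nothing to any score, which reflects the stated motivation that widespread labels discriminate poorly between categories.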
The above is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. An automatic book classification method based on an LDA topic model, characterized by comprising:
Step 1: Establish a classification system containing K categories.
Step 2: Select books of known category as training books and extract book labels from each training book. The book labels of all training books form the total book label set, and each book label in the total book label set is assigned a unique serial number.
Step 3: Using the training books as samples, construct and train a multinomial distribution model. The input of the model is all the book labels contained in each training book together with the category of that training book; the output is the probability of each book label in the total book label set under each category.
Step 4: From the book to be classified, select the book labels that appear in the total book label set to form the label set of the book to be classified, W = (w_1, w_2, …, w_d), where d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels. Then, based on the LDA topic model and the probability of each book label in the total book label set under each category, use Gibbs sampling to assign a sampled category to each book label contained in the book to be classified. After convergence, compute the probability distribution over categories for each book label of the book to be classified and tally the score of each category for the book, thereby obtaining the category of the book to be classified.
2. The method according to claim 1, characterized in that in step 2, NLP techniques are used to perform word segmentation and part-of-speech tagging on the body text of the training books, and valid nouns are extracted as book labels.
3. The method according to claim 1, characterized in that in step 3, the probability of each book label in the total book label set under each category, as output by the multinomial distribution model, is computed as follows: the probability p_kv of the v-th book label in the total book label set under the k-th category is the ratio of the number of occurrences of the v-th book label in all books of the k-th category to the total number of book labels in all books of the k-th category.
4. The method according to claim 1, characterized in that step 4 further comprises:
Step 41: Randomly initialize a category for each book label in the book to be classified, and initialize i to 1.
Step 42: Take the i-th book label from the label set W of the book to be classified.
Step 43: Compute the probability distribution over categories for the i-th book label, where p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, k = 1, 2, …, K; z_i is the category of w_i; v is the serial number of the i-th book label of the book to be classified in the total book label set; p_kv is the probability of the v-th book label in the total book label set under the k-th category, as computed in step 3; n_k(-i) and n_k'(-i) are the numbers of labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W after the i-th book label has been removed; and α_k and α_k' are adjustment parameters for the k-th and k'-th categories.
Step 44: According to the probability distribution over categories for the i-th book label, randomly sample one category and update the category of the i-th book label to the sampled category.
Step 45: Update i to i+1 and check whether the updated i is greater than d. If so, all book labels in the label set W have been updated once; continue to the next step. If not, return to step 42.
Step 46: Check whether the agreement between the categories of the book labels in W after the current pass and their categories after the most recent previous pass reaches the convergence threshold. If so, convergence has been reached. If not, reset i = 1 and return to step 42 to make another pass over the categories of the book labels in W.
5. The method according to claim 1, characterized in that in step 4, after convergence, computing the probability distribution over categories for each book label of the book to be classified and tallying the score of each category for the book, thereby obtaining the category of the book to be classified, further comprises:
Step A1: Compute the probability of each category for each book label in the label set of the book to be classified, where p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, k = 1, 2, …, K; z_i is the category of w_i; v is the serial number of the i-th book label of the book to be classified in the total book label set; p_kv is the probability of the v-th book label in the total book label set under the k-th category, as computed in step 3; n_k(-i) and n_k'(-i) are the numbers of labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W after the i-th book label has been removed; and α_k and α_k' are adjustment parameters for the k-th and k'-th categories.
Step A2: From the probability of each category for each book label in the label set of the book to be classified, compute the score of each category for the book, then select the maximum score; the category corresponding to the maximum score is the category of the book to be classified.
6. The method according to claim 1, characterized by further comprising:
From the probabilities of each book label in the total book label set under each category, as computed in step 3, compute the IDF of each book label in the total book label set, where idf_v is the IDF value of the v-th book label b_v in the total book label set and num-type(b_v) is the number of categories that contain the v-th book label b_v after the sample data have been input in step 3.
In step 4, the score of each category for the book to be classified is computed with the IDF values as weights, where score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, and v is the serial number of the i-th book label of the book to be classified in the total book label set.
CN201811584226.5A 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model Active CN109726286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811584226.5A CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811584226.5A CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Publications (2)

Publication Number Publication Date
CN109726286A true CN109726286A (en) 2019-05-07
CN109726286B CN109726286B (en) 2020-10-16

Family

ID=66296376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811584226.5A Active CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN109726286B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
US9342591B2 (en) * 2012-02-14 2016-05-17 International Business Machines Corporation Apparatus for clustering a plurality of documents
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103473309A (en) * 2013-09-10 2013-12-25 浙江大学 Text categorization method based on probability word selection and supervision subject model
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106326495A (en) * 2016-09-27 2017-01-11 浪潮软件集团有限公司 A Chinese Text Automatic Classification Method Based on Topic Model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宫小翠等: "基于Labeled LDA 主题模型的医学文献自动分类", 《中华医学图书情报杂志》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270A (en) * 2019-08-15 2019-12-13 中国人民解放军国防科技大学 A Bayesian-based LDA topic label calibration method, system and medium
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium

Also Published As

Publication number Publication date
CN109726286B (en) 2020-10-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.

Address before: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee before: EB Information Technology Ltd.