CN109726286A - Automatic book classification method based on LDA topic model - Google Patents
Automatic book classification method based on LDA topic model
- Publication number
- CN109726286A (application CN201811584226.5A)
- Authority
- CN
- China
- Prior art keywords
- books
- classification
- book
- book labels
- sorted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An automatic book classification method based on an LDA topic model, comprising: establishing a classification system; selecting books of known category as training books, where the labels of all training books form the full book label set and each label in that set is assigned a unique serial number; constructing and training a multinomial distribution model whose input is the book labels contained in each training book together with the category of that book, and whose output is the probability of each label in the full book label set under each category; selecting book labels from a book to be classified to form its label set and then, based on the LDA topic model, using Gibbs sampling to sample and assign a category to each book label of the book to be classified; and, after convergence, computing a score for each category to which the book may belong and determining the book's category from those scores. The invention belongs to the field of information technology and enables automatic book classification based on an LDA topic model.
Description
Technical field
The present invention relates to an automatic book classification method based on an LDA topic model, and belongs to the field of information technology.
Background technique
Book classification has always been important to online and offline book organizations that hold large collections. For online literature platforms and internet bookstores, which are popular with newer groups of readers, accurate classification is the basis for accurate book recommendation; for libraries and physical bookstores that carry traditionally published literature, accurate classification improves management efficiency and user experience. These organizations have many old books whose classification needs correcting and a continuous stream of newly shelved books, and the current, mainly manual, approach to classification suffers from heavy workload, low efficiency, subjectivity and inaccuracy. An efficient and accurate automatic book classification method is therefore increasingly important.
Current automatic book classification algorithms mainly rely on machine learning algorithms such as naive Bayes, support vector machines and neural networks. Because a book is essentially a collection of text, and the books to be classified may include both online literature and traditional literature, these methods do not achieve good results.
The traditional LDA topic model used in NLP (natural language processing) is an unsupervised learning method. Applying it directly amounts to clustering the books, which runs counter to the original goal of classifying them. How to adapt the LDA topic model so that it can be applied to automatic book classification has therefore become a technical problem that urgently needs to be solved.
Summary of the invention
In view of this, the object of the present invention is to provide an automatic book classification method based on an LDA topic model, which can classify books automatically on the basis of the LDA topic model.
To achieve the above object, the present invention provides an automatic book classification method based on an LDA topic model, comprising:
Step 1: establish a classification system containing K categories;
Step 2: select books of known category as training books and extract book labels from each training book; the book labels of all training books form the full book label set, and each book label in that set is assigned a unique serial number;
Step 3: using the training books as samples, construct and train a multinomial distribution model; the input of the model is all book labels contained in each training book together with the category of that book, and the output is the probability of each book label in the full book label set under each category;
Step 4: from the book to be classified, select those of its book labels that belong to the full book label set to form the label set of the book, W = (w_1, w_2, ..., w_d), where d is the number of book labels in the book to be classified and w_1, w_2, ..., w_d are those labels; then, based on the LDA topic model and the per-category probabilities of each book label in the full book label set, use Gibbs sampling to sample and assign a category to each book label of the book to be classified; after convergence, compute for each book label of the book its probability distribution over the categories, compute a score for each category to which the book may belong, and from these scores obtain the category of the book.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention transforms the traditional unsupervised LDA topic model into a supervised LDA topic model algorithm: by training on the book labels of books whose categories are known, it obtains the probability of each book label in the full book label set under each category, and then applies Gibbs sampling to compute a score for each category to which the book to be classified may belong, thereby classifying the book automatically. When computing these scores, the invention does not simply count the categories assigned to the sampled labels, as is done in the LDA model, but uses the probability distributions to compute the sum of the probabilities of each category for the book. It also uses IDF to adjust the weight of each book label, so that the category of a book is identified more accurately and the error caused by a single sample is reduced.
Detailed description of the invention
Fig. 1 is a flow chart of the automatic book classification method based on an LDA topic model according to the present invention.
Fig. 2 is a flow chart of the detailed steps of step 4 in Fig. 1.
Specific embodiment
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the automatic book classification method based on an LDA topic model according to the present invention comprises:
Step 1: establish a classification system containing K categories;
Step 2: select books of known category as training books and extract book labels from each training book; the book labels of all training books form the full book label set, and each book label in that set is assigned a unique serial number;
Step 3: using the training books as samples, construct and train a multinomial distribution model; the input of the model is all book labels contained in each training book together with the category of that book, and the output is the probability of each book label in the full book label set under each category;
Here, all book labels contained in a training book are treated as belonging to the same category as that book, and under each category the probabilities of all book labels in the full book label set sum to 1;
Step 4: from the book to be classified, select those of its book labels that belong to the full book label set to form the label set of the book, W = (w_1, w_2, ..., w_d), where d is the number of book labels in the book to be classified and w_1, w_2, ..., w_d are those labels; then, based on the LDA topic model and the per-category probabilities of each book label in the full book label set, use Gibbs sampling to sample and assign a category to each book label of the book to be classified; after convergence, compute for each book label of the book its probability distribution over the categories, compute a score for each category to which the book may belong, and from these scores obtain the category of the book.
When there are multiple books to be classified, repeating step 4 yields the category of each book in turn.
In step 2, NLP techniques may be used to perform word segmentation and part-of-speech tagging on the main-text chapters of the training books, and the meaningful nouns are extracted as book labels.
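The serial numbering described in step 2 can be sketched as follows. This is a minimal illustration, assuming the labels have already been extracted from each training book (the segmentation and part-of-speech tagging are not shown); the function name `build_label_index`, the sample labels and the 0-based numbering are hypothetical choices, not taken from the patent.

```python
from typing import Dict, List


def build_label_index(training_books: List[List[str]]) -> Dict[str, int]:
    """Collect the labels of all training books into one full label set and
    give each distinct label a unique serial number (0-based in this sketch)."""
    label_index: Dict[str, int] = {}
    for labels in training_books:
        for label in labels:
            if label not in label_index:
                # The next unused integer becomes this label's serial number.
                label_index[label] = len(label_index)
    return label_index


# Hypothetical labels for three training books.
books = [["sword", "dragon", "kingdom"], ["dragon", "magic"], ["stock", "market"]]
index = build_label_index(books)
```

A label that occurs in several books, such as "dragon" here, keeps the serial number it received on first occurrence.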
In step 3, based on probability statistics, the probability of each book label in the full book label set under each category, as output by the multinomial distribution model, may be computed as follows: the probability p_kv of the v-th book label in the full book label set under the k-th category is the ratio of the number of occurrences of the v-th book label in all books of the k-th category to the total number of occurrences of all book labels in all books of the k-th category.
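The ratio described for p_kv can be sketched in a few lines. This is only an illustration under stated assumptions: categories are represented as integers 0..K-1, labels are mapped to serial numbers by a dictionary, and the names `estimate_label_probs`, `training` and `label_index` are hypothetical. No smoothing is applied, because the text does not describe any, so a label never seen under a category gets probability 0 there.

```python
from typing import Dict, List, Tuple


def estimate_label_probs(
    training: List[Tuple[int, List[str]]],
    label_index: Dict[str, int],
    num_categories: int,
) -> List[List[float]]:
    """Return probs[k][v]: occurrences of label v in books of category k,
    divided by the occurrences of all labels in books of category k."""
    num_labels = len(label_index)
    counts = [[0] * num_labels for _ in range(num_categories)]
    for category, labels in training:
        # Every label of a training book is counted under the book's category.
        for label in labels:
            counts[category][label_index[label]] += 1
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Hypothetical training data: (category, labels of one book).
label_index = {"dragon": 0, "magic": 1, "stock": 2, "market": 3}
training = [(0, ["dragon", "magic", "dragon"]), (1, ["stock", "market"]), (0, ["magic"])]
probs = estimate_label_probs(training, label_index, num_categories=2)
```

Each row of `probs` sums to 1 for any category that has at least one training label, which matches the constraint stated for step 3.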
As shown in Fig. 2, step 4 may further comprise:
Step 41: randomly initialize a category for each book label of the book to be classified, and initialize i to 1;
Step 42: extract the i-th book label from the label set W of the book to be classified;
Step 43: compute the probability distribution over categories for the extracted i-th book label [formula missing in source], where p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, k = 1, 2, ..., K; z_i is the category of w_i; v is the serial number of the i-th book label of the book to be classified in the full book label set; p_kv is the probability of the v-th book label in the full book label set under the k-th category, whose value is computed in step 3; n_k(-i) and n_k'(-i) are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels of the label set W after the i-th book label is excluded; and α_k and α_k' are adjustment parameters for the k-th and k'-th categories, whose values can be set according to actual business requirements, for example all set to 1;
Step 44: according to the probability distribution over categories for the i-th book label, randomly sample one category, and update the category of the i-th book label to the sampled category;
Step 45: update i to i+1 and check whether the updated i is greater than d; if so, all book labels in the label set W have been updated once, so continue to the next step; if not, return to step 42;
Step 46: check whether the degree of agreement between the categories of the book labels in W after the current pass and their categories after the previous pass reaches the convergence threshold; if so, convergence has been reached, so continue to the next step; if not, reset i = 1 and return to step 42 to carry out another pass of updating the category of each book label in W;
Step 47: compute, for each book label in the label set of the book to be classified, its probability of belonging to each category [formula missing in source];
Step 48: from the per-category probabilities of each book label in the label set of the book to be classified, compute a score for each category to which the book may belong, then select the maximum score; the category corresponding to the maximum score is the category of the book to be classified.
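Steps 41 to 48 can be sketched as a small sampler. The formula for step 43 is not reproduced in this text, so the sketch assumes a form suggested by the defined symbols: p_kv × (n_k(-i) + α_k) / Σ_k' (n_k'(-i) + α_k'). That form, the names `category_distribution` and `classify_book`, the 0.95 agreement threshold, the `max_sweeps` cap and the unweighted sum used as the score are all assumptions for illustration, not the patent's stated implementation.

```python
import random
from typing import List, Optional, Tuple


def category_distribution(i: int, labels: List[int], z: List[int],
                          probs: List[List[float]], alpha: List[float]) -> List[float]:
    """Assumed step 43: p_kv * (n_k(-i) + alpha_k) / sum_k'(n_k'(-i) + alpha_k')."""
    num_categories = len(probs)
    counts = [0] * num_categories
    for j, category in enumerate(z):
        if j != i:  # exclude the i-th label from the counts
            counts[category] += 1
    denom = sum(counts[k] + alpha[k] for k in range(num_categories))
    v = labels[i]
    return [probs[k][v] * (counts[k] + alpha[k]) / denom for k in range(num_categories)]


def classify_book(labels: List[int], probs: List[List[float]],
                  alpha: Optional[List[float]] = None, threshold: float = 0.95,
                  max_sweeps: int = 200, seed: int = 0) -> Tuple[int, List[float]]:
    if not labels:
        raise ValueError("the book has no labels from the full label set")
    rng = random.Random(seed)
    num_categories = len(probs)
    alpha = alpha or [1.0] * num_categories
    z = [rng.randrange(num_categories) for _ in labels]  # step 41
    for _ in range(max_sweeps):
        previous = list(z)
        for i in range(len(labels)):  # steps 42 to 45: one pass over W
            weights = category_distribution(i, labels, z, probs, alpha)
            if sum(weights) > 0:
                z[i] = rng.choices(range(num_categories), weights=weights)[0]  # step 44
        agreement = sum(a == b for a, b in zip(z, previous)) / len(labels)
        if agreement >= threshold:  # step 46: compare with the previous pass
            break
    scores = [0.0] * num_categories
    for i in range(len(labels)):  # steps 47 and 48: sum per-label probabilities
        for k, p in enumerate(category_distribution(i, labels, z, probs, alpha)):
            scores[k] += p
    return max(range(num_categories), key=scores.__getitem__), scores


# Hypothetical model: two categories, four labels; the book holds labels 0, 1, 0.
probs = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
best, scores = classify_book([0, 1, 0], probs)
```

Because the per-category label probabilities are fixed after training, each pass only updates the category assignments of the book being classified, which is why repeating step 4 per book is sufficient.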
It is worth noting that the present invention does not simply count the categories assigned to the sampled labels, as is done in the LDA model, but uses the probability distributions to compute the sum of the probabilities of each category for the book to be classified, so that the category of the book is identified more accurately and the error caused by a single sample is reduced. In addition, some book labels are so common that they appear in almost all categories, so they discriminate little between categories. The invention therefore further uses IDF as a weight to adjust the score of each category for the book to be classified, and further comprises:
From the probabilities of each book label in the full book label set under each category obtained in step 3, compute the IDF of each book label in the full book label set [formula missing in source], where idf_v is the IDF value of the v-th book label b_v in the full book label set, and num-type(b_v) is the number of categories that contain the v-th book label b_v, as obtained after the sample data are input in step 3.
In this case, the formula used in step 4 to compute the score of each category for the book to be classified is [formula missing in source], where score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, and v is the serial number of the i-th book label of the book to be classified in the full book label set.
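The IDF weighting can be illustrated as below. Neither formula is reproduced in this text, so the sketch assumes the common forms idf_v = log(K / num-type(b_v)) and score_k = Σ_i p(z_i = k, w_i) × idf_v, and it treats a category as containing a label when the label's probability under that category is greater than 0. These forms and the names `label_idf` and `weighted_scores` are assumptions, not confirmed by the source.

```python
import math
from typing import List


def label_idf(probs: List[List[float]]) -> List[float]:
    """Assumed IDF: log(K / number of categories in which the label occurs)."""
    num_categories = len(probs)
    num_labels = len(probs[0])
    idf = []
    for v in range(num_labels):
        num_type = sum(1 for k in range(num_categories) if probs[k][v] > 0)
        idf.append(math.log(num_categories / num_type) if num_type else 0.0)
    return idf


def weighted_scores(per_label_probs: List[List[float]], labels: List[int],
                    idf: List[float]) -> List[float]:
    """Assumed score: sum over the book's labels of p(z_i = k, w_i) * idf_v."""
    num_categories = len(per_label_probs[0])
    scores = [0.0] * num_categories
    for dist, v in zip(per_label_probs, labels):
        for k in range(num_categories):
            scores[k] += dist[k] * idf[v]
    return scores


# Hypothetical values: labels 0 and 1 occur in both categories, label 2 in one.
probs = [[0.5, 0.25, 0.25], [0.5, 0.5, 0.0]]
idf = label_idf(probs)
# Per-label category probabilities for a book holding labels 0 and 2.
scores = weighted_scores([[0.3, 0.3], [0.2, 0.1]], [0, 2], idf)
```

Under these assumptions a label that appears in every category receives an IDF of 0 and so contributes nothing to any score, which reflects the stated aim of discounting labels that do not discriminate between categories.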
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. An automatic book classification method based on an LDA topic model, characterized by comprising:
Step 1: establish a classification system containing K categories;
Step 2: select books of known category as training books and extract book labels from each training book; the book labels of all training books form the full book label set, and each book label in that set is assigned a unique serial number;
Step 3: using the training books as samples, construct and train a multinomial distribution model; the input of the model is all book labels contained in each training book together with the category of that book, and the output is the probability of each book label in the full book label set under each category;
Step 4: from the book to be classified, select those of its book labels that belong to the full book label set to form the label set of the book, W = (w_1, w_2, ..., w_d), where d is the number of book labels in the book to be classified and w_1, w_2, ..., w_d are those labels; then, based on the LDA topic model and the per-category probabilities of each book label in the full book label set, use Gibbs sampling to sample and assign a category to each book label of the book to be classified; after convergence, compute for each book label of the book its probability distribution over the categories, compute a score for each category to which the book may belong, and from these scores obtain the category of the book.
2. The method according to claim 1, characterized in that, in step 2, NLP techniques are used to perform word segmentation and part-of-speech tagging on the main-text chapters of the training books, and the meaningful nouns are extracted as book labels.
3. The method according to claim 1, characterized in that, in step 3, the probability of each book label in the full book label set under each category, as output by the multinomial distribution model, is computed as follows: the probability p_kv of the v-th book label in the full book label set under the k-th category is the ratio of the number of occurrences of the v-th book label in all books of the k-th category to the total number of occurrences of all book labels in all books of the k-th category.
4. The method according to claim 1, characterized in that step 4 further comprises:
Step 41: randomly initialize a category for each book label of the book to be classified, and initialize i to 1;
Step 42: extract the i-th book label from the label set W of the book to be classified;
Step 43: compute the probability distribution over categories for the extracted i-th book label [formula missing in source], where p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, k = 1, 2, ..., K; z_i is the category of w_i; v is the serial number of the i-th book label of the book to be classified in the full book label set; p_kv is the probability of the v-th book label in the full book label set under the k-th category, whose value is computed in step 3; n_k(-i) and n_k'(-i) are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels of the label set W after the i-th book label is excluded; and α_k and α_k' are adjustment parameters for the k-th and k'-th categories;
Step 44: according to the probability distribution over categories for the i-th book label, randomly sample one category, and update the category of the i-th book label to the sampled category;
Step 45: update i to i+1 and check whether the updated i is greater than d; if so, all book labels in the label set W have been updated once, so continue to the next step; if not, return to step 42;
Step 46: check whether the degree of agreement between the categories of the book labels in W after the current pass and their categories after the previous pass reaches the convergence threshold; if so, convergence has been reached; if not, reset i = 1 and return to step 42 to carry out another pass of updating the category of each book label in W.
5. The method according to claim 1, characterized in that, in step 4, computing after convergence the probability distribution over categories for each book label of the book to be classified, computing a score for each category to which the book may belong, and thereby obtaining the category of the book, further comprises:
Step A1: compute, for each book label in the label set of the book to be classified, its probability of belonging to each category [formula missing in source], where p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, k = 1, 2, ..., K; z_i is the category of w_i; v is the serial number of the i-th book label of the book to be classified in the full book label set; p_kv is the probability of the v-th book label in the full book label set under the k-th category, whose value is computed in step 3; n_k(-i) and n_k'(-i) are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels of the label set W after the i-th book label is excluded; and α_k and α_k' are adjustment parameters for the k-th and k'-th categories;
Step A2: from the per-category probabilities of each book label in the label set of the book to be classified, compute a score for each category to which the book may belong, then select the maximum score; the category corresponding to the maximum score is the category of the book to be classified.
6. The method according to claim 1, characterized by further comprising:
From the probabilities of each book label in the full book label set under each category obtained in step 3, compute the IDF of each book label in the full book label set [formula missing in source], where idf_v is the IDF value of the v-th book label b_v in the full book label set, and num-type(b_v) is the number of categories that contain the v-th book label b_v, as obtained after the sample data are input in step 3;
In step 4, the formula used to compute the score of each category for the book to be classified is [formula missing in source], where score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category of the classification system, and v is the serial number of the i-th book label of the book to be classified in the full book label set.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811584226.5A CN109726286B (en) | 2018-12-24 | 2018-12-24 | Automatic book classification method based on LDA topic model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109726286A true CN109726286A (en) | 2019-05-07 |
| CN109726286B CN109726286B (en) | 2020-10-16 |
Family
ID=66296376
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811584226.5A Active CN109726286B (en) | 2018-12-24 | 2018-12-24 | Automatic book classification method based on LDA topic model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109726286B (en) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101587493A (en) * | 2009-06-29 | 2009-11-25 | 中国科学技术大学 | Text classification method |
| US9342591B2 (en) * | 2012-02-14 | 2016-05-17 | International Business Machines Corporation | Apparatus for clustering a plurality of documents |
| CN102929937A (en) * | 2012-09-28 | 2013-02-13 | 福州博远无线网络科技有限公司 | Text-subject-model-based data processing method for commodity classification |
| CN103473309A (en) * | 2013-09-10 | 2013-12-25 | 浙江大学 | Text categorization method based on probability word selection and supervision subject model |
| CN105045812A (en) * | 2015-06-18 | 2015-11-11 | 上海高欣计算机系统有限公司 | Text topic classification method and system |
| CN106326495A (en) * | 2016-09-27 | 2017-01-11 | 浪潮软件集团有限公司 | A Chinese Text Automatic Classification Method Based on Topic Model |
Non-Patent Citations (1)
| Title |
|---|
| 宫小翠等: "基于Labeled LDA 主题模型的医学文献自动分类", 《中华医学图书情报杂志》 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | A Bayesian-based LDA topic label calibration method, system and medium |
| CN110569270B (en) * | 2019-08-15 | 2022-07-05 | 中国人民解放军国防科技大学 | Bayesian-based LDA topic label calibration method, system and medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109726286B (en) | 2020-10-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN104992184B (en) | A kind of multiclass image classification method based on semi-supervised extreme learning machine | |
| CN112632274B (en) | Abnormal event classification method and system based on text processing | |
| CN107861951A (en) | Session subject identifying method in intelligent customer service | |
| CN110807086B (en) | Text data labeling method and device, storage medium and electronic equipment | |
| CN101354714B (en) | A Question Recommendation Method Based on Probabilistic Latent Semantic Analysis | |
| CN110472665A (en) | Model training method, file classification method and relevant apparatus | |
| CN105183833A (en) | User model based microblogging text recommendation method and recommendation apparatus thereof | |
| CN106709754A (en) | Power user grouping method based on text mining | |
| CN106204156A (en) | A kind of advertisement placement method for network forum and device | |
| CN109472462A (en) | A project risk rating method and device based on multi-model stack fusion | |
| CN110866121A (en) | A method for constructing knowledge graph for electric power field | |
| CN107885849A (en) | A kind of moos index analysis system based on text classification | |
| CN104834918A (en) | Human behavior recognition method based on Gaussian process classifier | |
| CN106708947A (en) | Big data-based web article forwarding recognition method | |
| CN110309234A (en) | A kind of client of knowledge based map holds position method for early warning, device and storage medium | |
| CN109388749A (en) | The detection of accurate high-efficiency network public sentiment and method for early warning based on multi-layer geography | |
| CN117807323A (en) | Online interactive smart Prime big data platform | |
| CN109726286A (en) | A kind of library automatic classification method based on LDA topic model | |
| CN108536673A (en) | Media event abstracting method and device | |
| CN108694176A (en) | Method, apparatus, electronic equipment and the readable storage medium storing program for executing of document sentiment analysis | |
| WO2020135054A1 (en) | Method, device and apparatus for video recommendation and storage medium | |
| CN119961443A (en) | An intelligent recommendation method and system based on scholars' academic background and user tags | |
| CN109657122A (en) | A kind of Academic Teams' important member's recognition methods based on academic big data | |
| CN116304356B (en) | A multi-scene content creation and application system for scenic spots based on AIGC | |
| CN108764537B (en) | A-TrAdaboost algorithm-based multi-source community label development trend prediction method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CP01 | Change in the name or title of a patent holder | Address after: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province; Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.; Address before: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province; Patentee before: EB Information Technology Ltd. |