CN105677695B - A content-based method for calculating mobile application similarity - Google Patents

Info

Publication number
CN105677695B
Authority
CN
China
Prior art keywords
app
keyword
weight
checked
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510776878.9A
Other languages
Chinese (zh)
Other versions
CN105677695A (en
Inventor
吴明晖
刘泽民
金苍宏
应晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yuancheng Technology Co Ltd
Original Assignee
Hangzhou Yuancheng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a content-based method for calculating the similarity of mobile applications. The method comprises the following steps: after a large amount of mobile application information is obtained, the relevant fields are extracted, including the application name, application type, application description, and application size; the description text is segmented into words; the segmented content is copied into two parts, one of which is merged and used as the training corpus for a word2vec model, while the other is kept as a document set on which TF-IDF is calculated, with the results stored in an HBase data warehouse; finally, app similarity queries and calculations are performed. The method has the following beneficial effects: it responds quickly to app similarity queries; the content-based features and description information represent an app well, so accuracy is high; and the accuracy of app search and recommendation can be improved.

Description

A content-based method for calculating mobile application similarity
Technical field
The present invention relates to the fields of information retrieval and recommender systems, and in particular to a method that uses information retrieval to calculate the similarity of mobile applications based on their features.
Background art
With the growing prosperity of the mobile Internet and the proposal of "Internet+", the convenience and efficiency of the mobile Internet have become increasingly well known. The O2O (Online To Offline) concept and the many applications that connect online and offline activity have not only accelerated the trade of goods but also greatly enriched people's lives.
In the public's "Internet+" life, the vast number of mobile applications (apps for short) occupies a pivotal position. The major domestic mobile application markets provide strong support for the public's demand for apps. In these markets, users often search for the apps they need. With so many apps available, however, ordinary non-expert users frequently find that many search results are not what they want. A method is therefore urgently needed that, when a user queries for an app, also offers similar apps, so that imprecise queries can still be satisfied. Likewise, a recommender system that proactively recommends apps similar to those already installed on the user's device, in line with the user's preferences, can improve recommendation accuracy.
Existing similarity calculations for applications include those based on the underlying code and on the interface. Code-level similarity cannot directly reflect what ordinary users mean semantically, and a finished mobile app is distributed as a complete .apk file whose underlying code details are not available, so such methods do not meet users' current needs.
There are also similarity calculation methods based on app content. Most of them rely on the app's description, because the description is comparatively authoritative data about the app itself. Existing methods, however, generally process the description with a bag-of-words model. A bag-of-words model ignores word order and therefore much of the context of each word. When the similarity between vectors is calculated, two near-synonyms count as different words, so the similarity is likely to be underestimated and a large error results.
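The near-synonym problem can be illustrated with a minimal sketch. The function name `bow_cosine` and the sample tokens are hypothetical and do not come from the patent, and cosine similarity is assumed as the vector comparison, which the source does not specify.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words that appear in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two descriptions with the same meaning but no shared word.
desc_c = ["photo", "retouching"]
desc_d = ["picture", "enhancement"]
print(bow_cosine(desc_c, desc_d))
```

Because "photo" and "picture" are distinct tokens, the two descriptions share no dimension and the score is zero even though a reader would consider them close in meaning.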
Meanwhile similitude is calculated in application, existing most methods are not by other such as titles, classification of app and greatly The information such as small are taken into account.And the comment information of such as app is also added thereto by some methods.According to us it has been observed that app Comment information quality it is excessively poor, be generally unable to respond the true content of app.
Therefore, for drawbacks described above present in current existing technology, it is necessary to it is studied, a kind of scheme is provided, Defect existing in the prior art is solved, enables similarity calculation method deeper dependent on app characteristic information.
Summary of the invention
The purpose of the present invention is to provide the similarity calculation methods of mobile application app a kind of, for preferably from magnanimity The most like app of some app is found in the library app, so as to improve app search accuracy rate and recommendation success rate.
To achieve the above object, the technical solution of the present invention is as follows:
A method of the calculating mobile application similitude based on content includes the following steps:
S10. a large amount of app data are crawled and carry out the feature arrangement of data, the feature put in order is saved in database, A feature database is established for inquiry;
S20. it according to the characteristic information of app to be checked, is inquired and is calculated in the feature database, found out to be checked The similar app of app;The characteristic information of the app to be checked is provided or is inquired from the feature database by user and obtained.
Further, step S10 the following steps are included:
S101. a large amount of app data are crawled, structuring is deposited into database after arranging;
S102. the description information of app each in the database is individually integrated into file, then segmented respectively;
S103. the data obtained after the completion of participle, copy merge as complete corpus, then use word2vec Carry out the training of corpus;Another copy carries out the calculating of TF-IDF between each document, obtains then according to original file structure The weight of all keywords in each document;
S104. by the keyword being calculated and its weight write-in HBase, each app packet name is corresponded to wherein going, column pair Should all keywords, be worth for keyword weight, establish feature database for inquiry;
S105. the title of app, type, the similitude of four aspect features of description and application size and with respective are calculated Weight integrated, the similarity last as algorithm.
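The TF-IDF weighting in step S103 can be sketched as follows. The patent does not state which TF-IDF variant it uses, so this sketch assumes relative term frequency multiplied by the natural logarithm of the inverse document frequency; the function name `tf_idf` and the package names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a package name to its already segmented description tokens.

    Returns a mapping of package name to {keyword: weight}.
    """
    n_docs = len(docs)
    # Document frequency: in how many descriptions each term occurs.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    weights = {}
    for pkg, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[pkg] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


docs = {
    "com.example.camera": ["photo", "editor", "photo"],
    "com.example.music": ["music", "player", "editor"],
}
for pkg, row in tf_idf(docs).items():
    print(pkg, row)
```

Under this variant a word that occurs in every description, such as "editor" here, receives a weight of zero, so it does not act as a distinguishing keyword.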
Further, step S20 comprises the following steps:
S201. Obtain the package name of the app to be queried as its unique identifier;
S202. In the feature database in HBase, perform a row-wise query by the app's package name to find all keywords of this app;
S203. For each keyword, use word2vec to find its top K near-synonyms and expand the keyword set with them;
S204. Merge the weights of the expanded keywords and select the top N keywords as the absolute keywords of this app;
S205. Using the absolute keywords, query the feature database in HBase by column to retrieve all apps associated with those keywords, and merge the weights of each app;
S206. Calculate the similarity of name, category, and size between each of these apps and the app to be queried, then combine the similarity values for description, name, category, and size according to their respective weights to give the similarity between each app and the app to be queried;
S207. Sort the apps in descending order of combined weight to build the app similarity ranking; the larger the weight, the more similar the app.
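Steps S203 and S204 can be sketched as below. The `neighbours` dictionary is a hypothetical stand-in for a trained word2vec model, and the rule that an expanded word inherits the keyword's weight scaled by its similarity is an assumption; the patent only says that weights are merged.

```python
from collections import defaultdict


def expand_keywords(keywords, neighbours, k):
    """Add each keyword's top-k near-synonyms and merge weights of equal words.

    keywords:   {word: weight} read from the app's row.
    neighbours: {word: [(synonym, similarity), ...]} sorted by similarity,
                a stand-in for a word2vec nearest-neighbour query.
    """
    merged = defaultdict(float)
    for word, weight in keywords.items():
        merged[word] += weight
        for synonym, similarity in neighbours.get(word, [])[:k]:
            # Assumption: scale the inherited weight by the similarity score.
            merged[synonym] += weight * similarity
    return dict(merged)


def top_n(weights, n):
    """Keep the n heaviest words as the app's absolute keywords."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


keywords = {"photo": 0.5, "editor": 0.2}
neighbours = {"photo": [("picture", 0.8), ("image", 0.7)]}
print(top_n(expand_keywords(keywords, neighbours, k=1), n=2))
```

With K = 1 only "picture" is added for "photo", and with N = 2 the low-weight word "editor" is dropped from the absolute keywords.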
The beneficial effects of the present invention are as follows: it provides a similarity calculation method for mobile apps that better finds the apps most similar to a given app in a massive app library, thereby improving the accuracy of app search and the success rate of recommendation. Specifically:
1) It uses the app's description and calculates near-synonyms with word2vec, which not only reflects the specific semantic content of the app well but also exploits the context within the description to better mine near-synonym features;
2) It combines the app's name, type, size, and description, making full use of the app's features while excluding low-quality information such as user reviews, so the result is more accurate and comprehensive;
3) It uses HBase as the data warehouse for queries, so massive app data can be processed faster.
Brief description of the drawings
Fig. 1 is a flow diagram of an embodiment of the content-based method for calculating mobile application similarity according to the present invention.
Detailed description of the embodiments
For a further understanding of the present invention, preferred embodiments are described below. It should be understood that these descriptions only further explain the features and advantages of the present invention and do not limit its claims.
The present invention provides a content-based method for calculating mobile application similarity that relies on features such as the app's name, description, type, and size to find the apps most similar to a given app. It comprises the following steps:
S1. Crawl a large amount of app data and organize its features; save the organized features in a database to build a feature database for querying;
S2. According to the characteristic information of the app to be queried, query and calculate in the feature database to find apps similar to it; the characteristic information of the app to be queried is either provided by the user or retrieved from the feature database.
The above is described in further detail below with reference to an embodiment.
Step S10: crawl the relevant information of a large number of apps from the network, including each app's name, category, size, and description, and save this information in a relational database.
Step S20: extract the descriptions of all apps and make two copies of them, which are processed separately, as follows:
S201 and S202: obtain all app data from the database and read out each app's name, type, size, and description;
S203: split the app descriptions into individual documents. First segment the content of each document into words, removing stop words and adding reserved words, then calculate TF-IDF (term frequency-inverse document frequency) values over the entire document set to obtain the keywords of each document and their weights;
S204: combine all segmented app descriptions into one large document and use it as the training corpus for word2vec;
S205: store the results of step S203 in the HBase data warehouse to support data retrieval based on app descriptions. The app package name of each document serves as the HBase rowkey, and the keywords serve as the HBase columns. When the calculated description of an app is stored, its package name is the rowkey, each keyword is a column, and the keyword's weight is the value in that column. This not only makes it fast to look up the information for a given app but also allows its keywords to be extended dynamically, which is convenient for searching;
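The table layout described in S205 can be pictured with a small in-memory model. This is only an illustration of the row and column access patterns; it does not use any HBase client API, and the class `FeatureStore`, its methods, and the explicit per-keyword index are hypothetical.

```python
from collections import defaultdict


class FeatureStore:
    """In-memory stand-in for the table: rowkey = package name,
    column = keyword, cell value = keyword weight."""

    def __init__(self):
        self.rows = defaultdict(dict)      # package -> {keyword: weight}
        self.columns = defaultdict(dict)   # keyword -> {package: weight}

    def put(self, package, keyword, weight):
        # New keywords can be added to an existing row at any time.
        self.rows[package][keyword] = weight
        self.columns[keyword][package] = weight

    def get_row(self, package):
        """Row-wise lookup: every keyword and weight of one app."""
        return dict(self.rows.get(package, {}))

    def get_column(self, keyword):
        """Column-wise lookup: every app that carries the keyword."""
        return dict(self.columns.get(keyword, {}))


store = FeatureStore()
store.put("com.example.camera", "photo", 0.46)
store.put("com.example.gallery", "photo", 0.31)
store.put("com.example.camera", "filter", 0.12)
print(store.get_row("com.example.camera"))
print(store.get_column("photo"))
```

The row lookup corresponds to reading the keywords of the app being queried, and the column lookup corresponds to finding candidate apps that share a keyword.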
Step S30: find a way to adjust the weights according to the similarity results. Combine the app's name, type, size, and description and, while still obtaining the best similar apps, use multiple groups of test cases to determine the optimal weights for combining these four attributes.
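The patent does not say how the optimal weights in step S30 are found. One possible approach, shown here purely as an assumption, is a grid search over weight combinations that sum to 1, scored by how often a manually chosen best match is ranked first. The names `weight_grid` and `best_weights` and the case format are hypothetical.

```python
from itertools import product


def weight_grid(step=0.1):
    """Yield every (l1, l2, l3, l4) on a grid with non-negative entries summing to 1."""
    n = round(1 / step)
    for i, j, k in product(range(n + 1), repeat=3):
        if i + j + k <= n:
            yield (i / n, j / n, k / n, (n - i - j - k) / n)


def best_weights(cases, step=0.1):
    """Pick the weights that rank the expected app first in the most cases.

    cases: list of (candidates, expected), where candidates maps an app to its
    four similarity scores (name, category, size, description) and expected is
    the app judged most similar by a person.
    """
    def hits(weights):
        count = 0
        for candidates, expected in cases:
            top = max(
                candidates,
                key=lambda app: sum(w * s for w, s in zip(weights, candidates[app])),
            )
            count += top == expected
        return count

    return max(weight_grid(step), key=hits)


cases = [({"app_x": (1.0, 1.0, 1.0, 0.0), "app_y": (0.0, 0.0, 0.0, 1.0)}, "app_y")]
print(best_weights(cases, step=0.5))
```

A grid search is simple but grows quickly as the step shrinks; with many labelled cases a finer optimizer could replace it without changing the weighted-sum model.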
Step S40: once the data has been prepared, the search for similar apps can begin. It comprises the following steps:
S401: obtain the package name of the app to be retrieved;
S402: in HBase, retrieve the row whose rowkey is the package name of the app to be retrieved, and read all of its keywords and weights from that row;
S403: use the word2vec training result to expand all keywords with synonyms, calculate a weight for each expanded word, then merge identical words and add their weights together;
S404: using the expanded keywords and their weights, and after normalizing the weights in each word's column, look up the corresponding apps column-wise in the HBase data warehouse. Each word corresponds to multiple apps; calculate the weight of each app, combine the results, sort them in descending order, and filter out the apps most similar by description;
S405: for the apps obtained in S404, extract information such as their name, type, and size;
S406: calculate the similarity between each app's name and the name of the app being retrieved using edit distance;
S407: calculate the similarity between each app's type and the type of the app being retrieved by case analysis (the case formula is not reproduced in this text);
S408: calculate the similarity between each app's size and the size of the app being retrieved, using the following formula:
[Formula image not reproduced in this text.]
where a is the app to be retrieved, x is each app found to be similar by description in S404, size_max is the size of the app that occupies the most space among these similar apps, and size_min is the size of the app that occupies the least space among them.
S409: combine the similarities calculated for the app's name, type, size, and description, weighted by attribute, to obtain the final similarity value:
$$\mathrm{Similarity} = \lambda_1 \mathrm{Sim}_{name} + \lambda_2 \mathrm{Sim}_{category} + \lambda_3 \mathrm{Sim}_{size} + \lambda_4 \mathrm{Sim}_{description}$$
where name refers to the app name, category to the app type, size to the app size, and description to the app description; $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weights given to the name, type, size, and description respectively when the similarities are combined, and $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$.
The results are then sorted and filtered by Similarity value to obtain the one or more most similar apps.
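Steps S406 to S409 and the final sort can be sketched as follows. Several parts are assumptions because the source does not reproduce them: the name similarity is normalized as 1 minus edit distance divided by the longer name's length, the type similarity is a simple exact-match test, and the size similarity is 1 minus the size difference divided by (size_max - size_min), clamped at 0. All function names and the sample data are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def category_similarity(a, b):
    # Assumption: the patent's case formula is unavailable, so use exact match.
    return 1.0 if a == b else 0.0


def size_similarity(size_a, size_x, size_min, size_max):
    # Assumption: the patent's size formula is unavailable; this is one plausible form.
    if size_max == size_min:
        return 1.0
    return max(0.0, 1.0 - abs(size_a - size_x) / (size_max - size_min))


def rank(query, candidates, weights):
    """Weighted sum of the four similarities, sorted in descending order."""
    if not candidates:
        return []
    l1, l2, l3, l4 = weights
    sizes = [c["size"] for c in candidates]
    size_min, size_max = min(sizes), max(sizes)
    scored = []
    for c in candidates:
        score = (l1 * name_similarity(query["name"], c["name"])
                 + l2 * category_similarity(query["category"], c["category"])
                 + l3 * size_similarity(query["size"], c["size"], size_min, size_max)
                 + l4 * c["description_sim"])
        scored.append((c["package"], score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


query = {"name": "Photo Editor", "category": "photography", "size": 20}
candidates = [
    {"package": "com.example.music", "name": "Music Player",
     "category": "music", "size": 60, "description_sim": 0.1},
    {"package": "com.example.photo", "name": "Photo Editor Pro",
     "category": "photography", "size": 25, "description_sim": 0.9},
]
for package, score in rank(query, candidates, (0.2, 0.2, 0.1, 0.5)):
    print(package, round(score, 3))
```

The `description_sim` field stands for the description-based score produced in S404, assumed here to be already scaled to the range 0 to 1 so that it is comparable with the other three similarities.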
The above description of the embodiment is only intended to help in understanding the method of the present invention and its core idea. It should be noted that those skilled in the art may make improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims.

Claims (2)

1. A content-based method for calculating mobile application similarity, characterized by comprising the following steps:
S10. Crawl a large amount of app data and organize its features; save the organized features in a database to build a feature database for querying, which specifically comprises:
S101. Crawl a large amount of app data, structure it, and store it in a database;
S102. Write the description of each app in the database to its own file, then segment each file into words;
S103. Make two copies of the segmented data. Merge one copy into a single corpus and train word2vec on it; keep the other copy in the original per-file structure and calculate TF-IDF across the documents to obtain the weight of every keyword in each document;
S104. Write the calculated keywords and their weights to HBase, where each row corresponds to an app package name, each column to a keyword, and each value to the keyword's weight, thereby building the feature database for querying;
S105. Calculate the similarity of four features of the app (name, type, description, and size) and combine them according to their respective weights to give the final similarity of the algorithm;
S20. According to the characteristic information of the app to be queried, query and calculate in the feature database to find apps similar to it; the characteristic information of the app to be queried is either provided by the user or retrieved from the feature database.
2. The content-based method for calculating mobile application similarity according to claim 1, characterized in that step S20 comprises the following steps:
S201. Obtain the package name of the app to be queried as its unique identifier;
S202. In the feature database in HBase, perform a row-wise query by the app's package name to find all keywords of this app;
S203. For each keyword, use word2vec to find its top K near-synonyms and expand the keyword set with them;
S204. Merge the weights of the expanded keywords and select the top N keywords as the absolute keywords of this app;
S205. Using the absolute keywords, query the feature database in HBase by column to retrieve all apps associated with those keywords, and merge the weights of each app;
S206. Calculate the similarity of name, category, and size between each of these apps and the app to be queried, then combine the similarity values for description, name, category, and size according to their respective weights to give the similarity between each app and the app to be queried;
S207. Sort the apps in descending order of combined weight to build the app similarity ranking; the larger the weight, the more similar the app.
CN201510776878.9A 2015-09-28 2015-11-13 A content-based method for calculating mobile application similarity Active CN105677695B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510626874 2015-09-28
CN2015106268742 2015-09-28

Publications (2)

Publication Number Publication Date
CN105677695A CN105677695A (en) 2016-06-15
CN105677695B true CN105677695B (en) 2019-03-08

Family

ID=56946915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510776878.9A Active CN105677695B (en) 2015-09-28 2015-11-13 A method of the calculating mobile application similitude based on content

Country Status (1)

Country Link
CN (1) CN105677695B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255522A (en) * 2016-12-27 2018-07-06 北京金山云网络技术有限公司 A kind of application program sorting technique and device
CN108319449B (en) * 2017-01-16 2021-07-20 北京金山云网络技术有限公司 Method and device for determining application architecture
CN109002441A (en) * 2017-06-06 2018-12-14 阿里巴巴集团控股有限公司 Determination method, the exception of Apply Names similarity apply detection method and system
CN108170664B (en) * 2017-11-29 2021-04-09 有米科技股份有限公司 Key word expansion method and device based on key words
CN108170665B (en) * 2017-11-29 2021-06-04 有米科技股份有限公司 Keyword expansion method and device based on comprehensive similarity
CN108182201B (en) * 2017-11-29 2020-06-30 有米科技股份有限公司 Application expansion method and device based on key keywords
CN108804492B (en) * 2018-03-27 2022-04-29 阿里巴巴(中国)有限公司 Method and device for recommending multimedia objects
CN113868533B (en) * 2021-09-30 2025-04-22 北京达佳互联信息技术有限公司 Application search method, device, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530339A (en) * 2013-10-08 2014-01-22 北京百度网讯科技有限公司 Mobile application information push method and device
CN103955536A (en) * 2014-05-15 2014-07-30 深圳市中兴移动通信有限公司 Classification method and device of applications
CN104424307A (en) * 2013-09-04 2015-03-18 腾讯科技(深圳)有限公司 Intelligent terminal application classifying method, system and intelligent terminal,
CN104750798A (en) * 2015-03-19 2015-07-01 腾讯科技(深圳)有限公司 Application program recommendation method and device
CN104778178A (en) * 2014-01-13 2015-07-15 腾讯科技(深圳)有限公司 Application classification method, application classification device and service server
CN104866526A (en) * 2015-04-21 2015-08-26 惠州Tcl移动通信有限公司 Intelligent terminal and method for recommending applications thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8190634B2 (en) * 2006-10-10 2012-05-29 Canon Kabushiki Kaisha Image display controlling apparatus, method of controlling image display, and storage medium

Also Published As

Publication number Publication date
CN105677695A (en) 2016-06-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant