CN102622451A - System for automatically generating television program labels - Google Patents
- Publication number
- CN102622451A (application CN2012101100313A)
- Authority
- CN
- China
- Prior art keywords
- program
- module
- entry
- keyword
- programme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a system for automatically generating television program tags. It comprises: a program information acquisition module that crawls pages related to a program and prunes and filters them to obtain the main content describing the program; an information keyword extraction module that aggregates the main content and extracts keywords from it; a knowledge base module that establishes a network of relationships between entries, used to expand the extracted keywords; a keyword expansion module that uses the network provided by the knowledge base module to expand the keywords into a larger set of entries; and a tag generation module that processes the associated entry sets of the keywords, filters out noise, computes scores, and generates the program's tag set. The invention fills a gap in systems for automatically generating television program tags. Introducing the knowledge base means the system is not limited to what appears on web pages, which gives it better extensibility and a better ability to discover tags. Because the knowledge base can be built offline and the tag generation algorithm is simple, the system is also efficient.
Description
Technical field
The invention relates to a system in the field of computer application technology, specifically a system for automatically generating television program tags.
Background
How to help people make better choices has long been a significant and interesting problem. People choose on the basis of information: combining the information they have gathered with their own views and preferences produces a choice. Obtaining information, however, is not simple. In the past, when networks were undeveloped and exchanging information was inconvenient, scarce information and the difficulty of comparison were obstacles to choosing. In the information age, information is a mouse click away, but this brings a different problem: information overload. Faced with massive amounts of information, merely identifying and screening it takes a great deal of time, which is again an obstacle to choosing. Automatic tag generation systems arose to solve this problem. By extracting the main body of the information, summarizing its content, and analyzing its keywords, such a system generates a tag set for the information. With a tag set, people can quickly grasp the gist of the information, and the tags also provide a basis for classifying it; both help people make choices.
There has been much research on automatic tag generation, but it focuses mainly on text processing, that is, automatically generating tags for a given document. Jialie Shen et al. [1] studied the automatic generation of music tags: audio features are extracted, manually tagged music is used as training material, a classifier is built by machine learning, and the classifier is then used to tag music. Stefan Siersdorfer et al. [2] proposed a scheme for supplementing video tags that uses existing video comparison techniques to merge the existing tags of similar videos, but this is not automatic tag generation in the true sense. At present, therefore, tagging video still relies mainly on manual work, and automatic tag generation for television programs remains an open area.
[1] Jialie Shen, Meng Wang, Shuicheng Yan, HweeHwa Pang, Xiansheng Hua. Effective Music Tagging through Advanced Statistical Modeling. SIGIR 2010.
[2] Stefan Siersdorfer, Jose San Pedro, Mark Sanderson. Automatic Video Tagging using Content Redundancy. SIGIR 2009.
Summary of the invention
To address the above deficiencies in the prior art, the present invention provides a system for automatically generating television program tags. The system needs only the name of a television program: it automatically obtains information related to the program from the Internet, then summarizes and expands that information, and returns a tag set for the program.
The present invention is achieved through the following technical solution.
A system for automatically generating television program tags comprises a program information acquisition module, an information keyword extraction module, a keyword expansion module, and a tag generation module connected in sequence, and further comprises a knowledge base module connected to the keyword expansion module, wherein:
- the program information acquisition module crawls pages related to the program from the Internet and, by pruning and filtering the pages, obtains the main content describing the program;
- the information keyword extraction module aggregates the main content obtained by the program information acquisition module and extracts keywords from it;
- the knowledge base module establishes the network of relationships between entries, which is used to expand the extracted keywords;
- the keyword expansion module uses the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger set of entries;
- the tag generation module processes the associated entry sets of all the keywords, filters out noise, computes scores, and finally generates the program's tag set.
The program information acquisition module includes an HTML parser. It receives the set of target television programs for which tags are to be generated and, with the help of a search engine, obtains web pages for each program. The HTML parser processes these pages to obtain the main content, which is passed to the information keyword extraction module for further processing.
The information keyword extraction module includes a word segmenter and part-of-speech tagger. After the main content describing each program is obtained, the segmenter and tagger split the content into words, and only nouns are kept.
Keywords are identified among these nouns by a statistical method.
The statistical method comprises the following steps:
Step 1: for a given program, the words are divided into two groups, one drawn from the web pages related to that program and one drawn from the other web pages in the program set;
Step 2: word frequencies are computed for both groups, together with their mean and standard deviation, so that each word is characterized by four statistics: the mean and standard deviation of its frequency on pages related to the program, and the mean and standard deviation of its frequency on pages unrelated to the program;
Step 3: based on the relationships among the four statistics, the keywords that best characterize the program are identified.
The knowledge base module uses Baidu Encyclopedia as its data source and stores it in the form of a graph.
The organization of Baidu Encyclopedia, and of the knowledge base built from it, comprises the following steps:
Step 1: each entry has a page that describes it; besides plain text, the page also cites other entries that already exist in Baidu Encyclopedia;
Step 2: in the knowledge base graph, there is a directed edge between each described entry and each entry it cites; applying the PageRank algorithm to this graph yields the importance of each entry;
Step 3: the weights of the entries and the citation relationships between entries together constitute the knowledge base.
For each keyword obtained by the information keyword extraction module, the keyword expansion module finds the other entries in the knowledge base graph that are connected to it by a path, and computes a weight for each such entry from the entry's own importance and its distance from the keyword.
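The expansion step can be sketched as a bounded breadth-first search over the citation graph. The patent says only that the weight depends on the entry's importance and its distance from the keyword; the formula used here (importance multiplied by `decay ** distance`), the depth limit `max_depth`, and all function and parameter names are assumptions made for illustration.

```python
from collections import deque


def expand_keyword(keyword, links, importance, max_depth=2, decay=0.5):
    """Return {entry: weight} for entries reachable from `keyword`.

    links:      {entry: [entries cited on its page]} (directed graph)
    importance: {entry: importance score, e.g. from PageRank}
    The weighting formula is an assumption: importance * decay ** distance.
    """
    weights = {}
    seen = {keyword}
    queue = deque([(keyword, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_depth:
            continue  # do not expand beyond the assumed depth limit
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                # Breadth-first order guarantees dist + 1 is the shortest distance.
                weights[nxt] = importance.get(nxt, 0.0) * decay ** (dist + 1)
                queue.append((nxt, dist + 1))
    return weights
```

Bounding the search depth keeps the expanded set small; without some limit, a well-connected encyclopedia graph would make almost every entry reachable from every keyword.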
The tag generation module merges the associated entries of all the keywords. When an entry is associated with several keywords, its weights under each of those keywords are added together. All entries are then sorted by total weight and the top few are returned as required, giving the tag set that describes the program.
In operation, the system is first given the set of target television programs for which tags are to be generated. With the help of a search engine, the program information acquisition module obtains a certain number of web pages for each program. The module's HTML parser processes these pages to obtain the main content, which is passed to the information keyword extraction module for further processing.
After the information keyword extraction module receives the main content describing each program, its word segmenter and part-of-speech tagger split the content into words, and only nouns are kept. Keywords are then identified among these words statistically, as follows. For a given program, the words are divided into two groups: one drawn from the web pages related to that program, and one drawn from the other web pages in the program set. Word frequencies are computed for both groups, along with their mean and standard deviation. Each word is thus characterized by four statistics: the mean and standard deviation of its frequency on pages related to the program, and the mean and standard deviation of its frequency on pages unrelated to the program. From the relationships among the four statistics, the keywords that best characterize the program can be identified.
Keywords extracted from web pages already reflect the program's characteristics to some extent, but their range is limited: they must appear on the web pages. To overcome this limitation, an important feature of the invention is the knowledge base module. The knowledge base module uses Baidu Encyclopedia as its data source and stores it in the form of a graph. Baidu Encyclopedia is organized so that each entry has a page describing it; besides plain text, the page also cites other entries that already exist in Baidu Encyclopedia. In the knowledge base graph, there is a directed edge between each described entry and each entry it cites. Applying the PageRank algorithm to this graph gives the importance of each entry. The weights of the entries and the citation relationships between entries together constitute the knowledge base.
The task of the keyword expansion module is then simple: for each keyword obtained by the information keyword extraction module, it finds the other entries in the knowledge base graph that are connected to the keyword by a path, and computes each entry's weight from the entry's own importance and its distance from the keyword.
The tag generation module is the last stage of the system. The information keyword extraction module yields a keyword set that reflects the program's characteristics, and the keyword expansion module yields, for each keyword, a set of associated entries, each with a weight. The tag generation module integrates these two results by merging the associated entries of all the keywords. When an entry is associated with several keywords, its weights under each of those keywords are added together. Sorting all entries by total weight and returning the top few as required gives the tag set that describes the program.
Compared with the prior art, the present invention fills a gap in systems for automatically generating television program tags. Introducing the knowledge base means the system is not limited to what appears on web pages, which gives it better extensibility and a better ability to discover tags. Because the knowledge base can be built offline and the tag generation algorithm is simple, the system is also efficient.
Brief description of the drawings
Fig. 1 shows a block diagram of the system modules of the present invention;
Fig. 2 shows implementation details of the program information acquisition module;
Fig. 3 shows how the entry list is generated in the information keyword extraction module;
Fig. 4 shows how the keywords are generated in the information keyword extraction module.
Detailed description
An embodiment of the present invention is described in detail below with reference to the drawings. The embodiment is implemented on the basis of the technical solution of the invention, and a detailed implementation and specific operating procedure are given, but the scope of protection of the invention is not limited to the embodiment described below.
The task in this embodiment is to automatically generate tags for a group of ten television programs: program 1, program 2, program 3, program 4, program 5, program 6, program 7, program 8, program 9, and program 10.
As shown in Fig. 1, this embodiment comprises five modules: a program information acquisition module, an information keyword extraction module, a knowledge base module, a keyword expansion module, and a tag generation module. The program information acquisition module, information keyword extraction module, keyword expansion module, and tag generation module are connected in sequence, and the knowledge base module is connected to the keyword expansion module. The program information acquisition module crawls pages related to the 10 programs from the Internet and, by pruning and filtering the pages, obtains the main content describing each program. The information keyword extraction module aggregates the main content obtained by the program information acquisition module and extracts keywords from it. The knowledge base module establishes the network of relationships between entries, which is used to expand the extracted keywords. The keyword expansion module uses the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger set of entries. The tag generation module processes the entry set, filters out noise, computes scores, and finally generates the program's tag set.
As shown in Fig. 2, the program information acquisition module includes an HTML parser. It receives the set of target television programs for which tags are to be generated and, with the help of a search engine, obtains web pages for each program; the HTML parser processes these pages to obtain the main content, which is passed to the information keyword extraction module for further processing. Specifically, the module uses the search engine to obtain 10 pages (HTML files) related to the target program. Removing the useless markup in these HTML files, such as advertisements, images, titles, and scripts, yields 10 documents describing the program.
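The pruning step can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The patent does not say how its HTML parser works, so the class name `MainTextExtractor`, the function `extract_main_text`, and the choice of which tags to skip are assumptions. Detecting advertisements would need site-specific rules and is not attempted here.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects visible text and skips the content of tags treated as noise."""

    # Assumption: this tag list is illustrative, not taken from the patent.
    SKIPPED = {"script", "style", "title", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        # Images carry no text, so <img> tags are dropped simply by being ignored.

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text of an HTML page with scripts, styles and titles removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Applied to each of the 10 pages retrieved for a program, a function like this would produce the 10 plain-text documents that the next module consumes.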
As shown in Fig. 3, the information keyword extraction module includes a word segmenter and part-of-speech tagger. After the main content describing each program is obtained, the segmenter and tagger split the content into words, and only nouns are kept. Specifically, the documents returned by the program information acquisition module are first segmented and part-of-speech tagged by the information keyword extraction module, and only the nouns are retained, so that each document is converted into a set of words. The 10 documents for one program contain repeated words, so the words of the 10 documents are hashed and the frequency of each word in each document is counted. The result, for each program, is an entry list in which every item is a data structure containing the entry's text and its frequency in each of the 10 documents.
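The entry list described above can be sketched with a hash-based counter per document. This sketch assumes segmentation and tagging have already been done by some external tool and that noun tags begin with `"n"` (a convention used by several Chinese tag sets); the function name `build_term_list` and the input layout are hypothetical.

```python
from collections import Counter


def build_term_list(tagged_docs):
    """Build the per-program entry list.

    tagged_docs: list of documents, each a list of (word, pos_tag) pairs.
    Returns {word: [frequency in doc 0, frequency in doc 1, ...]} for nouns only.
    """
    # One hash-based counter per document, keeping only noun-tagged words.
    per_doc = [
        Counter(word for word, pos in doc if pos.startswith("n"))
        for doc in tagged_docs
    ]
    vocabulary = set().union(*per_doc)
    # A Counter returns 0 for words that a document does not contain.
    return {word: [counts[word] for counts in per_doc] for word in vocabulary}
```

Keeping one frequency per document, rather than a single total, is what makes the mean and standard deviation in the next step computable.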
Note that keywords are identified among the nouns by a statistical method.
The statistical method comprises the following steps:
Step 1: for a given program, the words are divided into two groups, one drawn from the web pages related to that program and one drawn from the other web pages in the program set;
Step 2: word frequencies are computed for both groups, together with their mean and standard deviation, so that each word is characterized by four statistics: the mean and standard deviation of its frequency on pages related to the program, and the mean and standard deviation of its frequency on pages unrelated to the program;
Step 3: based on the relationships among the four statistics, the keywords that best characterize the program are identified.
As shown in Fig. 4, the entry list is processed further to obtain the final keyword list. For each word in the target program's entry list, four statistics are computed: the mean and standard deviation of the word's frequency in the target program, and the mean and standard deviation of its frequency in the other programs. With the four statistics in hand, the words are first classified by the following rules:
Category 1: the mean and standard deviation of the word's frequency in the other programs are both 0;
Category 2: the mean and standard deviation of the word's frequency in the other programs are both nonzero, and the mean in the target program is larger than the mean in the other programs while the standard deviation in the target program is smaller than that in the other programs;
Category 3: all cases not covered by categories 1 and 2.
A score is then computed within each category by the following rules:
Category 1: the mean in the target program divided by the standard deviation in the target program;
Category 2: the mean in the target program, multiplied by the standard deviation in the other programs, divided by the standard deviation in the target program, and then divided by the mean in the other programs;
Category 3: the score is set to 0.
The words are then sorted: category 1 ranks above category 2, and category 2 above category 3; within a category, words are ordered by score. Finally, the top 20 words are output as the keyword list.
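The classification and scoring rules above can be sketched as follows. Several details are not specified in the patent and are assumptions here: the population standard deviation (`statistics.pstdev`) is used, a zero standard deviation in the target program is treated as an infinite score rather than a division error, and a word absent from the other programs is given a single zero frequency. The function name `rank_keywords` is hypothetical.

```python
from statistics import mean, pstdev


def rank_keywords(target_freqs, other_freqs, top_n=20):
    """Rank words for one program.

    target_freqs: {word: [frequency in each document of the target program]}
    other_freqs:  {word: [frequency in each document of the other programs]}
    """
    ranked = []
    for word, t in target_freqs.items():
        o = other_freqs.get(word, [0])  # assumption: absent means frequency 0
        mt, st = mean(t), pstdev(t)
        mo, so = mean(o), pstdev(o)
        if mo == 0 and so == 0:
            category = 1
            score = mt / st if st > 0 else float("inf")
        elif mo > 0 and so > 0 and mt > mo and st < so:
            category = 2
            score = (mt * so) / (st * mo) if st > 0 else float("inf")
        else:
            category, score = 3, 0.0
        # Lower category first; higher score first within a category.
        ranked.append((category, -score, word))
    ranked.sort()
    return [word for _, _, word in ranked[:top_n]]
```

Both scores reward words that are frequent and evenly spread across the target program's pages; the category 2 score additionally rewards words that are rare or unevenly spread elsewhere.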
The knowledge base module uses Baidu Encyclopedia as its data source and stores it in the form of a graph.
Note that the organization of Baidu Encyclopedia, and of the knowledge base built from it, comprises the following steps:
Step 1: each entry has a page that describes it; besides plain text, the page also cites other entries that already exist in Baidu Encyclopedia;
Step 2: in the knowledge base graph, there is a directed edge between each described entry and each entry it cites; applying the PageRank algorithm to this graph yields the importance of each entry;
Step 3: the weights of the entries and the citation relationships between entries together constitute the knowledge base.
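The importance computation in step 2 can be sketched as a plain power-iteration PageRank. The patent names the algorithm but gives no parameters, so the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from entries with no outgoing citations, and the edge direction (from the described entry to the cited entry) are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute an importance score for every entry.

    links: {entry: [entries cited on that entry's page]}; each citation is
    treated as a directed edge from the describing entry to the cited entry.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by entries without outgoing citations is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank
```

Because the graph depends only on the encyclopedia and not on any particular program, this computation can run offline, which is consistent with the patent's remark that the knowledge base can be built in advance.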
Each keyword in the keyword list is passed through the keyword expansion module to obtain a set of associated entries, each with a weight. The tag generation module integrates these results by merging the associated entries of all the keywords. When an entry is associated with several keywords, its weights under each of those keywords are added together. Sorting all entries by total weight and returning the top 20 gives the tag set that describes the program.
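The merging step is simple enough to sketch directly. The input layout (a mapping from each keyword to its weighted entry set) and the alphabetical tie-break between entries of equal total weight are assumptions; the patent specifies only the summation, the sort by total weight, and the top-20 cut.

```python
from collections import defaultdict


def generate_tags(expansions, top_n=20):
    """Merge expanded entry sets into a ranked tag list.

    expansions: {keyword: {entry: weight}} as produced by keyword expansion.
    """
    totals = defaultdict(float)
    for entry_weights in expansions.values():
        for entry, weight in entry_weights.items():
            totals[entry] += weight  # entries shared by keywords accumulate weight
    # Highest total first; ties broken alphabetically (an assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [entry for entry, _ in ranked[:top_n]]
```

Summing across keywords favors entries related to several of a program's keywords at once, which is presumably how the noise filtering mentioned in the module description is achieved.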
Repeating the above process for each of the 10 programs in this example completes the task of automatically generating tags for them.
A specific embodiment of the present invention has been described above. It should be understood that the invention is not limited to this particular embodiment; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2012101100313A CN102622451A (en) | 2012-04-16 | 2012-04-16 | System for automatically generating television program labels |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2012101100313A CN102622451A (en) | 2012-04-16 | 2012-04-16 | System for automatically generating television program labels |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN102622451A (en) | 2012-08-01 |
Family
ID=46562369
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2012101100313A Pending CN102622451A (en) | 2012-04-16 | 2012-04-16 | System for automatically generating television program labels |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN102622451A (en) |
Cited By (23)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103152633A (en) * | 2013-03-25 | 2013-06-12 | 天脉聚源(北京)传媒科技有限公司 | Method and device for identifying key word |
| CN103186662A (en) * | 2012-12-28 | 2013-07-03 | 中联竞成(北京)科技有限公司 | System and method for extracting dynamic public sentiment keywords |
| CN103686406A (en) * | 2013-12-03 | 2014-03-26 | 青岛海信传媒网络技术有限公司 | Method and device for digital television to control intelligent terminal to display information |
| CN104978400A (en) * | 2015-06-04 | 2015-10-14 | 无锡天脉聚源传媒科技有限公司 | Method for generating video album name and apparatus |
| CN104978403A (en) * | 2015-06-04 | 2015-10-14 | 无锡天脉聚源传媒科技有限公司 | Generating method and apparatus for name of video album |
| CN104978332A (en) * | 2014-04-04 | 2015-10-14 | 腾讯科技(深圳)有限公司 | UGC label data generating method, UGC label data generating device, relevant method and relevant device |
| CN105704573A (en) * | 2014-09-25 | 2016-06-22 | 财团法人资讯工业策进会 | Television program shopping guide system and method thereof |
| CN105847948A (en) * | 2016-03-28 | 2016-08-10 | 乐视控股(北京)有限公司 | Data processing method and device |
| CN106933806A (en) * | 2017-03-15 | 2017-07-07 | 北京大数医达科技有限公司 | The determination method and apparatus of medical synonym |
| CN107291930A (en) * | 2017-06-29 | 2017-10-24 | 环球智达科技(北京)有限公司 | The computational methods of weight number |
| CN107302726A (en) * | 2017-06-30 | 2017-10-27 | 环球智达科技(北京)有限公司 | The label generating method of programme information |
| CN107333370A (en) * | 2017-08-10 | 2017-11-07 | 佛山市三水区彦海通信工程有限公司 | A kind of intelligent atmosphere lamp adjusting method |
| CN107844526A (en) * | 2017-10-12 | 2018-03-27 | 广州艾媒数聚信息咨询股份有限公司 | A kind of lexical relation link analysis method, system and device in knowledge based storehouse |
| CN107908654A (en) * | 2017-10-12 | 2018-04-13 | 广州艾媒数聚信息咨询股份有限公司 | A kind of recommendation method, system and device in knowledge based storehouse |
| CN108009293A (en) * | 2017-12-26 | 2018-05-08 | 北京百度网讯科技有限公司 | Video tab generation method, device, computer equipment and storage medium |
| CN108446276A (en) * | 2018-03-21 | 2018-08-24 | 腾讯音乐娱乐科技(深圳)有限公司 | The method and apparatus for determining the single keyword of song |
| CN109635171A (en) * | 2018-12-13 | 2019-04-16 | 成都索贝数码科技股份有限公司 | A kind of fusion reasoning system and method for news program intelligent label |
| CN109800323A (en) * | 2019-02-19 | 2019-05-24 | 标贝(深圳)科技有限公司 | A kind of voice data management method, system and storage medium |
| CN109920409A (en) * | 2019-02-19 | 2019-06-21 | 标贝(深圳)科技有限公司 | A kind of speech search method, device, system and storage medium |
| CN110019955A (en) * | 2017-12-15 | 2019-07-16 | 青岛聚看云科技有限公司 | A kind of video tab mask method and device |
| CN110225404A (en) * | 2019-06-17 | 2019-09-10 | 深圳市正易龙科技有限公司 | Video broadcasting method, terminal and computer readable storage medium |
| CN111090754A (en) * | 2019-11-20 | 2020-05-01 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
| CN111491206A (en) * | 2020-04-17 | 2020-08-04 | 维沃移动通信有限公司 | Video processing method, video processing device and electronic equipment |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1596406A (en) * | 2001-11-28 | 2005-03-16 | 皇家飞利浦电子股份有限公司 | System and method for retrieving information related to targeted subjects |
| CN1640131A (en) * | 2002-02-25 | 2005-07-13 | 皇家飞利浦电子股份有限公司 | Method and system for retrieving information about television programs |
| WO2010117213A2 (en) * | 2009-04-10 | 2010-10-14 | Samsung Electronics Co., Ltd. | Apparatus and method for providing information related to broadcasting programs |
| CN102075695A (en) * | 2010-12-30 | 2011-05-25 | 中国科学院自动化研究所 | New-generation intelligent cataloging system and method for large volumes of broadcast television programs |
| CN102207945A (en) * | 2010-05-11 | 2011-10-05 | 天津海量信息技术有限公司 | Knowledge network-based text indexing system and method |
- 2012-04-16: CN application CN2012101100313A (CN102622451A), status: active, Pending
Cited By (31)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103186662B (en) * | 2012-12-28 | 2016-08-03 | 北京中油网资讯技术有限公司 | Dynamic public sentiment keyword extraction system and method |
| CN103186662A (en) * | 2012-12-28 | 2013-07-03 | 中联竞成(北京)科技有限公司 | System and method for extracting dynamic public sentiment keywords |
| CN103152633A (en) * | 2013-03-25 | 2013-06-12 | 天脉聚源(北京)传媒科技有限公司 | Method and device for identifying keywords |
| CN103152633B (en) * | 2013-03-25 | 2015-12-23 | 天脉聚源(北京)传媒科技有限公司 | Keyword recognition method and device |
| CN103686406A (en) * | 2013-12-03 | 2014-03-26 | 青岛海信传媒网络技术有限公司 | Method and device for digital television to control intelligent terminal to display information |
| CN104978332B (en) * | 2014-04-04 | 2019-06-14 | 腾讯科技(深圳)有限公司 | User-generated content label data generation method and device, and related method and device |
| CN104978332A (en) * | 2014-04-04 | 2015-10-14 | 腾讯科技(深圳)有限公司 | UGC label data generating method, UGC label data generating device, relevant method and relevant device |
| CN105704573A (en) * | 2014-09-25 | 2016-06-22 | 财团法人资讯工业策进会 | Television program shopping guide system and method thereof |
| CN104978400A (en) * | 2015-06-04 | 2015-10-14 | 无锡天脉聚源传媒科技有限公司 | Method and apparatus for generating video album name |
| CN104978403A (en) * | 2015-06-04 | 2015-10-14 | 无锡天脉聚源传媒科技有限公司 | Generating method and apparatus for name of video album |
| CN105847948A (en) * | 2016-03-28 | 2016-08-10 | 乐视控股(北京)有限公司 | Data processing method and device |
| CN106933806A (en) * | 2017-03-15 | 2017-07-07 | 北京大数医达科技有限公司 | Method and apparatus for determining medical synonyms |
| CN107291930A (en) * | 2017-06-29 | 2017-10-24 | 环球智达科技(北京)有限公司 | Weight number calculation method |
| CN107302726A (en) * | 2017-06-30 | 2017-10-27 | 环球智达科技(北京)有限公司 | Label generation method for program information |
| CN107333370A (en) * | 2017-08-10 | 2017-11-07 | 佛山市三水区彦海通信工程有限公司 | Intelligent atmosphere lamp adjusting method |
| CN107844526A (en) * | 2017-10-12 | 2018-03-27 | 广州艾媒数聚信息咨询股份有限公司 | Knowledge base-based vocabulary relation chain analysis method, system and device |
| CN107908654B (en) * | 2017-10-12 | 2021-12-07 | 广州艾媒数聚信息咨询股份有限公司 | Knowledge base-based recommendation method, system and device |
| CN107908654A (en) * | 2017-10-12 | 2018-04-13 | 广州艾媒数聚信息咨询股份有限公司 | Knowledge base-based recommendation method, system and device |
| CN107844526B (en) * | 2017-10-12 | 2022-04-01 | 广州艾媒数聚信息咨询股份有限公司 | Knowledge base-based vocabulary relation chain analysis method, system and device |
| CN110019955A (en) * | 2017-12-15 | 2019-07-16 | 青岛聚看云科技有限公司 | Video tag annotation method and device |
| CN108009293A (en) * | 2017-12-26 | 2018-05-08 | 北京百度网讯科技有限公司 | Video tag generation method, device, computer equipment and storage medium |
| CN108446276A (en) * | 2018-03-21 | 2018-08-24 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and apparatus for determining keywords of a song list |
| CN109635171A (en) * | 2018-12-13 | 2019-04-16 | 成都索贝数码科技股份有限公司 | Fusion reasoning system and method for smart tags of news programs |
| CN109635171B (en) * | 2018-12-13 | 2022-11-29 | 成都索贝数码科技股份有限公司 | Fusion reasoning system and method for smart tags of news programs |
| CN109800323A (en) * | 2019-02-19 | 2019-05-24 | 标贝(深圳)科技有限公司 | Voice data management method, system and storage medium |
| CN109920409B (en) * | 2019-02-19 | 2021-07-09 | 标贝(深圳)科技有限公司 | Sound retrieval method, device, system and storage medium |
| CN109920409A (en) * | 2019-02-19 | 2019-06-21 | 标贝(深圳)科技有限公司 | Sound retrieval method, device, system and storage medium |
| CN110225404A (en) * | 2019-06-17 | 2019-09-10 | 深圳市正易龙科技有限公司 | Video broadcasting method, terminal and computer readable storage medium |
| CN111090754A (en) * | 2019-11-20 | 2020-05-01 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
| CN111090754B (en) * | 2019-11-20 | 2023-04-07 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
| CN111491206A (en) * | 2020-04-17 | 2020-08-04 | 维沃移动通信有限公司 | Video processing method, video processing device and electronic equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN102622451A (en) | | System for automatically generating television program labels |
| CN101593200B (en) | | Method for classifying Chinese webpages based on keyword frequency analysis |
| Kawade et al. | | Sentiment analysis: machine learning approach |
| CN103226578B (en) | | Method for website identification and webpage segmentation in the medical field |
| KR100505848B1 (en) | | Search system |
| CN104268148B (en) | | Method and system for automatic extraction of forum page information based on time strings |
| CN107885793A (en) | | Hot microblog topic analysis and prediction method and system |
| CN105404699A (en) | | Method, device and server for searching articles of finance and economics |
| CN103823824A (en) | | Method and system for automatically constructing text classification corpus with the aid of the internet |
| CN101661513A (en) | | Detection method of network focus and public sentiment |
| CN105447081A (en) | | Cloud platform-oriented government public opinion monitoring method |
| CN102236867A (en) | | Cloud computing-based audience behavioral analysis advertisement targeting system |
| CN102708096A (en) | | Network intelligence public sentiment monitoring system based on semantics and working method thereof |
| CN110457579B (en) | | Webpage denoising method and system based on cooperative work of template and classifier |
| CN102955771A (en) | | Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode |
| CN107943514A (en) | | Method and system for mining core code elements in software documents |
| CN103870495B (en) | | Method and device for extracting information from website |
| CN113705231A (en) | | Hot news discovery system and method |
| CN115017302A (en) | | Public opinion monitoring method and public opinion monitoring system |
| CN104598561A (en) | | Text-based intelligent agricultural video classification method and text-based intelligent agricultural video classification system |
| CN107145591A (en) | | Title-based webpage effective metadata content extraction method |
| Varlamis et al. | | An automatic wrapper generation process for large scale crawling of news websites |
| CN104978431B (en) | | Web data fusion method and device |
| Roul et al. | | An effective approach for web document classification using the concept of association analysis of data mining |
| CN1350245A (en) | | Personal homepage content safety monitoring method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C12 | Rejection of a patent application after its publication | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20120801 |