CN102622451A

CN102622451A - System for automatically generating television program labels

Info

Publication number: CN102622451A
Application number: CN2012101100313A
Authority: CN
Inventors: 朱其立; 蔡智源; 王拯
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2012-04-16
Filing date: 2012-04-16
Publication date: 2012-08-01

Abstract

The present invention provides a system for automatically generating TV program tags. The system comprises: a program information acquisition module that fetches web pages related to a program and prunes and filters them to obtain the main content describing the program; an information keyword extraction module that aggregates the main content and extracts keywords from it; a knowledge base module that establishes the network of relationships between entries, which is used to expand the extracted keywords; a keyword expansion module that uses the network provided by the knowledge base module to expand the keywords into a larger entry set; and a tag generation module that processes the associated entry sets of the keywords, filters out noise, calculates scores, and generates the tag set of the program. The invention fills a gap, since no system for automatically generating TV program tags existed before. Introducing the knowledge base means the system is not limited to the words found on web pages, which gives it better extensibility and a better ability to discover tags. Because the knowledge base can be built offline and the tag generation algorithm is simple, the system is also efficient.

Description

Automatic Generation System of TV Program Labels

Technical Field

The invention relates to a system in the field of computer application technology, and in particular to a system for automatically generating TV program tags.

Background

How to help people make better choices has long been a significant and interesting problem. People make choices on the basis of information: a choice results from combining the information collected with personal views and preferences. Obtaining information, however, is not simple. In the past, when the Internet was undeveloped and exchanging information was inconvenient, the scarcity of information and the difficulty of comparison were the obstacles to making a choice. In the information age, information can be obtained with a click of the mouse, but this brings a different problem: information overload. Faced with massive amounts of information, people spend a great deal of time merely identifying and screening it, which is again an obstacle to choice. Automatic tag generation systems emerged to address this problem. By extracting the main body of a piece of information, summarizing its content, and analyzing its keywords, such a system generates a tag set for that information. With the tag set, people can quickly grasp the gist of the information, and the tags also provide a basis for classifying it. Both help people make choices.

Automatic tag generation has been studied extensively, but the work focuses mainly on text processing, that is, on automatically generating tags for a given document. Jialie Shen [1] studied the automatic generation of music tags. The method extracts audio features, uses manually tagged music as training material to build a classifier by machine learning, and then uses the classifier to tag music. Stefan Siersdorfer [2] proposed a scheme for supplementing video tags that uses existing video comparison techniques to merge the tags already attached to similar videos; this, however, is not automatic tag generation in the true sense. At present, therefore, tagging video still relies mainly on manual work, and automatic tag generation for TV programs remains unexplored.

[1] Jialie Shen, Meng Wang, Shuicheng Yan, HweeHwa Pang, Xiansheng Hua. Effective Music Tagging through Advanced Statistical Modeling. SIGIR 2010.

[2] Stefan Siersdorfer, Jose San Pedro, Mark Sanderson. Automatic Video Tagging using Content Redundancy. SIGIR 2009.

Summary of the Invention

To address the above deficiencies of the prior art, the present invention provides a system for automatically generating TV program tags. The system only needs to be given the name of a TV program. It then automatically obtains information related to the program from the Internet, summarizes and expands that information, and returns a tag set related to the program.

The present invention is achieved through the following technical solution.

A system for automatically generating TV program tags comprises a program information acquisition module, an information keyword extraction module, a keyword expansion module and a tag generation module connected in sequence, and further comprises a knowledge base module connected to the keyword expansion module, wherein:

- the program information acquisition module fetches pages related to the program from the Internet and obtains the main content describing the program by pruning and filtering the pages;

- the information keyword extraction module aggregates the main content obtained by the program information acquisition module and extracts keywords from it;

- the knowledge base module establishes the network of relationships between entries, which is used to expand the extracted keywords;

- the keyword expansion module uses the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger entry set;

- the tag generation module processes the associated entry sets of all the keywords, filters out noise, calculates scores, and finally generates the tag set of the program.

The program information acquisition module includes an HTML parser. It receives the set of target TV programs for which tags are to be generated and, with the aid of a search engine, obtains web pages for each program. The HTML parser processes these pages to obtain their main content, which is passed to the information keyword extraction module for further processing.

The information keyword extraction module includes a word segmenter and part-of-speech tagger. After receiving the main content describing each program, it segments the content with the word segmenter and part-of-speech tagger and keeps only the nouns.

Keywords are identified among the nouns by a statistical method.

The statistical method comprises the following steps:

Step 1: for a given program, the words are divided into two groups, one from the web pages related to that program and one from the other web pages in the program set;

Step 2: word frequencies are calculated for both groups, together with their means and standard deviations, so that each word is characterized by four statistics: the mean and standard deviation of its frequency on pages related to the program, and the mean and standard deviation of its frequency on pages unrelated to the program;

Step 3: from the relationship among the four statistics, the keywords that best represent the program are identified.
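The four statistics of Step 2 can be illustrated with a minimal Python sketch. The function name `word_stats`, the use of raw per-page counts as the word frequency, and the use of the population standard deviation are assumptions made for illustration; the text does not specify these details.

```python
from collections import Counter
from statistics import mean, pstdev


def word_stats(word, target_docs, other_docs):
    """Return (mean_target, std_target, mean_other, std_other) for one word.

    Each document is a list of noun tokens. The word frequency is taken to be
    the raw count of the word in a document (an assumption), and the standard
    deviation is the population standard deviation (also an assumption).
    Both document lists must be non-empty.
    """
    def counts(docs):
        return [Counter(doc)[word] for doc in docs]

    target = counts(target_docs)   # pages related to the program
    other = counts(other_docs)     # pages of the other programs in the set
    return (mean(target), pstdev(target), mean(other), pstdev(other))
```

A word that is frequent and evenly spread over the target pages but rare elsewhere yields a high target mean, a low target deviation and a low mean on other pages, which is the pattern Step 3 looks for.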

The knowledge base module uses Baidu Encyclopedia as its data source and stores it as a graph.

Baidu Encyclopedia is organized, and the knowledge base is built from it, as follows:

Step 1: every entry has a page describing it; besides plain text, the page also references other entries that already exist in Baidu Encyclopedia;

Step 2: in the knowledge base graph, there is a directed edge between each described entry and each entry it references, and the PageRank algorithm is applied to this graph to obtain the importance of each entry;

Step 3: the entry weights and the reference relationships between entries together constitute the knowledge base.
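As a rough illustration of Step 2, the sketch below runs a plain power-iteration PageRank over a small reference graph. It assumes that each edge points from the describing entry to the referenced entry; the damping factor of 0.85, the fixed iteration count and the uniform handling of entries without outgoing references are conventional choices, not values given in the text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute entry importance by power iteration.

    links maps each entry to the list of entries its page references
    (assumed edge direction: describing entry -> referenced entry).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by entries with no outgoing references is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Because the graph depends only on the encyclopedia and not on any particular program, this computation can be done once, offline, and the resulting importance values stored with the graph.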

For each keyword obtained by the information keyword extraction module, the keyword expansion module finds the other entries in the knowledge base graph that are connected to the keyword by a path, and calculates a weight for each such entry from the importance of the entry itself and its distance from the keyword.
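One possible reading of this step is sketched below as a breadth-first search from the keyword. The text states only that the weight depends on the entry's importance and its distance from the keyword; the formula `importance * decay ** distance`, the depth limit `max_depth` and the name `expand_keyword` are hypothetical choices made for illustration.

```python
from collections import deque


def expand_keyword(keyword, links, importance, max_depth=2, decay=0.5):
    """Return {entry: weight} for entries reachable from keyword.

    links maps an entry to the entries it references. The weight formula
    importance * decay ** distance is an assumption, not from the source.
    """
    distance = {keyword: 0}
    queue = deque([keyword])
    while queue:
        current = queue.popleft()
        if distance[current] == max_depth:
            continue  # do not walk further than max_depth edges
        for nxt in links.get(current, []):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return {
        entry: importance.get(entry, 0.0) * decay ** d
        for entry, d in distance.items()
        if entry != keyword
    }
```

Breadth-first search guarantees that each entry is assigned its shortest distance from the keyword, so nearer entries are never penalized for also being reachable by a longer path.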

The tag generation module merges the associated entries of all the keywords. When an entry is associated with several keywords, its weights under those keywords are added together. All entries are then sorted by total weight and the top few are returned as required, giving the tag set that describes the program.
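The merge-and-rank step can be sketched in a few lines of Python. The function name `generate_tags` and the alphabetical tie-breaking are illustrative assumptions; the default of 20 tags follows the embodiment described later in the document.

```python
from collections import defaultdict


def generate_tags(expansions, top_n=20):
    """Merge per-keyword entry weights and return the top_n entries.

    expansions maps each keyword to a dict {entry: weight}. Weights of an
    entry associated with several keywords are summed. Ties are broken
    alphabetically (an assumption) so the result is deterministic.
    """
    totals = defaultdict(float)
    for weights in expansions.values():
        for entry, weight in weights.items():
            totals[entry] += weight
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [entry for entry, _ in ranked[:top_n]]
```

Summing the weights favors entries that are related to many of the program's keywords at once, which is one way the step can suppress entries that are only incidentally linked to a single keyword.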

In operation, the system is first given the set of target TV programs for which tags are to be generated. With the aid of a search engine, the program information acquisition module obtains a certain number of web pages for each program. The HTML parser in the module processes these pages to obtain their main content, which is passed to the information keyword extraction module for further processing.

After receiving the main content describing each program, the information keyword extraction module segments the content with its word segmenter and part-of-speech tagger and keeps only the nouns. Keywords are then identified among these words by a statistical method, as follows. For a given program, the words are divided into two groups: one from the web pages related to that program, and one from the other web pages in the program set. Word frequencies are calculated for both groups, together with their means and standard deviations. Each word is thus characterized by four statistics: the mean and standard deviation of its frequency on pages related to the program, and the mean and standard deviation of its frequency on pages unrelated to the program. From the relationship among these four statistics, the keywords that best represent the program can be identified.

Keywords extracted from web pages already reflect the characteristics of the program to some extent, but their range is limited: they must appear on the web pages. To overcome this limitation, an important aspect of the invention is the knowledge base module. The knowledge base module uses Baidu Encyclopedia as its data source and stores it as a graph. Baidu Encyclopedia is organized so that every entry has a page describing it; besides plain text, the page also references other entries that already exist in Baidu Encyclopedia. In the knowledge base graph, there is a directed edge between each described entry and each entry it references. Applying the PageRank algorithm to this graph gives the importance of each entry. The entry weights and the reference relationships between entries together constitute the knowledge base.

The task of the keyword expansion module is therefore simple. For each keyword obtained by the information keyword extraction module, it finds the other entries in the knowledge base graph that are connected to the keyword by a path, and calculates a weight for each such entry from the importance of the entry itself and its distance from the keyword.

The tag generation module is the last stage of the system. The information keyword extraction module yields a keyword set that reflects the characteristics of the program, and the keyword expansion module yields, for each keyword, a set of associated entries, each with a weight. The tag generation module combines these two results by merging the associated entries of all the keywords. When an entry is associated with several keywords, its weights under those keywords are added together. All entries are then sorted by total weight and the top few are returned as required, giving the tag set that describes the program.

Compared with the prior art, the present invention fills a gap, since no system for automatically generating TV program tags existed before. Introducing the knowledge base means the system is not limited to the words found on web pages, which gives it better extensibility and a better ability to discover tags. Because the knowledge base can be built offline and the tag generation algorithm is simple, the system is also efficient.

Brief Description of the Drawings

Fig. 1 is a block diagram of the system modules of the present invention;

Fig. 2 shows implementation details of the program information acquisition module of the present invention;

Fig. 3 shows how the entry list is generated in the information keyword extraction module of the present invention;

Fig. 4 shows how the keywords are generated in the information keyword extraction module of the present invention.

Detailed Description

An embodiment of the present invention is described in detail below with reference to the drawings. The embodiment is implemented on the basis of the technical solution of the invention, and a detailed implementation and specific operating procedure are given, but the scope of protection of the invention is not limited to the embodiment described below.

The task of this embodiment is to automatically generate tags for a group of ten TV programs: program 1, program 2, program 3, program 4, program 5, program 6, program 7, program 8, program 9 and program 10.

As shown in Fig. 1, this embodiment comprises five modules: a program information acquisition module, an information keyword extraction module, a knowledge base module, a keyword expansion module and a tag generation module. The program information acquisition module, the information keyword extraction module, the keyword expansion module and the tag generation module are connected in sequence, and the knowledge base module is connected to the keyword expansion module. The program information acquisition module fetches pages related to the 10 programs from the Internet and obtains the main content describing each program by pruning and filtering the pages. The information keyword extraction module aggregates the main content obtained by the program information acquisition module and extracts keywords from it. The knowledge base module establishes the network of relationships between entries, which is used to expand the extracted keywords. The keyword expansion module uses the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger entry set. The tag generation module processes the entry set, filters out noise, calculates scores, and finally generates the tag set of the program.

As shown in Fig. 2, the program information acquisition module includes an HTML parser. It receives the set of target TV programs for which tags are to be generated and, with the aid of a search engine, obtains web pages for each program. The HTML parser processes these pages to obtain their main content, which is passed to the information keyword extraction module for further processing. Specifically, the program information acquisition module uses a search engine to obtain 10 pages, that is, 10 HTML files, related to the target program. Removing useless markup such as advertisements, images, titles and scripts from these HTML files yields 10 documents describing the program.
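A minimal sketch of this pruning step, using Python's standard `html.parser` module, is shown below. The class and function names are illustrative, and the set of skipped elements (`script`, `style`, `title`) is an assumption; detecting advertisements reliably would need site-specific rules that the text does not describe.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping script, style and title content."""

    SKIP = {"script", "style", "title"}  # assumed set of "useless" elements

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the remaining text of an HTML page as one string."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Images carry no text, so they disappear from the output without special handling.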

As shown in Fig. 3, the information keyword extraction module includes a word segmenter and part-of-speech tagger. After receiving the main content describing each program, it segments the content with the word segmenter and part-of-speech tagger and keeps only the nouns. Specifically, the documents returned by the program information acquisition module first undergo word segmentation and part-of-speech tagging in the information keyword extraction module, and only the nouns are kept, so that each document is converted into a word set. The 10 documents for a program contain repeated words, so their words are entered into a hash table and the frequency of each word in each document is counted. The result for each program is an entry list in which each item is a data structure containing the text of the entry and its frequency in each of the 10 documents.
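The entry list described above might be built as in the following sketch. It assumes the documents have already been segmented and tagged into `(word, pos)` pairs by some external tool, and that noun tags begin with `"n"`, a convention used by common Chinese taggers but not stated in the text. The name `build_term_list` is illustrative.

```python
from collections import Counter


def build_term_list(tagged_docs, noun_prefix="n"):
    """Map each noun to its count in every document of one program.

    tagged_docs is a list of documents, each a list of (word, pos) pairs
    produced by an external segmenter and tagger (assumed input format).
    """
    # One hash-based counter per document, restricted to nouns.
    counts = [
        Counter(word for word, pos in doc if pos.startswith(noun_prefix))
        for doc in tagged_docs
    ]
    vocabulary = set().union(*counts) if counts else set()
    return {word: [c[word] for c in counts] for word in sorted(vocabulary)}
```

Each value is a list with one count per document, which is the per-document frequency record that the later statistical step needs.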

It should be noted that keywords are identified among the nouns by a statistical method.

The statistical method comprises the following steps:

As shown in Fig. 4, the entry list is processed further to obtain the final keyword list. For each word in the entry list of the target program, four statistics are calculated: the mean and standard deviation of the word's frequency in the target program, and the mean and standard deviation of its frequency in the other programs. Once the four statistics are available, the words are first classified by the following rules:

Class 1: the mean and the standard deviation of the word frequency in the other programs are both 0;

Class 2: neither the mean nor the standard deviation of the word frequency in the other programs is 0, the mean in the target program is greater than the mean in the other programs, and the standard deviation in the target program is smaller than the standard deviation in the other programs;

Class 3: all cases not covered by class 1 or class 2.

A score is then calculated for each class by the following rules:

Class 1: the mean in the target program divided by the standard deviation in the target program;

Class 2: the mean in the target program multiplied by the standard deviation in the other programs, divided by the standard deviation in the target program, and then divided by the mean in the other programs;

Class 3: the score is set to 0.

The words are then sorted: class 1 ranks above class 2, and class 2 ranks above class 3; within a class, words are sorted by score. Finally, the top 20 words are output as the keyword list.
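The classification, scoring and ranking rules can be sketched as follows, given the four statistics of each word as a tuple `(mean_target, std_target, mean_other, std_other)`. The text does not say what happens when the standard deviation in the target program is 0; treating the score as infinite in that case, and breaking ties alphabetically, are assumptions of this sketch.

```python
def classify_and_score(mean_t, std_t, mean_o, std_o):
    """Return (class, score) for one word according to the three rules."""
    if mean_o == 0 and std_o == 0:
        # Class 1: the word never appears in the other programs.
        score = mean_t / std_t if std_t > 0 else float("inf")  # assumption
        return 1, score
    if mean_o != 0 and std_o != 0 and mean_t > mean_o and std_t < std_o:
        # Class 2: more frequent and more stable in the target program.
        score = (mean_t * std_o) / (std_t * mean_o) if std_t > 0 else float("inf")
        return 2, score
    return 3, 0.0  # Class 3: everything else


def top_keywords(stats, top_n=20):
    """Rank words by class first, then by descending score within a class."""
    scored = {word: classify_and_score(*s) for word, s in stats.items()}
    ranked = sorted(scored, key=lambda w: (scored[w][0], -scored[w][1], w))
    return ranked[:top_n]
```

In both scoring rules, a high and stable frequency in the target program raises the score, while in class 2 a high frequency in the other programs lowers it.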

The knowledge base module uses Baidu Encyclopedia as its data source and stores it as a graph.

It should be noted that Baidu Encyclopedia is organized, and the knowledge base is built from it, according to the steps described in the summary above.

The keyword expansion module produces, for each keyword in the keyword list, a set of associated entries, each with a weight. The tag generation module combines the two results by merging the associated entries of all the keywords. When an entry is associated with several keywords, its weights under those keywords are added together. All entries are sorted by total weight and the top 20 are returned, giving the tag set that describes the program.

Repeating the above process for each of the 10 programs in this example completes the task of automatically generating tags for these programs.

A specific embodiment of the present invention has been described above. It should be understood that the invention is not limited to this specific embodiment; those skilled in the art may make various changes or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A system for automatically generating TV program tags, characterized in that it comprises a program information acquisition module, an information keyword extraction module, a keyword expansion module and a tag generation module connected in sequence, and further comprises a knowledge base module connected to the keyword expansion module, wherein:

- the program information acquisition module is configured to fetch pages related to a program from the network and to obtain the main content describing the program information by pruning and filtering the pages;

- the information keyword extraction module is configured to aggregate the main content obtained by the program information acquisition module and to extract keywords from the main content;

- the knowledge base module is configured to establish the network of relationships between entries, for use in expanding the obtained keywords;

- the keyword expansion module is configured to use the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger entry set;

- the tag generation module is configured to process the associated entry sets of all the obtained keywords, filter out noise, calculate scores, and finally generate the tag set of the program.

2. The system for automatically generating TV program tags according to claim 1, characterized in that the program information acquisition module comprises an HTML parser, receives the set of target TV programs for which tags are to be generated, and, with the aid of a search engine, obtains web pages for each program; the pages are processed by the HTML parser to obtain the main content, and the main content is passed to the information keyword extraction module for further processing.

3. The system for automatically generating TV program tags according to claim 1, characterized in that the information keyword extraction module comprises a word segmenter and part-of-speech tagger; after the main content describing each program is obtained, the content is segmented by the word segmenter and part-of-speech tagger, and only the nouns are kept.

4. The system for automatically generating TV program tags according to claim 1, characterized in that keywords are identified among the nouns by a statistical method.

5. The system for automatically generating TV program tags according to claim 4, characterized in that the statistical method comprises the following steps:

Step 1: for a given program, the words are divided into two groups, one from the web pages related to that program and one from the other web pages in the program set;

Step 2: word frequencies are calculated for the two groups of words, together with their means and standard deviations, so that each word is characterized by four statistics, namely the mean and standard deviation of its frequency on pages related to the program, and the mean and standard deviation of its frequency on pages unrelated to the program;

Step 3: from the relationship among the four statistics, the keywords that best represent the characteristics of the program are identified.

6. The system for automatically generating TV program tags according to claim 1, characterized in that the knowledge base module uses Baidu Encyclopedia as its data source and stores it in the form of a graph.

7. The system for automatically generating TV program tags according to claim 6, characterized in that the organization of Baidu Encyclopedia comprises the following steps:

Step 1: every entry has a page describing that entry, and besides plain text the page also references other entries that already exist in Baidu Encyclopedia;

Step 2: in the graph of the knowledge base, there is a directed edge between each described entry and each entry it references, and the PageRank algorithm is applied to this graph to obtain the importance of each entry;

Step 3: the weights of the entries and the reference relationships between entries constitute the whole knowledge base.

8. The system for automatically generating TV program tags according to claim 1, characterized in that, for each keyword obtained by the information keyword extraction module, the keyword expansion module finds in the graph of the knowledge base module the other entries connected to the keyword by a path, and calculates the weight of each entry from the importance of the entry itself and the distance between the entry and the keyword.

9. The system for automatically generating TV program tags according to claim 1, characterized in that the tag generation module merges the associated entries of all the obtained keywords; when an entry is associated with several keywords, the weights of the entry under those keywords are added together; all entries are sorted by total weight and the top few are returned as required, thereby obtaining the tag set describing the characteristics of the program.