CN114911944A - A method for building a knowledge graph of multi-source media resources based on the TextRank algorithm combined with a convolutional neural network model
- Publication number
- CN114911944A (application CN202111614178.1A)
- Authority
- CN
- China
- Prior art keywords
- entity
- knowledge graph
- information
- data
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Animal Behavior & Ethology (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to knowledge graph technology for media assets.
Background
In 2012, Google introduced the knowledge graph to improve its search engine. Knowledge graph search differs from traditional search: a traditional search engine matches keywords, whereas a knowledge graph relies on the relationships between entities, which refines the search method and improves search quality.
With the arrival of the big data era, the television business has entered a stage in which growth comes from the existing user base, which calls for more specialized and fine-grained operations. Unified media asset management aggregates, integrates, processes and outputs media asset information. Mining users' viewing behavior in depth from user behavior data and media asset data can improve the perceived service experience and increase user activity and stickiness. In existing media asset management systems, however, media asset information is often severely incomplete or wrong, and manual calibration and matching can no longer meet the pressing needs of precision marketing.
The patent "A media asset merging method and device" (CN202010128799.8) merges media assets by name similarity and topic similarity, but its matching performance is poor, it has no automatic learning capability, its analysis relies mainly on manual work, its degree of automation is low, and its results are presented in a conventional way.
The patent "A media asset data integration method and system" (CN201610777461.9) matches pending media assets against standard media assets by comparing fields one by one. It lacks associations between media asset entities and attributes, the media asset data are only weakly linked, and the resulting media asset library has no cognitive capability.
Therefore, a method is needed that aligns media asset entities efficiently and accurately in order to build a media asset knowledge graph.
Summary of the Invention
This summary is provided to introduce, in simplified form, a selection of concepts that are further described in the Detailed Description below. It is not intended to identify key or essential features of the claimed subject matter, nor to be used as an aid in determining the scope of the claimed subject matter.
In the present invention, media asset information fusion replaces traditional entity alignment methods with an approach that combines manual intervention and machine learning. First, the TextRank keyword extraction algorithm is introduced to obtain keywords closely related to the entity's content as media asset tags; these supplement both the semantics used for entity alignment and the extracted attribute values. Next, entities are given a detailed secondary classification according to their attributes, which narrows the scope of entity alignment, intervenes in the matching results ahead of time, and reduces the complexity of the alignment algorithm. Finally, a neural network learns the deep semantic relationships between entities and attributes and automatically learns the features of different types of entities and attributes. An attention mechanism weights the local feature vectors so that, while aligning input and output, the model can also exploit more of the contextual information in the original data, which allows entity alignment accuracy across data sources to keep improving.
According to one embodiment of the invention, a method for constructing a knowledge graph for multi-source media asset data is disclosed, comprising: extracting media asset information from multi-source data to form entity-centered knowledge graph triples, wherein extracting the media asset information includes entity extraction, relation extraction and entity attribute extraction, the multi-source data comes from web pages containing semi-structured data and/or web pages containing unstructured data, and the triples have the form {entity, relation, attribute}; performing information fusion on the media asset information to form updated triples, wherein the information fusion includes entity alignment and attribute unification, and the entity alignment uses an entity matching model based on a convolutional neural network with an attention mechanism; and constructing the knowledge graph based on the updated triples.
According to another embodiment of the invention, a system for constructing a knowledge graph for multi-source media asset data is disclosed, comprising: an information extraction module configured to extract media asset information from multi-source data to form entity-centered knowledge graph triples, wherein extracting the media asset information includes entity extraction, relation extraction and entity attribute extraction, the multi-source data comes from web pages containing semi-structured data and/or web pages containing unstructured data, and the triples have the form {entity, relation, attribute}; an information fusion module configured to perform information fusion on the media asset information to form updated triples, wherein the information fusion includes entity alignment and attribute unification, and the entity alignment uses an entity matching model based on a convolutional neural network with an attention mechanism; and a knowledge graph construction module configured to construct the knowledge graph based on the updated triples.
According to yet another embodiment of the invention, a computing device for constructing a knowledge graph for multi-source media asset data is disclosed, comprising: a processor; and a memory storing instructions that, when executed by the processor, perform the method described above.
These and other features and advantages will become apparent from reading the following detailed description with reference to the associated drawings. It should be understood that both the foregoing general description and the following detailed description are illustrative only and do not limit the claimed aspects.
Brief Description of the Drawings
So that the above features of the invention can be understood in detail, the content briefly summarized above is described more specifically with reference to embodiments, some aspects of which are illustrated in the drawings. Note, however, that the drawings illustrate only certain typical aspects of the invention and should not be considered limiting of its scope, since the description may admit other equally effective aspects.
Figure 1 shows a block diagram of a system 100 for constructing a knowledge graph for multi-source media asset data according to an embodiment of the invention;
Figure 2 shows a schematic diagram of an entity matching model 200 based on a convolutional neural network with an attention mechanism according to an embodiment of the invention;
Figure 3 shows a flowchart of a method 300 for constructing a knowledge graph for multi-source media asset data according to an embodiment of the invention; and
Figure 4 shows a block diagram of an exemplary computing device 400 according to an embodiment of the invention.
Detailed Description
The invention is described in detail below with reference to the drawings; its features will become more apparent from the following description.
The terms used in the invention are explained below, including their ordinary meanings as known to those skilled in the art:
Knowledge graph: a graph-based data structure in which each node represents an "entity" (a specific person, event or object, or an abstract concept) and each edge represents a "relation" between entities. Both entities and relations can have their own "attributes". Entities, relations and attributes are therefore the three constituent elements of a knowledge graph.
TextRank algorithm: a graph-based ranking algorithm for keyword extraction and document summarization. It extracts keywords using only the co-occurrence information (semantics) between words within a single document. From a given text it can extract keywords and key phrases, and it can extract key sentences using extractive automatic summarization.
Neo4j: a high-performance NoSQL graph database that stores structured data in a network (graph) rather than in tables. Under the hood, Neo4j stores user-defined nodes and relationships as a graph, which makes it efficient to start from a node and follow the relationships between nodes to find the connection between two nodes. Neo4j queries graph data with Cypher, a declarative graph query language.
D3: short for Data-Driven Documents, used for data visualization.
At present, manually matching the media assets of 700,000 programs generally takes 5 working days. To associate media asset content automatically, improve the precision of search and recommendation, and supply the television industry's media asset management with usable media asset data, the invention uses online media asset information and knowledge graph technology to build a knowledge graph for multi-source media asset data.
Because media asset data comes from multiple sources, entity and attribute information is inconsistent and redundant. Entity alignment in current knowledge graphs uses general-purpose alignment methods that do not capture the semantic associations between media asset entities and attributes well, so alignment accuracy across multiple data sources is low. When building a knowledge graph for multi-source media asset data, the invention introduces the TextRank keyword extraction algorithm, secondary classification by entity attributes, a convolutional neural network and an attention mechanism into entity alignment, which captures the semantic associations between entities and attributes better and improves the accuracy of media asset entity alignment.
Figure 1 shows a block diagram of a system 100 for constructing a knowledge graph for multi-source media asset data according to an embodiment of the invention. As shown in Figure 1, the system 100 is divided into modules, which communicate and exchange data in ways known in the art. Each module can be implemented in software, hardware, or a combination of the two. The system 100 includes an information extraction module 101, an information fusion module 102, and a knowledge graph construction module 103.
In general, building a media asset knowledge graph mainly involves selecting suitable multi-source data, extracting media asset information from it, processing the extracted information, building the knowledge graph from the processed information, and publishing and presenting it in practical applications (for example, search or display).
Specifically, referring to Figure 1, the information extraction module 101 extracts media asset information from different data sources, mainly through entity extraction, relation extraction and attribute extraction. The information fusion module 102 fuses the media asset information extracted by the information extraction module 101. The knowledge graph construction module 103 builds the knowledge graph from the fused media asset information.
According to an embodiment of the invention, the information extraction module 101 is configured to extract media asset information from the multi-source data 104 (for example, by entity extraction, relation extraction and attribute extraction) to form entity-centered knowledge graph triples.
As those skilled in the art know, media asset information mainly comprises information about media resources, such as TV series, movies, variety shows, animation, children's programming, documentaries, music, education, maternity and infant content, sports, actors and celebrities. Taking into account the scale and quality of media asset data and how hard it is to obtain, the multi-source data 104 of the invention comes from two different kinds of data source: sources consisting mainly of unstructured data and sources consisting mainly of semi-structured data. For example, the unstructured data source may be Baidu Baike (Baidu Encyclopedia), an open, free online platform that covers almost every field of industry knowledge; its descriptions of media asset information are relatively rich and complete and are updated in real time. Because it is edited by hand, however, the structure of its information is inconsistent, and quality varies considerably, especially for film and television media assets. To improve the quality of film and television media asset information, a semi-structured data source such as Douban Movie is therefore also selected as a supplement. Douban Movie is a film and television review site that carries up-to-date film and television information and has a better data structure. These websites are given only as examples; those skilled in the art can use other data sources to obtain media asset information, and it is also possible to use only one kind of data source.
According to an embodiment of the invention, extracting media asset information from the multi-source data 104 mainly includes: (1) writing a crawler in Python to obtain the data sources for the media asset knowledge graph; (2) obtaining media asset tags and structured information by parsing web pages; and (3) extracting the triples used to build the knowledge graph from the structured information obtained.
According to an embodiment of the invention, continuing the Baidu Baike and Douban Movie example, (1) writing a crawler in Python to obtain the data sources for the media asset knowledge graph includes: 1.1) crawling web page entries (for example, Baidu Baike entries) in a targeted way according to basic conditions. Specifically, web page content can be searched by category feature words, that is, words that distinguish the various categories of media asset data (for example, ten categories). For TV series media assets, the category feature words are "剧集" (series), "分集" (episode) and "电视剧" (TV drama); for education media assets, they are "教育" (education), "学前" (preschool), "小学" (primary school), "中学" (secondary school), and so on; 1.2) crawling Douban Movie web page data by film and television category, including movies, TV series, variety shows, animation and documentaries.
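The category feature word idea in step 1.1 can be illustrated with a small filter. This is only a sketch under stated assumptions: the dictionary layout, the category keys `tv_drama` and `education`, and the function name `match_categories` are hypothetical; only the feature words themselves come from the text, and no crawling is performed.

```python
# Hypothetical layout: category key -> category feature words.
# Only the word lists are taken from the text; the keys are assumptions.
CATEGORY_FEATURE_WORDS = {
    "tv_drama": ["剧集", "分集", "电视剧"],
    "education": ["教育", "学前", "小学", "中学"],
}


def match_categories(text, feature_words=CATEGORY_FEATURE_WORDS):
    """Return the categories whose feature words occur in the page text."""
    return [
        category
        for category, words in feature_words.items()
        if any(word in text for word in words)
    ]


sample = "该电视剧共40集,分集剧情如下"
matched = match_categories(sample)
```

A crawler could call such a filter on each fetched entry and keep only pages that match at least one category.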
According to an embodiment of the invention, (2) obtaining media asset tags and structured information by parsing web pages may further include: 2.1) extracting keywords from the media asset knowledge graph data sources to build a media asset tag library; and 2.2) extracting structured information from the media asset knowledge graph data sources.
According to an embodiment of the invention, 2.1) extracting keywords to build the media asset tag library may further include applying the TextRank algorithm to the text of a web page (for example, the introduction of a Baidu Baike entry). The keyword extraction may include the following steps:
① Segment the text extracted from the web page with the jieba word segmentation tool, tag parts of speech at the same time, filter out stop words, and keep only nouns, verbs and adjectives.
② Build a candidate keyword graph G = (V, E) from the segmented words, where V consists of the candidate keywords produced in ①, and construct edges between nodes from co-occurrence: an edge exists between two nodes only if the corresponding words co-occur within a window of length K, where K is the window size, that is, at most K words co-occur.
③ Propagate the node weights iteratively until convergence.
④ Sort the nodes by weight in descending order to obtain the T most important words as candidate keywords.
⑤ Mark the T most important words from ④ in the original text; where they form adjacent phrases, combine them into multi-word keywords, which serve as the media asset tags.
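Steps ② to ⑤ above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: it assumes step ① has already been done (the input is a list of segmented, filtered words, so jieba is not called), it uses the common unweighted TextRank update with a damping factor of 0.85, and the names `textrank_keywords` and `merge_adjacent` are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(words, window=5, top_t=3, damping=0.85,
                      max_iter=100, tol=1e-6):
    """Rank candidate words by TextRank over a co-occurrence graph."""
    # Step 2: undirected edge between words co-occurring within `window`.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    if not neighbors:
        return []
    # Step 3: iterate the PageRank-style update until the scores settle.
    score = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_score = {
            w: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
        delta = max(abs(new_score[w] - score[w]) for w in neighbors)
        score = new_score
        if delta < tol:
            break
    # Step 4: descending order by weight, keep the top T words.
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return ranked[:top_t]


def merge_adjacent(words, keywords, sep=""):
    """Step 5: join keywords that are adjacent in the original sequence."""
    keyset, phrases, current = set(keywords), [], []
    for w in list(words) + [None]:  # None flushes the last run
        if w in keyset:
            current.append(w)
            continue
        if current:
            phrase = sep.join(current)
            if phrase not in phrases:
                phrases.append(phrase)
        current = []
    return phrases
```

For Chinese text the default `sep=""` concatenates adjacent keywords directly; a space separator suits whitespace-delimited languages.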
According to an embodiment of the invention, 2.2) extracting structured information from the media asset knowledge graph data sources may further include capturing the entry name of a web page, that is, the entity name, according to its formatting characteristics. For example, the entity name can be captured from the title format of the fetched page (the title is in the h1 child tag under the class lemmaWgt-lemmaTitle-title).
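As an illustration of capturing the entity name from the title markup, the sketch below uses Python's standard `html.parser`. It assumes the page contains an element with class `lemmaWgt-lemmaTitle-title` followed by an `h1` holding the entry name, as the text describes; the class `EntityNameParser`, the helper `extract_entity_name` and the sample HTML are hypothetical, and real pages may differ.

```python
from html.parser import HTMLParser


class EntityNameParser(HTMLParser):
    """Capture the text of the first h1 at or after the title container."""

    def __init__(self, title_class="lemmaWgt-lemmaTitle-title"):
        super().__init__()
        self.title_class = title_class
        self.seen_container = False
        self.in_h1 = False
        self.name = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.title_class in classes:
            self.seen_container = True
        if tag == "h1" and self.seen_container and self.name is None:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.name = "".join(self._parts).strip()

    def handle_data(self, data):
        if self.in_h1:
            self._parts.append(data)


def extract_entity_name(html):
    """Return the entity name, or None if no matching h1 is found."""
    parser = EntityNameParser()
    parser.feed(html)
    return parser.name


# Hypothetical page fragment used only for illustration.
page = ('<dd class="lemmaWgt-lemmaTitle-title">'
        '<h1>流浪地球</h1><h2>(2019年电影)</h2></dd>')
entity_name = extract_entity_name(page)
```

The parser deliberately ignores nesting depth, so it is a simplification: it takes the first `h1` seen once the container class has appeared.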
In general, entities in the media asset domain fall into audiovisual entities (for example, video and audio), non-audiovisual entities (for example, books and exhibitions) and person entities. The entity name is the name of the entity, such as the title of a TV series, a movie or a book. An entity's attributes are the information that describes it. For example, the attributes of an audiovisual entity mainly include Chinese name, alias, genre, production company, premiere date, director and lead actors. The attributes of a non-audiovisual entity mainly include Chinese name, alias, venue, time and organizer. The attributes of a person entity mainly include Chinese name, alias, nationality, ethnicity, blood type, height, zodiac sign, representative works, main achievements and occupation.
For example, a Baidu Baike entry consists mainly of video, images, text and tables, and the invention mainly extracts the tabular data in the page. The correspondence between field names and field values in the table maps directly onto the entities, relations and attributes of the knowledge graph. Relations between entities are mainly the "acts in / directs / writes / hosts / participates in" relations between person entities and audiovisual or non-audiovisual entities. Douban Movie pages are parsed in a similar way. For example, the source code holding the entity (for example, film) information can be located in the page source, and the entry name and the corresponding field names and field values can be parsed according to the tag format of the title and the entity information. For instance, they can be parsed by identifying "$$" in the page source.
According to an embodiment of the invention, (3) extracting the triples used to build the knowledge graph from the structured information further includes building entity-centered triples of the form {entity, relation, attribute} from the entities, relations and attributes extracted in (2).
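A simple way to picture step (3) is to turn the parsed field names and field values of one entry into entity-centered triples. The mapping used here (field name in the relation slot, each field value in the attribute slot, multi-valued fields split on "、") is an assumption made for illustration; the text does not spell out the exact mapping, and `build_triples` is a hypothetical helper.

```python
def build_triples(entity, fields, separator="、"):
    """Build (entity, relation, attribute) triples from parsed table fields.

    `fields` maps a field name to its raw value string; multi-valued
    fields are split on `separator` (an assumed convention).
    """
    triples = []
    for field_name, raw_value in fields.items():
        for value in raw_value.split(separator):
            value = value.strip()
            if value:
                triples.append((entity, field_name, value))
    return triples


# Placeholder values only; not data from any real entry.
example = build_triples("E", {"director": "A", "starring": "B、C"})
```

Keeping one triple per value makes later attribute unification and entity alignment operate on atomic values rather than on concatenated strings.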
According to an embodiment of the invention, the information fusion module 102 is configured to fuse the media asset information based on the knowledge graph triples formed by the information extraction module 101.
According to an embodiment of the invention, fusing the media asset information further includes: (1) aligning media asset entities; (2) unifying the attribute vocabulary across different, heterogeneous data sources; and (3) adding entity keywords (that is, media asset tags) to the attribute values as a supplement.
According to an embodiment of the invention, because the same entity extracted from different data sources may have different names, entities and their corresponding attributes need to be compared for similarity. Accordingly, (1) media asset entity alignment may include entity classification and entity matching.
According to an embodiment of the invention, in entity classification, entities are given a secondary classification according to the entity attributes extracted by the information extraction module 101, which effectively narrows the range of entity matching. In the above example of the invention, the secondary classification is carried out as follows, with the resulting categories as follows:
The primary and secondary categories above are only exemplary; those skilled in the art can define different multi-level classifications for different media assets.
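How secondary classification narrows the matching range can be sketched as a blocking step: only entities from different sources that share the same (primary, secondary) category are paired for matching. The record layout, the category values and the function name `candidate_pairs` are assumptions for illustration; the patent's own category table is not reproduced in this text.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(entities):
    """Pair entities from different sources within the same category block.

    Each entity is a dict with 'name', 'source' and 'category', where
    'category' is a (primary, secondary) tuple (assumed layout).
    """
    blocks = defaultdict(list)
    for entity in entities:
        blocks[entity["category"]].append(entity)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if a["source"] != b["source"]:
                pairs.append((a["name"], b["name"]))
    return pairs


# Hypothetical records and category labels.
records = [
    {"name": "流浪地球", "source": "baike", "category": ("film", "sci-fi")},
    {"name": "The Wandering Earth", "source": "douban", "category": ("film", "sci-fi")},
    {"name": "某电视剧", "source": "baike", "category": ("tv", "drama")},
]
pairs = candidate_pairs(records)
```

Without blocking, n entities give on the order of n² comparisons; with blocking, the matching model only scores pairs inside each category.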
According to an embodiment of the invention, in entity matching, for the entities under each category, an entity matching model based on a convolutional neural network with an attention mechanism performs semantic matching on entities, entity attributes and entity keywords (that is, the media asset keywords obtained by the information extraction module 101). Figure 2 shows a schematic diagram of the entity matching model 200 based on a convolutional neural network with an attention mechanism according to an embodiment of the invention. Referring to Figure 2, entity matching may include the following steps:
1) Build the corpus. A media asset entity matching training set is built from film and television entry names, attribute values and the extracted media asset tags. First, the attribute values of the entities to be matched from different sources are aligned by attribute name; then a word2Vec model vectorizes the entity names, entity attribute values and entity keywords in the training set to obtain word-vector representations of entities, attributes and keywords;
2) Perform entity matching with the entity matching model based on a convolutional neural network with an attention mechanism. The model consists of an input layer, a convolutional layer, an attention layer, a pooling layer and an output layer.
· Input layer. Entities, attribute values and keywords are preprocessed to obtain the sequence word-vector representation $x = (x_1, x_2, \dots, x_i, \dots, x_n)$, where $x_i \in \mathbb{R}^d$ is the i-th word vector in the sequence. The input word vectors are arranged from top to bottom in the order entity name, attributes, keywords. With each word vector of dimension k, the resulting matrix has dimension n × k; out-of-vocabulary words are filled with 0;
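The input layer described above can be sketched as stacking word vectors in the stated order and zero-filling out-of-vocabulary words. This is an illustrative sketch: the embedding lookup is a plain dictionary standing in for a trained word2vec model, and `build_input_matrix` is a hypothetical name.

```python
def build_input_matrix(name_tokens, attribute_tokens, keyword_tokens,
                       embeddings, dim):
    """Stack word vectors row by row: entity name, attributes, keywords.

    `embeddings` maps a token to a list of `dim` floats (a stand-in for
    a trained word2vec model); unknown tokens become all-zero rows.
    """
    zero = [0.0] * dim
    sequence = list(name_tokens) + list(attribute_tokens) + list(keyword_tokens)
    return [list(embeddings.get(token, zero)) for token in sequence]


# Toy 2-dimensional embeddings, purely illustrative.
toy_embeddings = {"a": [1.0, 2.0], "b": [0.5, -1.0]}
matrix = build_input_matrix(["a"], ["unknown"], ["b"], toy_embeddings, dim=2)
```

The result has one row per token and one column per embedding dimension, matching the n × k layout in the text.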
· Convolutional layer. A convolution operation extracts the local features and positional features of the input matrix. The convolution formula is as follows:
where $w \in \mathbb{R}^{d \times k}$ is the filter, $b \in \mathbb{R}$ is the bias, $f$ is the nonlinear function ReLU, and $k$ is the convolution kernel size, that is, the number of words covered by the convolution in the vertical direction. The invention uses three kernel lengths, 3, 4 and 5. The computation yields the local semantic feature vector $h_i$ of each window, and the layer finally outputs three sequence local semantic feature matrices.
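The convolution formula itself does not survive in this text. A plausible reconstruction, assuming the standard text-CNN form and the symbols defined above ($w$, $b$, $f$, $k$, with $x_{i:i+k-1}$ denoting the window of $k$ consecutive word vectors starting at position $i$, a notation introduced here), is:

```latex
h_i = f\left(w \cdot x_{i:i+k-1} + b\right), \qquad i = 1, \dots, n - k + 1
```

Under this reading, each kernel length (3, 4 or 5) slides over the $n$ rows of the input matrix and produces one feature sequence $(h_1, \dots, h_{n-k+1})$.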
· Attention layer. An attention mechanism is applied to the sequence local semantic feature vector matrices produced by the three convolution kernels. The mean of the sequence local semantic feature vectors serves as auxiliary information:
The attention mechanism is then applied:
$m_j = v^{T} \tanh(W_s h_j + W_h h' + b)$
$c = \sum_{l} a_i h_j$
The attention mechanism above yields a series of feature information codes $c_i$, where $a_i$ is the attention weight of a local semantic feature vector, $W_s$ and $W_h$ are weight matrices, b is the bias, $h_j$ is a local semantic feature vector, and $m_j$ is the matching score between $s_j$ and $h'$.
· Pooling layer. Max-pooling is applied to the feature information codes c output by the attention layer. It discards features that are only weakly related to the topic and keeps the maximum of each sequence feature information code, which also avoids the problem of the two inputs containing different numbers of words. The pooling layer outputs the maximum of each feature information code, and concatenating these maxima gives the final semantic representation as a one-dimensional vector;
· Output layer. The output layer performs cosine similarity matching. The pooling layer supplies the final semantic representation vectors of the sequences to be matched, and the matching score l is computed as their cosine similarity according to the following formula. The threshold is set to 0.85: a score l greater than 0.85 counts as a successful match, meaning the two inputs are the same entity, and a score below 0.85 counts as a failed match, meaning they are two different entities.
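The formulas that the layer descriptions above point to (convolution, auxiliary information, cosine similarity) are not reproduced in this text. One reading that is consistent with the stated definitions is sketched below. It is an assumption, not the patent's exact notation: the softmax normalization of the scores, the window count $L$, and the names $u$ and $z$ for the two pooled representation vectors are introduced here for illustration, and the matching score (called l in the text) is written $\mathrm{score}$ to avoid a clash with the summation range.

```latex
\begin{align}
h_i &= f\left(w \cdot x_{i:i+k-1} + b\right)
  && \text{convolution over a window of } k \text{ words} \\
h' &= \frac{1}{L} \sum_{j=1}^{L} h_j
  && \text{mean of the local feature vectors (auxiliary information)} \\
m_j &= v^{T} \tanh\left(W_s h_j + W_h h' + b\right)
  && \text{matching score, as given in the text} \\
a_j &= \frac{\exp(m_j)}{\sum_{t=1}^{L} \exp(m_t)}
  && \text{attention weight (softmax, assumed)} \\
c &= \sum_{j=1}^{L} a_j h_j
  && \text{feature information code} \\
\mathrm{score} &= \frac{u \cdot z}{\lVert u \rVert \, \lVert z \rVert}
  && \text{cosine similarity of the pooled vectors}
\end{align}
```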
Of course, the threshold above is only an example, and other thresholds may be used for the matching as needed.
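The layer stack described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the weights are random rather than trained, the attention scores are normalized with a softmax, and pooling takes the maximum over window positions of the attention-weighted feature vectors (one possible reading of the pooling step). The names `conv_features`, `attention_weights`, `encode`, `random_params`, and `is_same_entity` are hypothetical.

```python
import math
import random

THRESHOLD = 0.85  # example threshold quoted in the text


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def conv_features(x, filters, bias, k):
    """Slide a window of k words over x (n rows of d floats) and apply ReLU."""
    d = len(x[0])
    x = x + [[0.0] * d] * max(0, k - len(x))  # zero-pad inputs shorter than k
    feats = []
    for i in range(len(x) - k + 1):
        window = [v for row in x[i:i + k] for v in row]
        feats.append([max(0.0, sum(w * v for w, v in zip(f, window)) + b)
                      for f, b in zip(filters, bias)])
    return feats


def attention_weights(h, W_s, W_h, b, v):
    """m_j = v^T tanh(W_s h_j + W_h h' + b), then softmax over j (assumed)."""
    dim = len(h[0])
    h_mean = [sum(hj[t] for hj in h) / len(h) for t in range(dim)]  # h'
    aux = matvec(W_h, h_mean)
    scores = []
    for hj in h:
        z = [math.tanh(s + a + c) for s, a, c in zip(matvec(W_s, hj), aux, b)]
        scores.append(sum(vi * zi for vi, zi in zip(v, z)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode(x, params):
    """Run conv, attention and max-pooling per kernel size, then concatenate."""
    out = []
    for k, p in params.items():
        h = conv_features(x, p["filters"], p["bias"], k)
        a = attention_weights(h, p["W_s"], p["W_h"], p["b"], p["v"])
        weighted = [[aj * t for t in hj] for aj, hj in zip(a, h)]
        out.extend(max(col) for col in zip(*weighted))  # max over positions
    return out


def cosine(u, z):
    dot = sum(a * b for a, b in zip(u, z))
    nu = math.sqrt(sum(a * a for a in u))
    nz = math.sqrt(sum(b * b for b in z))
    return dot / (nu * nz) if nu and nz else 0.0


def is_same_entity(u, z, threshold=THRESHOLD):
    """Two representations match when their cosine similarity exceeds the threshold."""
    return cosine(u, z) > threshold


def random_params(d, n_filters, rng, kernel_sizes=(3, 4, 5)):
    """Untrained placeholder weights, for illustration only."""
    def mat(r, c):
        return [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    return {k: {"filters": mat(n_filters, k * d), "bias": [0.0] * n_filters,
                "W_s": mat(n_filters, n_filters), "W_h": mat(n_filters, n_filters),
                "b": [0.0] * n_filters,
                "v": [rng.uniform(-1, 1) for _ in range(n_filters)]}
            for k in kernel_sizes}
```

With trained weights, two records would be encoded separately with `encode` and compared with `is_same_entity`. Because pooling removes the position axis, the output length depends only on the number of kernel sizes and filters, not on the number of input words.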
According to an embodiment of the present invention, in (2) unifying the attribute vocabulary of different, heterogeneous data sources, if a successfully matched media asset entity exists in the target data source, its attributes are updated; if matching fails, the media asset entity does not exist in the target data source and must be added to it. When attribute values are aligned and updated after a successful entity match, attribute values may conflict; in that case the most recently updated value is retained. Specifically, attribute unification updates the "attribute" value in the triple.
For example, if a video A appears in both Baidu Baike and Douban Movie, then after the entity alignment described above, the attributes of video A from the two sources can be unified and the attribute values in the triple for video A updated. That is, if the triple was built from a Baidu Baike entry, the attributes of that entry are updated with the data from Douban Movie.
According to an embodiment of the present invention, after entity alignment and attribute unification, the entity keywords (media asset tags) can be added to the attribute values as a supplement, further updating the triples for the entity. For example, the attribute values in a triple are further updated by incorporating the media asset tags.
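The update rules described above (update on a match, add on a miss, keep the most recently updated value on conflict, then add tags as a supplement) can be sketched as follows. The `Entity` record, the `(value, updated_at)` pairs, and the `fuse` function are hypothetical names for illustration; the text does not say how update times are tracked, so an integer timestamp is assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Entity:
    name: str
    # attribute name -> (value, updated_at); updated_at is an assumed timestamp
    attributes: Dict[str, Tuple[str, int]] = field(default_factory=dict)
    tags: Set[str] = field(default_factory=set)


def fuse(target: Dict[str, Entity], incoming: Entity,
         matched_name: Optional[str]) -> Entity:
    """Merge `incoming` into the target data source.

    `matched_name` is the name of the matched target entity,
    or None when entity matching failed.
    """
    if matched_name is None:
        # Matching failed: the entity is new to the target source, so add it.
        target[incoming.name] = incoming
        return incoming
    entity = target[matched_name]
    for attr, (value, updated_at) in incoming.attributes.items():
        current = entity.attributes.get(attr)
        # On conflict, keep whichever value was updated most recently.
        if current is None or updated_at > current[1]:
            entity.attributes[attr] = (value, updated_at)
    # Media asset tags supplement the attribute values.
    entity.tags |= incoming.tags
    return entity
```

Keeping the timestamp next to each value means the conflict rule needs no knowledge of which source a value came from.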
According to an embodiment of the present invention, the knowledge graph construction module 103 is configured to build the knowledge graph from the fused media asset information. Specifically, the module 103 is configured to use the graph database Neo4j to store and build the media asset knowledge graph from the final entity-centric triples. Those skilled in the art will understand how to build a knowledge graph with the graph database Neo4j, and this manner of construction does not fall within the scope of protection of the present invention.
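As one possible illustration of storing a triple in Neo4j, the sketch below turns an {entity, relation, attribute} triple into a parameterized Cypher `MERGE` statement. The node labels `Entity` and `Value`, the `name` property, and the function `triple_to_cypher` are assumptions made for this example; the text does not describe a schema. Relationship types cannot be passed as Cypher parameters, so the relation name is validated and inlined.

```python
import re

_REL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def triple_to_cypher(entity, relation, value):
    """Return (query, params) that idempotently stores one triple."""
    if not _REL_NAME.match(relation):
        # Only plain identifiers are inlined, to avoid query injection.
        raise ValueError(f"unsupported relation name: {relation!r}")
    rel_type = relation.upper()
    query = (
        "MERGE (e:Entity {name: $entity}) "
        "MERGE (v:Value {name: $value}) "
        f"MERGE (e)-[:{rel_type}]->(v)"
    )
    return query, {"entity": entity, "value": value}
```

The returned query and parameter dictionary can then be handed to whichever Neo4j client is in use. `MERGE` rather than `CREATE` keeps repeated imports from duplicating nodes and relationships.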
In practice, once the media asset knowledge graph has been built, D3 can be used to visualize it on a web page. Specifically, a Cypher query is run for the keywords entered on the web page (for example, by a user), the Neo4j query result is packaged as JSON, and the JSON is passed to D3, which draws the graph and displays it on the page.
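The packaging step can be sketched as a small conversion from query result rows to a node-link JSON document. The row shape `(source, relation, target)` and the function name `rows_to_d3` are assumptions; the `nodes`/`links` layout with `id`, `source`, and `target` keys is a common convention for D3 force-directed graphs, where links are typically resolved by node id.

```python
import json


def rows_to_d3(rows):
    """Convert (source, relation, target) rows into node-link JSON."""
    nodes, links, seen = [], [], set()
    for source, relation, target in rows:
        for name in (source, target):
            if name not in seen:  # emit each node once
                seen.add(name)
                nodes.append({"id": name})
        links.append({"source": source, "target": target, "label": relation})
    # ensure_ascii=False keeps Chinese entity names readable in the payload
    return json.dumps({"nodes": nodes, "links": links}, ensure_ascii=False)
```

For example, `rows_to_d3([("Video A", "directed_by", "Director B")])` yields a document with two nodes and one link.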
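FIG. 3 shows a flowchart of a method 300 for constructing a knowledge graph from multi-source media asset data according to an embodiment of the present invention. In the context of the present invention, multi-source media asset data may come, for example, from semi-structured and/or unstructured web pages. It may of course also come from other types of sources, such as specific media libraries.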
In step 301, media asset information is extracted from the multi-source data to form entity-centric knowledge graph triples. According to an embodiment of the present invention, extracting media asset information includes entity extraction, relation extraction, and attribute extraction. According to an embodiment of the present invention, it further includes parsing web pages to extract the keywords used to build the media asset tag library and to extract structured information.
Specifically, the TextRank algorithm extracts web page keywords to form the media asset tag library, and the entry name of a web page, that is, the entity name, is captured according to its font characteristics. According to an embodiment of the present invention, given the characteristics of media asset information in the television industry, media asset entities are divided into audiovisual entities, non-audiovisual entities, and person entities; the relations between entities are determined from the associations among these three classes, and the attributes corresponding to the entities and relations are obtained. The triples for building the knowledge graph are then extracted from the acquired structured information, forming entity-centric triples {entity, relation, attribute}.
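The keyword step can be illustrated with a minimal TextRank sketch: candidate words become graph nodes, words that co-occur within a small window are linked, and a PageRank-style iteration scores the nodes. The text gives no parameters, so the window size, damping factor of 0.85, iteration count, and the function name `textrank_keywords` are assumptions; tokenization and stop-word filtering (needed for Chinese web pages) are assumed to happen beforehand.

```python
from collections import defaultdict


def textrank_keywords(words, top_n=5, window=2, damping=0.85, iterations=50):
    """Rank pre-tokenized candidate words with unweighted TextRank."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        # Link each word to the next `window` words (undirected edges).
        for u in words[i + 1:i + 1 + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    if not neighbors:
        return []
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives a share of every neighbor's score.
        score = {
            w: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(score, key=lambda w: (-score[w], w))[:top_n]
```

The top-ranked words of each page would then be stored as that entity's media asset tags.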
In step 302, the media asset information extracted in step 301 is fused to form updated triples. According to an embodiment of the present invention, step 302 further includes entity alignment and attribute unification. In entity alignment, entities are given a secondary classification according to their attributes, which narrows the entity matching range. An entity matching model built on an attention-based convolutional neural network then performs semantic matching on entities, entity attributes, and entity keywords to align the entities. After entity alignment, the attribute vocabulary from the different data sources is unified and the media asset tags are added as supplementary attribute vocabulary, updating the entity-centric triples.
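Narrowing the matching range by classification is a form of blocking: only entities that fall into the same category are ever compared by the matching model. The sketch below assumes entities are dictionaries and that `category_of` is a caller-supplied function (for example, entity class plus a genre attribute); both are hypothetical, since the text does not spell out the classification rules.

```python
from collections import defaultdict


def candidate_pairs(source_a, source_b, category_of):
    """Yield only the cross-source pairs that share a secondary category."""
    blocks = defaultdict(list)
    for entity in source_b:
        blocks[category_of(entity)].append(entity)
    for entity in source_a:
        # Entities in other categories are never sent to the matching model.
        for other in blocks.get(category_of(entity), []):
            yield entity, other
```

With m and n entities in the two sources, this replaces m*n model calls with the sum of the per-category products, which is where the reduction in matching cost and in cross-category false matches comes from.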
In step 303, the media asset knowledge graph is built from the updated triples. According to an embodiment of the present invention, the graph database Neo4j is used to store and build the media asset knowledge graph.
In summary, compared with the prior art, the present invention has the following main advantages:
1. Efficiency: compared with media asset management systems that rely on manual matching, knowledge graph algorithms associate media asset content automatically and improve the precision of search and recommendation;
2. Generality: the media asset data comes from ordinary web pages, such as Baidu Baike and Douban Movie; the data is complete and up to date, and suits media asset service platforms across a range of application scenarios;
3. Learning capability: entity alignment uses the TextRank algorithm and neural network techniques, which can be tuned automatically and continuously through training, so the accuracy of the final entity matching keeps improving;
4. Correction through human experience: entity attributes are used in entity alignment to give entities a secondary classification, which narrows the entity matching range and, by drawing on human experience, effectively lowers the matching error rate of unsupervised machine learning algorithms.
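FIG. 4 shows a block diagram of an exemplary computing device 400 according to an embodiment of the present invention; the computing device 400 is one example of a hardware device to which aspects of the present invention can be applied. The computing device 400 may be any machine configured to perform processing and/or computation, including but not limited to a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smartphone, an on-board computer, or any combination thereof. The computing device 400 may include components that are connected to or communicate with a bus 402 via one or more interfaces. For example, the computing device 400 may include the bus 402, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (for example, dedicated processing chips). The input device 406 may be any type of device capable of inputting information to the computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote controller. The output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The computing device 400 may also include, or be connected to, a non-transitory storage device 410, which may be any storage device that is non-transitory and capable of storing data, including but not limited to a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, an optical disc or any other optical medium, ROM (read-only memory), RAM (random access memory), cache memory and/or any memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. The non-transitory storage device 410 may be detachable from an interface. The non-transitory storage device 410 may hold data/instructions/code for implementing the methods and steps described above. The computing device 400 may also include a communication device 412. The communication device 412 may be any type of device or system capable of communicating with internal apparatus and/or with a network and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth device, an IEEE 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.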
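The bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.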
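The computing device 400 may further include a working memory 414, which may be any type of working memory capable of storing instructions and/or data that facilitate the operation of the processor 404, and may include, but is not limited to, random access memory and/or read-only memory devices.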
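Software components may reside in the working memory 414, including but not limited to an operating system 416, one or more application programs 418, drivers, and/or other data and code. Instructions for implementing the above methods and steps of the present invention may be contained in the one or more application programs 418, and the above method 300 of the present invention may be carried out by the processor 404 reading and executing the instructions of the one or more application programs 418.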
It should also be recognized that variations may be made according to specific requirements. For example, custom hardware may be used, and/or particular components may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. In addition, connections to other computing devices, such as network input/output devices, may be employed. For example, some or all of the disclosed methods and devices may be implemented by programming hardware (for example, programmable logic circuits including field-programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in assembly language or a hardware programming language (for example, VERILOG, VHDL, C++) using the logic and algorithms according to the present invention.
Although aspects of the present invention have so far been described with reference to the accompanying drawings, the methods and devices described above are only examples, and the scope of the present invention is not limited to these aspects but is defined only by the appended claims and their equivalents. Various components may be omitted or replaced by equivalent components. The steps may also be performed in an order different from that described in the present invention, and the various components may be combined in various ways. It is also important that, as technology develops, many of the described components may be replaced by equivalent components that appear later.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111614178.1A CN114911944A (en) | 2021-12-27 | 2021-12-27 | A method for building knowledge graph of multi-source media resources based on TextRank algorithm combined with convolutional neural network model |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114911944A true CN114911944A (en) | 2022-08-16 |
Family
ID=82762555
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111614178.1A Pending CN114911944A (en) | 2021-12-27 | 2021-12-27 | A method for building knowledge graph of multi-source media resources based on TextRank algorithm combined with convolutional neural network model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114911944A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116910560A (en) * | 2023-07-31 | 2023-10-20 | 中移(杭州)信息技术有限公司 | Training method of media asset tag prediction model and media asset tag prediction method |
| CN119740649A (en) * | 2025-03-03 | 2025-04-01 | 北京大学 | Software knowledge graph construction method and system for low-code template recommendation |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111291139A (en) * | 2020-03-17 | 2020-06-16 | 中国科学院自动化研究所 | A Completion Method for Long-tail Relationships in Knowledge Graph Based on Attention Mechanism |
| CN111444351A (en) * | 2020-03-24 | 2020-07-24 | 清华苏州环境创新研究院 | A method and device for constructing knowledge graph in industrial process field |
| CN111475629A (en) * | 2020-03-31 | 2020-07-31 | 渤海大学 | A knowledge graph construction method and system for mathematics tutoring question answering system |
| CN111598702A (en) * | 2020-04-14 | 2020-08-28 | 徐佳慧 | Knowledge graph-based method for searching investment risk semantics |
| CN112860916A (en) * | 2021-03-09 | 2021-05-28 | 齐鲁工业大学 | Movie-television-oriented multi-level knowledge map generation method |
| CN112948510A (en) * | 2021-04-21 | 2021-06-11 | 央视国际网络无锡有限公司 | Construction method of knowledge graph in media industry |
| WO2021196520A1 (en) * | 2020-03-30 | 2021-10-07 | 西安交通大学 | Tax field-oriented knowledge map construction method and system |
Non-Patent Citations (1)
| Title |
|---|
| 李向华: "汉语语用移情优选机制及其应用研究", vol. 1, 31 October 2021, 上海交通大学出版社, pages: 266 - 275 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12412044B2 (en) | Methods for reinforcement document transformer for multimodal conversations and devices thereof | |
| US11675977B2 (en) | Intelligent system that dynamically improves its knowledge and code-base for natural language understanding | |
| US10831796B2 (en) | Tone optimization for digital content | |
| US9613093B2 (en) | Using question answering (QA) systems to identify answers and evidence of different medium types | |
| US12032915B2 (en) | Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model | |
| US11699034B2 (en) | Hybrid artificial intelligence system for semi-automatic patent infringement analysis | |
| JP2020123318A (en) | Method, apparatus, electronic device, computer-readable storage medium and computer program for determining text correlation | |
| US9514098B1 (en) | Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases | |
| US12099537B2 (en) | Electronic device, contents searching system and searching method thereof | |
| CN112100332A (en) | Word embedding expression learning method and device and text recall method and device | |
| CN111523019B (en) | Method, apparatus, device and storage medium for outputting information | |
| Kroon et al. | Advancing automated content analysis for a new era of media effects research: The key role of transfer learning | |
| CN108287875A (en) | Personage's cooccurrence relation determines method, expert recommendation method, device and equipment | |
| CN110705304B (en) | An attribute word extraction method | |
| CN111813993A (en) | Video content expanding method and device, terminal equipment and storage medium | |
| CN114911944A (en) | A method for building knowledge graph of multi-source media resources based on TextRank algorithm combined with convolutional neural network model | |
| US11507593B2 (en) | System and method for generating queryeable structured document from an unstructured document using machine learning | |
| Khan et al. | Urdu sentiment analysis | |
| CN114912011B (en) | A video recommendation method based on content extraction and rating prediction | |
| CN112199487A (en) | Knowledge graph-based film question-answer query system and method thereof | |
| KR101602342B1 (en) | Method and system for providing information conforming to the intention of natural language query | |
| Maree | Multimedia context interpretation: a semantics-based cooperative indexing approach | |
| CN116955703A (en) | Video search method and device | |
| Pavithra et al. | Aspect-Based Sentiment Analysis: An Extensive Study of Techniques, Challenges, and Applications | |
| CN115391542A (en) | Classification model training method, text classification method, device and equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||