CN107368506A - Unstructured data analysis system and method - Google Patents
- Publication number
- CN107368506A, CN201610496280.9A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
 
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Cross-Reference to Related Applications
This patent application claims priority to co-pending U.S. Provisional Patent Application No. 62/159,662, filed on May 11, 2015, and entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS INCLUDING A VISUALIZATION INTERFACE," and to co-pending U.S. Provisional Patent Application No. 62/159,683, filed on May 11, 2015, and entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS INCLUDING NATURAL LANGUAGE PROCESSING AND STATISTICS FUNCTIONS," the contents of both of which are incorporated herein by reference in their entirety.
Technical Field
The present invention relates generally to methods and systems for analyzing large text corpora and unstructured data. More specifically, the present invention relates to methods and systems for analyzing large text corpora and unstructured data using visual analytics and topic modeling, a visualization interface, and natural language processing and statistics functions.
Background
Managing large and growing collections of textual information and unstructured data is a challenging problem. Data repositories of knowledge-rich textual information have become ubiquitous, resulting in massive amounts of data to be organized, explored, and analyzed. As the number of documents grows, learning what a text corpus means becomes cognitively costly and time-consuming.
For researchers in the field of natural language processing (NLP), the challenge of automatically summarizing large text corpora has become a major focus. To summarize text corpora, researchers have developed techniques such as latent semantic analysis (LSA), which extracts and represents the contextual-usage meaning of words. LSA produces a concept space that can be used for document classification and clustering. More recently, probabilistic topic models have emerged as a powerful new technique for finding semantically meaningful topics in unstructured text collections. To further provide visual summaries of text corpora, researchers in the knowledge discovery and visualization communities have developed tools and techniques that support the visualization and exploration of large text corpora based on both LSA and probabilistic topic models.
Although probabilistic topic models have demonstrated their strengths in interpretability and semantic coherence, few interactive visualization systems have used such models to support the exploration and analysis of text corpora. Exemplar-based visualization and probabilistic latent semantic visualization methods project documents onto a semantic two-dimensional (2D) plot while estimating the topics of the text corpus. Although the resulting document clusters conform well to the selected labels, there is little opportunity for interactive exploration and analysis of those clusters. One exception is the time-based visualization system TIARA, which applies the ThemeRiver metaphor to visually summarize a text collection based on topical content. By analyzing a corpus with TIARA, users can answer questions such as: What are the main topics in the document corpus? How do the topics evolve over time?
However, when analyzing large text corpora, there are many other real-world questions that current text analysis visualization systems have difficulty answering. In particular, questions about the relationship between topics and documents are hard to answer with existing tools. Such questions include: What are the characteristics of documents based on their topic distributions? Which documents address multiple topics at once, and what are those topics? In the domain of science policy, for example, a document with multiple topics may indicate an interdisciplinary publication (i.e., one covering more than one body of knowledge). Similarly, in the context of social media analysis, a document with multiple topics may represent a unique news article that relates to several different hot topics.
To overcome the shortcomings of existing methods and systems, and to help users understand large text corpora more effectively, the present invention provides a novel visual analytics system that integrates a state-of-the-art probabilistic topic model, latent Dirichlet allocation (LDA), with interactive visualization. To describe a document corpus, the methods and systems of the present invention first use LDA to extract a set of semantically meaningful topics. Unlike most traditional clustering techniques, which assign each document to one specific cluster, the LDA model accounts for the different topical aspects of each individual document. This permits efficient, comprehensive text analysis of larger documents that may contain multiple topics. To highlight this property of the model, the methods and systems of the present invention use the parallel coordinates metaphor to present the probability distribution of documents across topics. This presentation allows users to distinguish single-topic from multi-topic documents and to see the relative importance of each topic to a document of interest. In addition, because most text corpora are inherently temporal, the systems and methods of the present invention also show how topics evolve over time.
Furthermore, the present invention enables companies, including their analysts, marketers, business unit leaders, information technology staff, and C-level employees, to gain actionable insights from any type of text data. The technology allows people to ground their decision-making in data. It ingests text data and, through deep computational and statistical algorithms, identifies the themes, topics, and emerging issues within each data set. Results are displayed in an interactive visual format so that anyone in the company can analyze the data holistically or at a granular level. All types of text data can be analyzed: internal data (e.g., emails, chats, surveys, call center records, and focus groups) or external data (e.g., social media, review sites, forums, and news sites). The technology can handle a large number of languages, ensuring that feedback loops from all over the world can be analyzed. Highly customizable features that allow the analysis to be tuned can, however, be selected. Most companies are sitting on a treasure trove of unstructured text data but have little ability to mine it for intelligence.
Summary of the Invention
Again, in various example embodiments, the methods and systems of the present invention tightly integrate interactive visualization with a state-of-the-art probabilistic topic model. Specifically, to address the questions raised above, the methods and systems of the present invention use the parallel coordinates (PC) metaphor to present the probability distribution of documents across topics. This carefully chosen presentation shows not only how many topics a document relates to, but also how important each topic is to the document. In addition, the methods and systems of the present invention provide a rich set of interactions that help users automatically partition a document collection based on the number of topics in each document. Beyond showing the relationships between topics and documents, the methods and systems of the present invention also support other tasks essential to understanding a document collection, such as summarizing its main topics and showing how topics evolve over time.
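The partitioning of a document collection by number of topics, described in the paragraph above, can be sketched as follows. This is a minimal illustration only: the function names, the 0.2 probability threshold, and the sample data are assumptions introduced here and are not taken from the original text.

```python
from collections import defaultdict

# Hypothetical group labels; the original text does not name them.
LABELS = {0: "no dominant topic", 1: "one topic",
          2: "two topics", 3: "more than two topics"}

def count_topics(topic_probs, threshold=0.2):
    """Count topics whose probability meets an assumed threshold."""
    return sum(1 for p in topic_probs if p >= threshold)

def partition_by_topic_count(doc_topics, threshold=0.2):
    """Group document ids by how many topics each document carries."""
    groups = defaultdict(list)
    for doc_id, probs in doc_topics.items():
        n = min(count_topics(probs, threshold), 3)  # 3 stands for "more than two"
        groups[LABELS[n]].append(doc_id)
    return dict(groups)

# Made-up per-document topic distributions (each row sums to 1).
doc_topics = {
    "doc_a": [0.90, 0.05, 0.05],
    "doc_b": [0.50, 0.45, 0.05],
    "doc_c": [0.34, 0.33, 0.33],
}
partitions = partition_by_topic_count(doc_topics)
```

A fixed threshold is only one possible criterion; any rule that separates dominant from negligible topic weights would serve the same purpose.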
The set of questions that the methods and systems of the present invention can effectively address when analyzing large text corpora includes: What are the main topics that capture the document collection? What are the characteristics of documents based on their topic distributions? Which documents address multiple topics at once? How do topics of interest evolve over time? To help users answer these questions, the methods and systems of the present invention first use the LDA model to extract a set of semantically meaningful topics. To support visual exploration of a document collection based on the topic model, the methods and systems of the present invention employ multiple coordinated views that highlight both the topical and the temporal features of the document corpus. One novel contribution of the methods and systems of the present invention lies in depicting the probability distribution of documents over topics and in supporting interactive identification and more detailed examination of single-topic and multi-topic documents.
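As a sketch of how topic evolution over time might be computed for a temporal view, the following averages per-document topic probabilities within each time period. The function name, the use of calendar years as periods, and the sample data are assumptions for illustration; the original text does not specify the aggregation.

```python
from collections import defaultdict

def topic_trends(docs, num_topics):
    """docs: iterable of (period, topic_probs) pairs.
    Returns {period: [mean probability per topic]} in period order."""
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for period, probs in docs:
        for k, p in enumerate(probs):
            sums[period][k] += p
        counts[period] += 1
    return {period: [s / counts[period] for s in sums[period]]
            for period in sorted(sums)}

# Made-up documents: (year, distribution over two topics).
docs = [(2014, [0.8, 0.2]), (2014, [0.6, 0.4]), (2015, [0.25, 0.75])]
trends = topic_trends(docs, num_topics=2)
```

Plotting each topic's mean weight per period as a stacked band would give a ThemeRiver-style picture of the kind mentioned in the Background.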
In one example embodiment, the present invention provides a computerized method for text data analysis, comprising: receiving, at one or more processors, text data to be analyzed from one or more memories; formatting the text data for subsequent analysis using the one or more processors; applying, using the one or more processors, a probabilistic topic model to the text data to extract a set of semantically meaningful topics that collectively describe all or a portion of the text data; generating, using a keyword weighting module executing on the one or more processors, a topic cloud view that represents the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; generating, using a topic ordering module executing on the one or more processors, a document distribution view that represents the distribution of all or a portion of the text data over the plurality of topics; generating, using a document entropy calculation module executing on the one or more processors, a document scatterplot view that represents how many topics are attributable to all or a portion of the text data; generating, using a temporal topic trend calculation module executing on the one or more processors, a temporal view that represents how the occurrence of topics changes over time with respect to all or a portion of the text data; and displaying to a user one or more of the topic cloud view, the document distribution view, the document scatterplot view, and the temporal view during analysis of all or a portion of the text data. The text data comprises one or more of: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the Internet. The probabilistic topic model generates a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of the topics. Optionally, the keywords are ordered to indicate their importance to a given topic and their relationships to one another. Optionally, keywords are highlighted to indicate their importance to multiple topics. The topics are ordered to represent their relationships. Various other example functionalities are also provided herein.
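The document entropy calculation mentioned above can be illustrated with Shannon entropy over a document's topic distribution. This is a sketch under an assumption: the original text names a document entropy calculation module but does not give its formula, and the function name and sample distributions are hypothetical.

```python
import math

def topic_entropy(topic_probs):
    """Shannon entropy (in bits) of one document's topic distribution.
    Zero-probability topics contribute nothing to the sum."""
    return -sum(p * math.log2(p) for p in topic_probs if p > 0)

# Made-up topic distributions for two documents.
single_topic_doc = [0.97, 0.01, 0.01, 0.01]
multi_topic_doc = [0.40, 0.35, 0.20, 0.05]

low = topic_entropy(single_topic_doc)
high = topic_entropy(multi_topic_doc)
```

Low entropy indicates that one topic dominates a document; entropy approaches log2 of the number of topics as the weights even out, so the value can serve as one axis of a scatterplot that separates single-topic from multi-topic documents.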
In another example embodiment, the present invention provides a computerized method for text data analysis, comprising: one or more memories operable to store text data to be analyzed and one or more processors operable to receive the text data to be analyzed; an algorithm executing on the one or more processors and operable to format the text data for subsequent analysis; an algorithm executing on the one or more processors and operable to apply a probabilistic topic model to the text data to extract a set of semantically meaningful topics that collectively describe all or a portion of the text data; a keyword weighting module executing on the one or more processors and operable to generate a topic cloud view that represents the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; a topic ordering module executing on the one or more processors and operable to generate a document distribution view that represents the distribution of all or a portion of the text data over the plurality of topics; a document entropy calculation module executing on the one or more processors and operable to generate a document scatterplot view that represents how many topics are attributable to all or a portion of the text data; a temporal topic trend calculation module executing on the one or more processors and operable to generate a temporal view that represents how the occurrence of topics changes over time with respect to all or a portion of the text data; and a display operable to display to a user one or more of the topic cloud view, the document distribution view, the document scatterplot view, and the temporal view during analysis of all or a portion of the text data. The text data comprises one or more of: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the Internet. The probabilistic topic model generates a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of the topics. Optionally, the keywords are ordered to indicate their importance to a given topic and their relationships to one another. Optionally, keywords are highlighted to indicate their importance to multiple topics. The topics are ordered to represent the relationships among them. Various other example functionalities are also provided herein.
Again, the present invention enables companies, including their analysts, marketers, business unit leaders, information technology staff, and C-level employees, to gain actionable insights from any type of text data. The technology allows people to ground their decision-making in data. It ingests text data and, through deep computational and statistical algorithms, identifies the themes, topics, and emerging issues within each data set. Results are displayed in an interactive visual format so that anyone in the company can analyze the data holistically or at a granular level. All types of text data can be analyzed: internal data (e.g., emails, chats, surveys, call center records, and focus groups) or external data (e.g., social media, review sites, forums, and news sites). The technology can handle a large number of languages, ensuring that feedback loops from all over the world can be analyzed. Highly customizable features that allow the analysis to be tuned can, however, be selected. Most companies are sitting on a treasure trove of unstructured text data but have little ability to mine it for intelligence.
In an additional example embodiment, the present invention provides an unstructured data analysis system, comprising: an unstructured data analysis algorithm that resides on a server and is accessible via a browser, the unstructured data analysis algorithm being operable to receive unstructured data from one or more remote sources, apply one or more analytical tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data comprises one or more of: customer experience data, telecommunications data, email data, and social media data. The unstructured data analysis algorithm is further operable to receive external data from one or more remote sources. The external data comprises one or more of: Internet data, government data, and commercial data. The one or more analytical tools applied to the unstructured data comprise one or more of: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information and thereby the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer. The summary information is also displayed to the one or more users in a composite layer.
In another additional example embodiment, the present invention provides an unstructured data analysis method, comprising: providing an unstructured data analysis algorithm that resides on a server and is accessible via a browser, the unstructured data analysis algorithm being operable to receive unstructured data from one or more remote sources, apply one or more analytical tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data comprises one or more of: customer experience data, telecommunications data, email data, and social media data. The unstructured data analysis algorithm is further operable to receive external data from one or more remote sources. The external data comprises one or more of: Internet data, government data, and commercial data. The one or more analytical tools applied to the unstructured data comprise one or more of: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information and thereby the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer. The summary information is also displayed to the one or more users in a composite layer.
Brief Description of the Drawings
The present invention is illustrated and described herein with reference to the various drawings, in which like reference numerals are used to denote like method steps/system components, as appropriate, and in which:
FIG. 1 is a schematic diagram illustrating one example embodiment of the visual text corpus analysis tool of the present invention;
FIG. 2 is an example display illustrating the topic cloud view of the visual text corpus analysis tool of the present invention;
FIG. 3 is an example display illustrating the document distribution view of the visual text corpus analysis tool of the present invention;
FIG. 4 is a series of plots illustrating the distribution of documents over one topic, two topics, and more than two topics in accordance with the methods and systems of the present invention;
FIG. 5 is an example display illustrating the topic cloud view of the visual text corpus analysis tool of the present invention;
FIG. 6 is an example display illustrating the temporal view of the visual text corpus analysis tool of the present invention;
FIG. 7 is a schematic diagram illustrating one example embodiment of the unstructured data analysis system of the present invention;
FIG. 8 is a schematic diagram illustrating another example embodiment of the unstructured data analysis system of the present invention;
FIG. 9 is a schematic diagram illustrating an additional example embodiment of the unstructured data analysis system of the present invention;
FIG. 10 is a schematic diagram illustrating another example embodiment of the unstructured data analysis system of the present invention;
FIG. 11 is a schematic diagram illustrating one example embodiment of the presentation layer of the unstructured data analysis system of the present invention;
FIG. 12 is a schematic diagram illustrating one example embodiment of the exploration layer of the unstructured data analysis system of the present invention; and
FIG. 13 is a schematic diagram illustrating one example embodiment of the annotation layer of the unstructured data analysis system of the present invention.
Detailed Description
Two lines of work, namely text analysis models and text visualization techniques, were the primary inspiration for the initial design of the present invention. These concepts are refined and extended as described in more detail below.
The first major advance in text processing was the vector space model (VSM). In this model, a text is represented as a vector in a high-dimensional space in which each dimension is associated with a unique term in the documents. A well-known example of the VSM is TF-IDF, which evaluates how important a word is to a document in a corpus. Although the VSM has proven effective empirically, it has a number of inherent shortcomings in capturing inter-document and intra-document statistical structure.
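A minimal sketch of TF-IDF weighting follows, using one common variant (term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency). Other variants exist; the original text does not say which is meant, and the function name and sample corpus are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up, pre-tokenized corpus.
corpus = [
    ["topic", "model", "text"],
    ["text", "visualization", "text"],
    ["topic", "cloud", "view"],
]
weights = tf_idf(corpus)
```

In the first document, "model" outweighs "text" because "model" occurs in only one document while "text" occurs in two; a term that occurs in every document receives a weight of zero.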
To overcome the shortcomings of the VSM, researchers introduced LSA, a factor analysis that reduces the term-document matrix to a subspace of much lower dimension that captures most of the variance in the corpus. Although LSA overcomes some of the drawbacks of the VSM, it has limitations of its own. The new feature space is difficult to interpret because each dimension is a linear combination of a set of words from the original space.
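The reduction performed by LSA is commonly written as a truncated singular value decomposition. The notation below is assumed for illustration (the symbols X, U_k, Σ_k, V_k, k, and x_d do not appear in the original text): X is the term-document matrix with m terms and n documents, and x_d is the term vector of a single document.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
X \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k},\;
k \ll \min(m, n)

\hat{x}_d = \Sigma_k^{-1} U_k^{\top} x_d
```

Each coordinate of the reduced vector is a weighted sum over all m original terms, with positive and negative weights, which is why the individual dimensions are hard to interpret.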
Recognizing the limitations of LSA, researchers proposed generative probabilistic models for document modeling. For example, researchers introduced generative models that represent the content of words and documents with probabilistic topics rather than with a purely spatial representation. One distinct advantage of this representation is that each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms. The LDA model assumes a latent structure consisting of a set of topics; each document is generated by choosing a distribution over topics and then generating each word at random from a topic selected using that distribution. The extracted topics capture meaningful structure in otherwise unstructured data, as shown, for example, by analyses of scientific abstracts and newspaper archives. At the cognitive level, the LDA model performs well in predicting word association and the effects of semantic association and ambiguity on a variety of language processing and memory tasks.
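The generative process described above can be sketched in a few lines. This shows only the forward process that LDA assumes (how a document would be produced from known topics), not the inference that recovers topics from a corpus. The topic-word probabilities, the Dirichlet parameter values, and all function names are made-up assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """topic_word: one {word: probability} dict per topic."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    words = []
    for _ in range(length):
        # Choose a topic according to theta, then a word from that topic.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        vocab, probs = zip(*topic_word[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics, each a distribution over a small vocabulary.
topic_word = [
    {"gene": 0.5, "cell": 0.3, "protein": 0.2},
    {"market": 0.6, "price": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
```

Because theta is drawn per document, one document may mix words from several topics, which is the property that distinguishes this model from clustering methods that assign each document to a single cluster.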
Because of the various advantages of the LDA model, the methods and systems of the present invention first use it to extract a set of semantically meaningful topics from a given text corpus. The methods and systems of the present invention then present the probabilistic results in an intuitive manner so that users can readily consume the complex model when analyzing large text corpora.
Despite the advances in automatic text processing techniques, human intelligence still plays a key role in analyzing text corpora. Accordingly, a large number of visualization systems and techniques based on text processing methods have been developed to keep the user in the loop.
Using the VSM, for example, tools have been introduced to visualize email content with the aim of portraying relationships based on conversation histories. The keywords in the visualization are generated using the TF-IDF algorithm.
Other tools enable users to visually explore text corpora through a social-network metaphor based on latent semantic analysis results. Other visualization systems have used multidimensional projection methods, such as principal component analysis (PCA) and/or multidimensional scaling (MDS), to visualize text corpora. These projection techniques are similar in spirit to LSA in that they represent texts as vectors with term frequencies as their features and then identify a lower-dimensional projection space. Visualization systems based on these projection techniques include IN-SPIRE. More recently, others have proposed a two-level framework for topology-based projection and visualization of large collections of classified documents. However, unlike most traditional clustering techniques, which assign each document to one specific cluster, the method and system of the present invention consider the different topical aspects of each individual document.
Since topic models first appeared, visualization systems have adopted them because of their advantages over earlier text processing techniques. Exemplar-based visualization and probabilistic latent semantic visualization tools project documents onto a static 2D plot while estimating the topics of the text corpus. Although the resulting visual clusters are better than those obtained from multidimensional projection methods, these tools have several limitations. First, as the number of extracted topics grows, the document clusters in the 2D projection are no longer separable by topic. Furthermore, these tools leave little room for interactive exploration and analysis of document clusters. More recently, TIARA has been introduced: a time-based interactive visualization system that presents the topics extracted from a given text corpus in a time-sensitive manner. TIARA provides a good overview of the topics and of how they evolve over time. However, the relationship between documents and topics is less clear.
Therefore, in addition to depicting topic evolution over time, the method and system of the present invention present the probability distribution of documents across the extracted topics. The method and system thereby provide an overview of document characteristics based on their topic distributions and enable users to identify documents that address multiple topics at once.
The method and system of the present invention support exploration of a document collection at multiple levels. At the overview level, the system helps the user answer the following questions: What are the main topics of the document collection? And what are the characteristics of the documents in the collection? At the facet level, the system supports activities such as identifying the temporal trend of a particular topic and identifying documents related to multiple topics of interest. At the detail level, the system allows access to the detailed content of each individual document on demand. Built on a state-of-the-art topic model, the system employs multiple coordinated views, each addressing one of the above questions.
Referring now specifically to FIG. 1, in one exemplary embodiment, the overall architecture of the visual text corpus analysis tool 10 of the present invention includes an offline text preprocessing module 12 and a topic modeling module 14. The text preprocessing module 12 is operable to condition the text of the relevant documents 16 for subsequent processing, exploration, and analysis. Such text preprocessing may include, but is not limited to, preprocessing of text from social media (e.g., Twitter posts and Facebook profiles), books (e.g., literature from the Gutenberg online book project), and other documents (e.g., emails, Word documents, etc.).
As described above, topic models have several advantages over traditional text processing techniques. Accordingly, the visual text corpus analysis tool 10 of the present invention uses a probabilistic topic model in the topic modeling module 14 to summarize the relevant documents 16. More specifically, LDA is first used to extract a set of semantically meaningful topics. LDA produces a set of latent topics, each represented as a multinomial distribution over keywords, and assumes that each document can be described as a probabilistic mixture of these topics. P(z) is the distribution over topics z in a particular document. Assume that the text collection 16 includes D documents and T topics. Determining the topics is an iterative process carried out with the visual text corpus analysis tool 10. The tool 10 enables users to interactively specify the number of topics they consider necessary in their analysis domain. Users may modify the topic modeling module 14 based on findings from their visual interaction and investigation, changing the number of topics and/or defining the number of iterations of the process. The visual text corpus analysis tool 10 also enables users to add, remove, and merge topics in the topic modeling module 14.
Accordingly, the document collection 16 is first preprocessed to remove stop words and the like. The Stanford Topic Modeling Toolbox (STMT) or a similar tool is then used to extract a set of topics from the document collection 16. The extracted topics and the probabilistic document distributions serve as input to the subsequent visualizations.
The visual design of the tool 10 of the present invention includes four coordinated overviews, which can be displayed and operated individually or in combination on a suitable graphical user interface (GUI): (1) a document distribution view 18 that shows the probability distribution of documents across topics; (2) a topic cloud 20 that presents the content of the extracted topics; (3) a temporal view 22 that highlights the temporal evolution of topics; and (4) a document scatterplot view 24 that facilitates interactive selection of single-topic versus multi-topic documents. Each of the four overviews serves a different purpose, and they are coordinated through a rich set of user interactions. In addition, when any document is selected, a detail view presents the text content of that document on demand.
To help users quickly grasp the gist of the document collection, the main topics are presented as a tag cloud in the topic cloud view 20. In the topic cloud view 20, each row displays one topic, which includes, for example, multiple keywords related to that topic. Because each topic is modeled as a multinomial distribution over keywords, the weight of each keyword indicates its importance to the topic. To encode this information in the tag cloud, keywords are aligned from left to right, with the most important keywords placed first. In addition, because a keyword can appear in multiple topics, the display size or weight of each keyword reflects its occurrence across all topics. However, it will be apparent to those skilled in the art that other configurations can be used. An example of the topic cloud view 20 is provided in FIG. 2. To help users understand the main topics in the document collection 16, the topics are presented in a sequence in which semantically similar topics are placed close together, so that there is continuity when the topics are browsed in order. Because the LDA model does not model relationships between topics, the topics are reordered by defining a similarity measure. The visual text corpus analysis tool 10 uses the Hellinger distance function to characterize a similarity measure that indicates how close topics are to one another. The visual text corpus analysis tool 10 visualizes this similarity measure to give users an understanding of the semantic layer of the topic distribution and to help reduce their cognitive overload by clustering the topic space.
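As a minimal sketch of the Hellinger distance mentioned above, the function below compares two topics expressed as probability vectors over the same keyword vocabulary. The function name and the vector layout are assumptions made for illustration; the source does not specify how the tool stores topic distributions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Returns 0.0 for identical distributions and 1.0 for distributions that
    share no probability mass.
    """
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# Two toy topics over a four-keyword vocabulary (illustrative values only).
topic_a = [0.4, 0.4, 0.1, 0.1]
topic_b = [0.1, 0.1, 0.4, 0.4]
distance = hellinger(topic_a, topic_b)
```

Because the distance is bounded between 0 and 1, it can be used directly as a normalized dissimilarity when deciding which topics to place next to each other.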
The topic cloud view 20 also provides a set of interactions that help users understand the topics quickly. For example, hovering over a particular keyword highlights all other occurrences of that keyword in the tag cloud. Users can also search for specific keywords of interest. In addition, the topic cloud view 20 is tightly coordinated with all other views so that information about a particular topic is available quickly on demand.
The topic cloud view 20 is generated in part by an online keyword weighting module 26, which is operable to aggregate the results of the topic modeling module 14. It sorts the words in a given topic by their probability in that topic, so that more probable words are placed at the top of the sorted list. This probability is the value computed by the topic modeling module 14. The size of a word in the topic cloud view is determined, for example, by the word's frequency in the entire text corpus, normalized by the maximum word frequency: the higher the frequency, the larger the word. By default, the tool 10 shows, for example, the 50 most probable words of each topic. Users can change the number of words interactively.
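The ranking and sizing rules described above can be sketched as follows. This is an illustrative sketch, not the module's actual implementation: the function name `keyword_layout`, the dictionary inputs, and the choice to return a relative size in [0, 1] are all assumptions.

```python
def keyword_layout(topic_word_probs, corpus_freq, top_n=50):
    """Return (word, relative_size) pairs for one topic row of the tag cloud.

    Words are ordered by their probability within the topic; the relative size
    is the word's corpus-wide frequency divided by the maximum word frequency.
    """
    ranked = sorted(topic_word_probs, key=topic_word_probs.get, reverse=True)
    max_freq = max(corpus_freq.values())
    return [(w, corpus_freq.get(w, 0) / max_freq) for w in ranked[:top_n]]

# Toy inputs (illustrative values only).
topic = {"robot": 0.40, "sensor": 0.35, "motion": 0.25}
freq = {"robot": 80, "sensor": 200, "motion": 50, "data": 400}
row = keyword_layout(topic, freq)
```

Note that order and size are driven by different quantities: "robot" comes first because it is the most probable word in the topic, while "sensor" is drawn larger because it is more frequent in the corpus as a whole.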
To provide an overview of the documents as mixtures of topics, the tool 10 of the present invention highlights the distribution of each document across all extracted topics. The chosen representation turns each document's probability distribution into a signal-like pattern that represents the document. More specifically, a parallel coordinates metaphor is employed, in which each axis represents one topic and each line represents one document in the collection 16. This is illustrated in FIG. 3. In the chosen configuration, all variables (i.e., topics) are evenly spaced and share the same value range from 0 to 1. Therefore, when viewing the document distribution view 18, a document need not be understood from its value on each individual axis; it can instead be understood from its overall pattern across all axes. However, it will be apparent to those skilled in the art that other configurations can be used.
One limitation of LDA is that it does not directly model correlations between the occurrences of topics, although such correlations are naturally expected in most text corpora. The tool 10 of the present invention uses visualization to overcome this limitation by making the correlations between topics more salient. Conveniently, one property of a parallel coordinates visualization is that relationships between adjacent axes are easier to spot. Topics can therefore be ordered so that semantically similar topics are adjacent to one another, which makes the associations between similar topics visually prominent. This topic similarity is defined in terms of the Euclidean distance between two topics $i$ and $j$ over all documents 16:

$$\mathrm{dist}(z_i, z_j) = \sqrt{\sum_{k=1}^{D} \bigl(P(d_k \mid z=i) - P(d_k \mid z=j)\bigr)^2}$$

where $d_k$ is one of the D documents in the entire collection 16 and $P(d_k)$ is the probability distribution of the k-th document over all topics. Thus $P(d_k \mid z=i)$ denotes the probability of topic i in generating document k. When the topics are drawn as axes in the chosen interface, the ordering starts with the topic on which probability is most concentrated and then repeatedly finds the topic most similar to the current one based on the distance between topics. FIG. 3 illustrates the visualization of documents across topics after topic reordering. The relationship between any two most similar topics (i.e., those on adjacent axes) becomes visually identifiable.
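The axis ordering described above can be sketched as a greedy nearest-neighbor walk. In plain notation, the distance used is dist(i, j) = sqrt(sum over k of (P(d_k|z=i) - P(d_k|z=j))^2). Two points are assumptions for illustration: "the topic on which probability is most concentrated" is interpreted as the topic with the largest total probability mass over all documents, and the function names are hypothetical.

```python
import math

def topic_distance(doc_topic, i, j):
    """Euclidean distance between topics i and j over all documents.

    doc_topic[k][t] is the probability of topic t in document k.
    """
    return math.sqrt(sum((row[i] - row[j]) ** 2 for row in doc_topic))

def order_topics(doc_topic):
    """Greedy ordering: start at the heaviest topic, then hop to the nearest unused one."""
    n_topics = len(doc_topic[0])
    mass = [sum(row[t] for row in doc_topic) for t in range(n_topics)]
    order = [max(range(n_topics), key=mass.__getitem__)]
    remaining = set(range(n_topics)) - set(order)
    while remaining:
        # sorted() makes tie-breaking deterministic (lowest topic index wins).
        nearest = min(sorted(remaining),
                      key=lambda t: topic_distance(doc_topic, order[-1], t))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Four toy documents over three topics (illustrative values only).
docs = [
    [0.80, 0.10, 0.10],
    [0.70, 0.05, 0.25],
    [0.10, 0.80, 0.10],
    [0.60, 0.10, 0.30],
]
axis_order = order_topics(docs)  # [0, 2, 1]
```

In this toy input, topic 2 co-occurs with topic 0 in several documents, so it is placed on the axis next to topic 0, while topic 1 goes last.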
The document distribution view 18 is generated in part by an online topic ordering module 28, which is operable to perform the functions described above and to produce the signal representation of individual documents. Such a signal is an indication of the differing nature of documents. The view 18 shows that documents with a pronounced distribution on a single topic are strongly focused on one particular subject, whereas documents whose distribution spans two or three topics have a more varied focus.
When exploring the distribution of documents over topics, it is easy to see that documents exhibit different characteristics depending on the number of topics they contain. FIG. 4 shows documents 30 that focus on only one topic, documents 32 that cover two topics, and documents 34 that cover more than two topics. The number of topics within a document can be interpreted as a different characteristic depending on the context of the given document collection 16. For example, in a collection of scientific publications, a document with one topic represents a publication related to one particular field of scientific research. A document with two or more topics is more likely to represent an interdisciplinary research article, which typically integrates two or more bodies of specialized knowledge.
In addition, the document distribution view 18 provides a rich set of interactions, such as brushing and highlighting. Brushing a range of values on a topic axis allows the user to select the documents that have a particular probability for that topic. By combining information about the main topics and the document characteristics from both the topic cloud view 20 and the document distribution view 18, users can efficiently develop an overview of the document collection 16.
The document distribution view 18 enables users to identify documents that focus on a particular topic by brushing the upper range of that topic's axis. In a large corpus, however, documents related to two or more topics are harder to identify, because they are obscured by single-topic documents with high probability values. To alleviate this problem, all documents are laid out in a way that makes single-topic and multi-topic documents easy to tell apart. This is the document scatterplot view 24.
As can be seen in the document distribution view 18, each document is converted into a signal-like probability distribution pattern. In this representation, documents with multiple topics appear noisier than documents that clearly focus on one topic. In information theory, Shannon entropy is a measure of the amount of uncertainty associated with a random variable. Treating the topic as a random variable for each document in this context, Shannon entropy can be used to distinguish clean signals from noisy ones. The tool 10 of the present invention therefore applies Shannon entropy to differentiate documents according to the number of topics they contain. The entropy of each document, based on its probability distribution across topics, is computed as:

$$H(d_k) = -\sum_{i=1}^{T} P(d_k \mid z=i)\,\log P(d_k \mid z=i)$$

where $P(d_k)$ is the probability distribution of the k-th document over all topics. Each document can then be plotted in the document scatterplot view 24 according to its entropy and its maximum probability value over the topics (normalized to [0,1]) (see FIG. 5). In this presentation, for example, single-topic documents (higher maximum value and lower entropy) lie in the upper-left corner of the scatterplot, while the lower-right corner captures documents with a larger number of topics (lower maximum value and higher entropy). On selection, a pie chart is shown to depict the topic distribution of the particular document. In FIG. 5, each pie chart represents a selected document, and each color represents one topic. As shown, documents with smaller entropy values appear as solid, single-color pies, whereas documents with larger entropy values appear multicolored, indicating that the entropy value corresponds to the number of topics in the input document.
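The scatterplot coordinates described above can be sketched as follows, using the Shannon entropy H = -sum_i p_i log p_i of a document's topic distribution. The natural logarithm and the function names are assumptions; the source does not state the logarithm base.

```python
import math

def topic_entropy(dist):
    """Shannon entropy (natural log) of a document's topic distribution.

    Zero-probability topics contribute nothing, following the usual
    convention that 0 * log(0) is taken as 0.
    """
    return -sum(p * math.log(p) for p in dist if p > 0)

def scatter_point(dist):
    """(entropy, max topic probability) pair used to place one document."""
    return topic_entropy(dist), max(dist)

# Toy documents over four topics (illustrative values only).
single_topic = [0.94, 0.02, 0.02, 0.02]
two_topics = [0.48, 0.48, 0.02, 0.02]
many_topics = [0.25, 0.25, 0.25, 0.25]
points = [scatter_point(d) for d in (single_topic, two_topics, many_topics)]
```

With entropy on the horizontal axis and maximum probability on the vertical axis, the single-topic document lands toward the upper left and the evenly mixed document toward the lower right, matching the layout described for FIG. 5.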
In summary, the document scatterplot view 24 enables users to interactively identify subsets of documents with a desired number of topics by selecting documents in different regions. The document scatterplot view 24 is generated in part by a document entropy calculation module 36, which is operable to perform the functions described above and to group the documents in any given text corpus. The document scatterplot view 24 deliberately groups documents by their entropy and visually illustrates the focus of the given corpus, suggesting whether the corpus concentrates on a single subject or on varied subjects.
Because most document collections 16 accumulate over time, presenting this temporal information helps users understand how the topics of a corpus evolve. Referring now specifically to FIG. 6, the temporal view 22 is created as an interactive ThemeRiver, in which each band represents one topic. In a text corpus, each document is associated with a timestamp, so the height of each band over time can be determined by summing the distributions of the documents on that topic within each time frame. The unit of the time frame depends on the corpus: for example, a year may be a suitable unit for scientific publications, whereas a month or even a day is more suitable for a news corpus. Once the time unit has been chosen, the documents are divided into the corresponding time frames according to their timestamps. Then, for each time frame, the height of each topic is calculated by summing that topic's distribution over the documents within the time frame.
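The band-height calculation described above amounts to a grouped sum. The sketch below assumes each document is supplied as a (time_frame, topic_distribution) pair that has already been bucketed by the chosen time unit; this record layout and the function name are hypothetical.

```python
from collections import defaultdict

def band_heights(docs, n_topics):
    """Sum per-document topic probabilities within each time frame.

    Returns {time_frame: [height of topic 0, height of topic 1, ...]}.
    """
    heights = defaultdict(lambda: [0.0] * n_topics)
    for frame, dist in docs:
        for topic, prob in enumerate(dist):
            heights[frame][topic] += prob
    return dict(heights)

# Three toy documents bucketed by year (illustrative values only).
docs = [
    (2008, [0.75, 0.25]),
    (2008, [0.25, 0.75]),
    (2009, [1.00, 0.00]),
]
heights = band_heights(docs, n_topics=2)
```

Because each document's topic probabilities sum to 1, the total height of the river in a time frame equals the number of documents in that frame, so the overall thickness also shows how the corpus volume changes over time.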
The order of the topics (from top to bottom) is the same in, for example, both the topic cloud view 20 and the document distribution view 18. Colors or patterns are assigned to topics by interpolating a color or pattern spectrum using the normalized distances between all adjacent topics. As a result, the more similar two topics are, the more similar the colors or patterns assigned to them.
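One way to realize the interpolation described above is to place each topic along a hue spectrum at its cumulative normalized distance from the first topic. This is a sketch under stated assumptions: the source does not specify the spectrum or the exact interpolation, so the cumulative-distance scheme, the hue range, and the function name are all illustrative choices.

```python
def topic_hues(adjacent_distances, hue_range=(0.0, 0.8)):
    """Assign a hue to each topic in display order.

    adjacent_distances[i] is the distance between topic i and topic i + 1,
    so the result contains len(adjacent_distances) + 1 hues. Topics that are
    close to their neighbor receive nearly the same hue.
    """
    low, high = hue_range
    total = sum(adjacent_distances)
    hues, travelled = [low], 0.0
    for distance in adjacent_distances:
        travelled += distance
        fraction = travelled / total if total > 0 else 0.0
        hues.append(low + (high - low) * fraction)
    return hues

# Topics 0 and 1 are close; topic 2 is far from topic 1 (illustrative values).
hues = topic_hues([1.0, 3.0])
```

The resulting values could be fed to a hue-based color conversion such as the standard library's `colorsys.hsv_to_rgb`; that final step is left out here.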
In summary, the temporal view 22 provides a visual summary of how the topics of the document collection 16 evolve over time. Beyond this representation, various interactions are supported within the temporal view 22. Selecting a time frame (one vertical time unit) filters for all documents published within the selected time frame. Similarly, clicking on the intersection of a topic band and a time frame in the temporal view 22 selects, for example, the documents published during that time frame that have a probability greater than 30% on the selected topic. It is therefore possible to identify which documents contributed to a topic during a particular time period. The temporal view 22 adds richness by revealing the temporal information hidden in the document collection 16 and by allowing users to filter by time and topic.
The temporal view 22 is generated in part by a temporal topic trend calculation module 38, which is operable to perform the functions described above and to support inspection of detailed documents. The temporal view 22 enables users to directly select, for example, the documents within a particular time range and retrieve the corresponding data. By revealing the document details associated with such depictions, the temporal view 22 plays a key role in showing users the basis for the visual patterns and trends they identify.
When any document is selected, the tool 10 of the present invention provides the details of the actual text content of the document of interest. Because any topic model is far from perfect, the detail view serves two functions: first, it gives users context for developing a deeper understanding of the topics and of the keywords associated with them; second, it helps users validate the patterns shown in the visualizations.
Because understanding a large text corpus 16 involves using all four views, the coordination among the views requires careful design. At the topic level, hovering over a topic in any view that involves a topic representation highlights the same topic in the other views. For example, if the user hovers over an axis in the document distribution view 18, the same topic is highlighted in both the topic cloud view 20 and the temporal view 22. Users can therefore quickly combine information about the keywords, document distribution, and temporal trend of a particular topic. The views are also coordinated through color or pattern, with each topic having the same color or pattern in all views.
At the document level, selecting any set of documents in a view that contains individual documents highlights the same set of documents in the other views. For example, a brushing operation in the document scatterplot view 24 is immediately reflected in the document distribution view 18, and vice versa. When the user selects, in the document scatterplot view 24, several documents with two prominent topics (i.e., in the middle range), viewing the distributions of those documents helps the user understand their topic combinations.
With respect to time, filtering of the documents written or published within a particular time period is supported. For example, clicking on a time frame (i.e., one vertical time unit) in the temporal view 22 filters for all documents published within the selected time span. Similarly, clicking on the intersection of a topic band and a time frame in the temporal view 22 selects the documents published during that time period to which the selected topic makes a major contribution (e.g., a probability greater than 30%). The selection is shown in both the document distribution view 18 and the document scatterplot view 24. This function allows users to filter documents by time and topic of interest and then examine the documents published within the selected time frame.
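The two click interactions described above can be sketched as a single filter. The record layout (dictionaries with `id`, `frame`, and `topics` keys) and the function name are hypothetical; only the 30% example threshold comes from the text.

```python
def select_documents(docs, frame, topic=None, threshold=0.3):
    """Return ids of documents in a time frame, optionally restricted by topic.

    With topic=None this models a click on a time frame. With a topic index it
    models a click on a band/time-frame intersection, keeping only documents
    whose probability on that topic exceeds the threshold.
    """
    selected = []
    for doc in docs:
        if doc["frame"] != frame:
            continue
        if topic is None or doc["topics"][topic] > threshold:
            selected.append(doc["id"])
    return selected

# Toy corpus (illustrative values only).
corpus = [
    {"id": "a", "frame": 2008, "topics": [0.9, 0.1]},
    {"id": "b", "frame": 2008, "topics": [0.2, 0.8]},
    {"id": "c", "frame": 2009, "topics": [0.5, 0.5]},
]
in_2008 = select_documents(corpus, frame=2008)
in_2008_topic0 = select_documents(corpus, frame=2008, topic=0)
```

The returned ids can then drive highlighting in the other views, which is how a selection made in one view would be reflected in the rest.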
The tool 10 of the present invention allows users to explore and query a large document corpus 16 from multiple perspectives. Starting from the topic cloud view 20, users can view a summary of the corpus 16 and identify topics or even keywords of interest. In the document distribution view 18, users can locate a topic of interest and select the documents that focus on it by brushing on the vertical axis. Users can then visually identify which other topics the selected set of documents relates to by viewing the distributions in the document distribution view 18 and the document scatterplot view 24. In addition, users can always inspect the details of the selected documents. For example, a user who wants to identify interdisciplinary or multidisciplinary publications in the corpus 16 can do so in the document scatterplot view 24 by selecting the documents from the middle to the lower-right corner. Furthermore, a user who is interested in querying the corpus 16 by time can make a selection in the temporal view 22 by clicking on a time frame or on the intersection of a particular time frame and topic. In summary, the tool 10 of the present invention employs multiple coordinated views to support interactive exploration of the text corpus 16. Each of the views is designed to address one of four important questions.
To evaluate how effectively the tool 10 of the present invention answers the four target questions, the tool 10 was applied to explore and analyze two text corpora: scientific proposals awarded by the National Science Foundation (NSF) from 2006 to 2010, and publications in the IEEE VAST proceedings.
Case study 1. Analyzing scientific proposals. In this case study, we first describe the data we collected. We then characterize the target domain and present a set of tasks summarized from our conversations with NSF program managers. Finally, we show how the tool can assist expert users in carrying out these tasks.
Data collection and preparation. To examine whether the tool can help program managers make funding decisions and manage their award portfolios, we first collected the proposals awarded from 2000 to 2010 by the Information and Intelligent Systems (IIS) division, which is part of the Directorate for Computer and Information Science and Engineering (CISE). The collection consists of nearly 4,000 awards, with structured data on the award number, directorate, division, program, program manager, principal investigator, and award date, as well as the proposal abstracts in the form of unstructured text. We processed all collected abstracts, with each abstract constituting a single document in the corpus. We removed a standard list of stop words, which left a vocabulary of 334,447 words. We then used the LDA model to extract 30 topics from the corpus.
Domain characterization. A core part of NSF's mission is to keep the United States at the frontier of discovery by funding research in traditional academic fields (including identifying broader impacts) and by funding transformative and interdisciplinary research. To achieve the former, NSF program managers need to identify suitable reviewers and panelists to ensure the best possible peer review. To carry out the latter effectively, program managers need to identify emerging areas and research topics in order to fund interdisciplinary and transformative research. In addition to making funding decisions, program managers also need to manage their award portfolios. Although program managers have done this well in the past, they need new methods to help them, because of the rapidly changing nature of science and the significant growth in the number of proposals submitted. To map these high-level goals to actionable items, we designed three tasks related to decision making and the award portfolio. Task 1 concerns grouping new proposal submissions by topic. This task requires understanding the main topics of a text corpus and filtering subsets of documents based on their characteristics with respect to the topics. Task 2 is to identify suitable reviewers for a proposal submission, which also involves knowing whether the submission relates to multiple topics so that the right experts can be assembled. Finally, Task 3 concerns the temporal aspect of the award portfolio, which involves discovering topic trends over time.
Expert assessment. Because current NSF program managers are particularly busy, we invited a former NSF program manager to take part in our expert assessment. The participant had two years of experience working as a program manager at NSF. At the beginning of the assessment, we spent 30 minutes demonstrating the system design and the functionality of each visualization. We then asked the participant to use the tool to perform the following three tasks.
Task 1. Group 200 newly submitted proposals based on their topics. Starting from the topic cloud view, the participant quickly browsed the extracted topics to get an overview of the newly submitted proposals. Because the participant was responsible for proposals in the fields of robotics and computer vision, she quickly focused her attention on these two topics. After selecting the proposals on the robotics topic, the participant quickly scanned the titles in the detailed view to verify their relevance. Although the participant confirmed that each selected proposal was relevant, she also noticed that the proposals were scattered across the document scatterplot view. Because proposals in the lower-right region are more likely to involve two or more topics, the participant wanted to know which other topics these proposals covered. By further filtering, in the document scatterplot view, the proposals that appeared more interdisciplinary, the participant found that they involved other fields such as neuroscience and social communication. When relevant documents are selected in the document distribution view, invoking the detailed view lets the program manager see the previously awarded PIs.
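One way to make "more interdisciplinary" concrete is to score each proposal by the entropy of its topic proportions. This is only an illustrative assumption: the text does not state how positions in the document scatterplot view are computed, and the proposal ids, proportions, and threshold below are hypothetical.

```python
import math

def topic_entropy(proportions):
    """Shannon entropy (in bits) of one document's topic proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def interdisciplinary(docs, min_entropy):
    """Return ids of documents whose topic mix reaches at least min_entropy bits."""
    return [doc_id for doc_id, props in docs.items()
            if topic_entropy(props) >= min_entropy]

# Hypothetical per-proposal topic proportions (each row sums to 1).
docs = {
    "P1": [1.0, 0.0, 0.0],  # single-topic proposal
    "P2": [0.5, 0.5, 0.0],  # evenly split across two topics
    "P3": [0.9, 0.1, 0.0],  # mostly one topic
}
selected = interdisciplinary(docs, min_entropy=0.9)
```

A proposal concentrated on one topic scores near zero, while one split evenly across several topics scores high, so a threshold on this score separates the two groups.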
Task 2. Identify suitable reviewers. To identify reviewers, the participant first wanted to group the proposals roughly. Based on her initial exploration, the participant concluded that there were broadly two groups of proposals: one focusing on the core of the robotics field, and another drawing on bodies of knowledge from other fields such as neuroscience and social communication. To identify reviewers for the two groups, the participant wanted to find PIs from previously awarded proposals. Examining the historical data, the participant located the robotics topic in the document distribution view. She then brushed the top range of the axis to select the proposals most related to that topic. Finally, the participant turned to the detailed view to see the previously awarded PIs in the robotics field. For the interdisciplinary proposals in the second group, the participant went through a similar process to identify additional experts from other related fields (for example, neuroscience) to serve on the review panel, ensuring the best possible peer review.
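The brushing interaction in Task 2 amounts to a range filter on one topic axis followed by a lookup of the PIs of the selected awards. The sketch below assumes a hypothetical `Award` record with per-topic weights; the field names, weights, and cutoff are illustrative and are not taken from the system.

```python
from dataclasses import dataclass

@dataclass
class Award:
    award_id: str
    pi: str
    topic_weights: dict  # topic name -> proportion in [0, 1]

def brush(awards, topic, low, high=1.0):
    """Keep awards whose weight on `topic` falls inside the brushed range."""
    return [a for a in awards if low <= a.topic_weights.get(topic, 0.0) <= high]

def candidate_reviewers(awards, topic, low):
    """List the distinct PIs of the brushed awards, in first-seen order."""
    seen = []
    for a in brush(awards, topic, low):
        if a.pi not in seen:
            seen.append(a.pi)
    return seen

# Hypothetical historical awards.
awards = [
    Award("A1", "PI-A", {"robotics": 0.8}),
    Award("A2", "PI-B", {"robotics": 0.3, "neuroscience": 0.6}),
    Award("A3", "PI-A", {"robotics": 0.7}),
    Award("A4", "PI-C", {"neuroscience": 0.9}),
]
robotics_reviewers = candidate_reviewers(awards, "robotics", low=0.6)
neuro_reviewers = candidate_reviewers(awards, "neuroscience", low=0.6)
```

Running the same filter on a second topic, as in the last line, mirrors how the participant repeated the process for the interdisciplinary group.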
Task 3. Analyze temporal trends in the award portfolio. At the portfolio level, the former program manager was interested in the recent temporal trends of the area she was responsible for. By exploring the temporal view, the participant found that the trend of awarded proposals in robotics was stable, even though the overall number of awarded proposals increased between 2006 and 2009. In contrast to the stable trend in robotics, the number of proposals awarded on the topic of "using technology to help people with disabilities" grew year over year. The former program manager commented that this view was valuable to her because it let her see funding trends across different topics that would otherwise be difficult to discover.
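The temporal view in Task 3 can be approximated by counting awards per topic per year. The sketch below is a simplification under stated assumptions: an award is attributed to every topic whose weight meets a hypothetical threshold, and the sample awards are invented for illustration.

```python
from collections import defaultdict

def awards_per_year(awards, threshold=0.5):
    """Count awards per (topic, year); an award counts toward every topic
    whose weight is at least `threshold`."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, weights in awards:
        for topic, weight in weights.items():
            if weight >= threshold:
                counts[topic][year] += 1
    return counts

def strictly_increasing(counts_by_year):
    """True if the yearly counts rise every year, taken in chronological order."""
    values = [counts_by_year[y] for y in sorted(counts_by_year)]
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical (award year, topic weights) pairs.
awards = [
    (2006, {"robotics": 0.9}),
    (2006, {"robotics": 0.8}),
    (2006, {"assistive": 0.7}),
    (2007, {"robotics": 0.9}),
    (2007, {"robotics": 0.6}),
    (2007, {"assistive": 0.7}),
    (2007, {"assistive": 0.9}),
]
counts = awards_per_year(awards)
```

Here the robotics counts stay flat while the assistive-technology counts rise, which is the kind of contrast the participant observed.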
Overall, the participant felt that each view in the tool was well designed and had a clear purpose. She commented that the tool could facilitate the workflow of program managers. In particular, she liked the fact that the tool can automatically suggest proposals that are more interdisciplinary, because this is difficult to judge by traditional means. She also liked the coordination between views, which helped her quickly synthesize information from different aspects of the same corpus.
Case study 2. Analyzing the VAST conference proceedings. As the field of visual analytics matures, it is instructive to look back at how the field has evolved. One way to address this question is to analyze the publications accepted by the most important venue in visual analytics. In this case study, we recruited four researchers to explore the papers published at the VAST conference/symposium since the field began in 2006. Because all users were familiar with the field of visual analytics, we wanted to encourage free exploration rather than the well-structured tasks used in the preceding expert assessment. After the evaluation, we grouped the participants' findings into two categories: discovering causal relationships between the temporal evolution of topics and funding sources, and learning about interesting subfields within visual analytics.
Data collection and preparation. We first collected all papers published at the VAST conference/symposium from 2006 to 2010, for a total of 123 publications. We then parsed each publication into fields including title, authors, year of publication, abstract, body, and references. We performed topic modeling on the entire body of each article (from introduction to conclusion), with each article constituting one document in the corpus. Removing the standard stop words left us with a vocabulary of 317,315 words. Based on our records of the different tracks at each VAST conference, we extracted 19 topics from the corpus.
User evaluation. Of the four researchers we recruited, two were senior researchers in the field of visual analytics, and the other two were PhD students whose main research interest was visual analytics. In this evaluation, we gave all participants high-level tasks and encouraged freer exploration. After introducing the tool, we asked each participant to identify the core topics in the field and how the field had evolved over the past five years. We roughly categorized the usage patterns into two groups: identifying rising and declining topics, and using the system as an educational tool.
Identifying rising/declining topics. After scanning all the topics in the topic cloud view, one senior researcher commented that the topics matched the paper tracks of the VAST conference well. When looking at the temporal trend of each topic, the participant noticed several clear rising and declining patterns. For example, the topic on video news analysis attracted a lot of attention at first, but the attention declined quickly year after year. He also noticed a similar trend for the topic on network traffic monitoring and analysis. Relating this pattern to his own knowledge, the participant explained that when the field began, these areas of focus were steered by the Department of Homeland Security (DHS), the main funding source at that time. Next, the participant turned to the rising patterns, which indicate topics that have attracted growing interest in recent years. Specifically, since 2008, both the topic on trend and uncertainty analysis and the topic on dimension analysis and reduction have attracted more attention. Again relating the patterns to his own knowledge, the participant commented that this was likely a result of the Foundations of Data and Visual Analytics (FODAVA) program jointly introduced by NSF and DHS.
Learning about the field of visual analytics. Another senior researcher, who was teaching a visual analytics course at the time, commented that he could see the tool being useful for his course. Students could explore all VAST publications and identify papers related to topics of interest for use in course presentations. Similarly, another participant wanted to see what had been done on text analysis within visual analytics. He first located the topic and then selected the publications that ranked highly on that topic in the document distribution view. He quickly scanned the paper titles in the detailed view and verified that all the selected papers matched his interests. He also noticed that some of the papers in the selection appeared to be related to other topics, such as entity extraction and database querying. After the study, he asked for a screen capture of the detailed view so that he could look up the papers he had identified during the study.
Overall, the participants found that the tool helped them explore the evolution of the visual analytics field and identify publications for further investigation based on their own interests.
Those skilled in the art will understand that the various modules and processes of the present invention are implemented using processing devices such as computers. Such processing devices may include one or more general-purpose or special-purpose processors, such as microprocessors, digital signal processors, custom processors, and field-programmable gate arrays (FPGAs), together with unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the methods and systems of the present invention. Alternatively, some or all of the functions may be implemented by a state machine that has no stored program instructions, or in one or more application-specific integrated circuits (ASICs), in which each function or some combination of functions is implemented as custom logic. Of course, a combination of the above approaches may be used. Furthermore, some example embodiments may be implemented via a non-transitory computer-readable storage medium having computer-readable code stored thereon for programming a computer, server, appliance, device, or the like, each of which may include a processor to perform the functions described and claimed herein. Examples of such computer-readable storage media include, but are not limited to: hard disks, optical storage devices, magnetic storage devices, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, and the like. When stored in a non-transitory computer-readable medium, the software may include instructions executable by a processor that, in response to such execution, cause the processor and/or any other circuitry to perform a set of operations, steps, methods, processes, algorithms, and so on.
Again, the present invention enables companies, including their analysts, marketers, business unit leaders, information technology staff, and C-level executives, to gain actionable insights from any type of text data. The technology allows people to enhance their decision-making processes on a data-driven basis. The technology ingests text data and, through deep computational and statistical algorithms, identifies the themes, topics, and emerging issues within each data set. The results are displayed in an interactive visual format, so that anyone in the company can analyze the data holistically or at a fine-grained level. All types of text data can be analyzed: internal data (for example, emails, chats, surveys, call center records, and focus groups) or external data (for example, social media, review sites, forums, and news sites). The technology can process a large number of languages, ensuring that feedback loops from all over the world can be analyzed. Highly customizable features can optionally be used to fine-tune the analysis results. Most companies are sitting on a treasure trove of unstructured text data but have little ability to mine it for intelligence.
In general, the software of the present invention delivers deep-learning-based data analysis in a sophisticated visualization platform that uncovers, analyzes, and infers actionable strategies across a broad range of business decision domains. It connects call center audio, email, news, social media, chat, transaction data, customer feedback, and analytics in a way that helps discover connections within the data that affect sales, customer service, operations, and risk analysis stakeholders. Structured data is also used, including retail transactions, survey data, personal profiles, and the like, as well as national and international industry, government, and product-specific data sources. The software is accessible from any browser-equipped device and integrates predictive modeling, artificial intelligence, and statistical NLP to analyze unstructured data of any type. Visualizations are provided holistically and/or at a fine-grained level. The overall system 40 is shown schematically in FIG. 7. The system 40 uses a high-throughput multilingual API for information tagging using complex term extraction, entity indicator extraction, geospatial indicator extraction, temporal indicator extraction, and opinion sentiment analysis. The system 40 also uses data-driven semantic machine learning and clustering, using automatic term association, statistical topic summarization, influencer interference, context-aware content ranking, content network association, and product-centric analysis.
Referring now specifically to FIGS. 8 and 9, in one example embodiment, the present invention provides an augmented intelligence platform 45 that helps companies find the shortest path from data to revenue. It brings fragmented data silos together, creates a unified visual analytics layer on top of them, and enables users from multiple business functions to extract valuable insights efficiently and collaboratively. The platform 45 sits securely on top of an organization's data lake and is compatible with multiple levels of the data infrastructure. Through deep computational and statistical algorithms, it automatically ingests unstructured data (for example, emails and call logs) as well as structured data (for example, sales, budget, and financial data). It processes tens of millions of feedback points and data points in real time and identifies the themes, topics, and emerging issues within the organization. It helps dynamically correlate customer experience trends with all company data. The platform 45 is fully interactive and easy to use. Anyone in the organization, from front-line employees, analysts, and salespeople to business unit leaders and C-level executives, can interact with the data holistically or at a fine-grained level, customize their own dashboards, and share findings with others. In addition to the back-end data analysis engine, the platform 45 is supported by a fully enhanced user interface (UI) experience. The present invention provides users with pixel-perfect dashboards with customizable visualizations, which makes presenting a user's analysis work much easier and more controllable. Rich interactions in the exploration layer allow users to quickly start analyzing details while keeping the contextual information around them. The flexible data analysis environment ensures that users never lose the general-level connection with the data while diving into the details. This goes beyond just a few visualizations and extends the user experience to a variety of useful data analyses and visualizations. Annotating and collaborating on analysis results is easier than ever. The present invention changes the way people can find, share, and collaborate on analysis tasks. Users can annotate their findings and share them with colleagues, supporting collaboration within and beyond each data analysis group. In summary, the present invention enhances decision-making by providing an immersive environment for data analysis.
FIG. 10 is a schematic diagram illustrating another example embodiment of the unstructured data analysis system 50 of the present invention. In general, data closely tied to a business enterprise, such as customer experience data 52, telecommunications data 54, email data 56, social media data 58, and other data 60, is aggregated in a data repository 62, and external data sources 64, such as Internet data and government data, are pulled into an unstructured data analysis algorithm 66 that resides, for example, on a web server and can be accessed via a browser. As described in detail above, the unstructured data analysis algorithm 66 applies predictive modeling, artificial intelligence, and statistical NLP to the data to uncover, analyze, infer, and visualize actionable information. Advantageously, the actionable information can be viewed by various business units 68, stakeholders, or other users, all of whom can add to or otherwise modify the visualizations and share the results via a common interactive user interface 70.
FIG. 11 is a schematic diagram illustrating one example embodiment of the presentation layer 80 of the unstructured data analysis system 50 (FIG. 8) of the present invention. In general, the presentation layer 80 allows various summary information about the unstructured data and/or the results to be displayed. For example, the presentation layer 80 is shown displaying customer experience data 82, telecommunications data 84, and sales data 86.
FIG. 12 is a schematic diagram illustrating one example embodiment of the exploration layer 90 of the unstructured data analysis system 50 (FIG. 8) of the present invention. In general, the exploration layer 90 allows various summary information about the unstructured data and/or the results to be displayed. The exploration layer 90 also allows a time granularity to be selected and the data to be displayed in further detail. This "drill-down" also updates the other visualizations accordingly, including the presentation layer 80. For example, a snapshot 94 is shown selected from the customer experience data 92.
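Selecting a time granularity and drilling down can be pictured as re-bucketing the same timestamped records at a finer level. The following is a minimal sketch under that assumption; the function names, granularity labels, and dates are hypothetical, and the source does not describe how the exploration layer is implemented.

```python
from collections import Counter
from datetime import date

# Supported granularities and the date format used as the bucket key.
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}

def bucket_counts(dates, granularity):
    """Count records per time bucket at the selected granularity."""
    fmt = FORMATS[granularity]
    return Counter(d.strftime(fmt) for d in dates)

def drill_down(dates, year):
    """Restrict to one year and show its records month by month."""
    return bucket_counts([d for d in dates if d.year == year], "month")

# Hypothetical timestamps of customer-experience records.
records = [date(2015, 12, 30), date(2016, 1, 5), date(2016, 1, 20), date(2016, 3, 2)]
by_year = bucket_counts(records, "year")
detail_2016 = drill_down(records, 2016)
```

Because both views are derived from the same records, a selection made at the finer level can be used to refresh the coarser summaries as well.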
FIG. 13 is a schematic diagram illustrating one example embodiment of the annotation layer 100 of the unstructured data analysis system 50 (FIG. 8) of the present invention. The annotation layer 100 is configured to display various results, as well as customer experience data 102, telecommunications data 104, email data 106, social media data 108, other data 110, and the like, and to receive user annotations 112, which can be accessed by all users or by selected users via a shared user interface 114.
Although the present invention has been illustrated and described herein with reference to preferred embodiments and specific examples thereof, those skilled in the art will readily appreciate that other embodiments and examples may perform similar functions and/or achieve similar results. It is therefore to be understood that all such equivalent embodiments and examples are within the spirit and scope of the present invention and are intended to be covered by the appended claims.
Claims (18)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202011265115.5A CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method | 
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| US201562159662P | 2015-05-11 | 2015-05-11 | |
| US15/151,572 | 2016-05-11 | ||
| US15/151,572 US10452698B2 (en) | 2015-05-11 | 2016-05-11 | Unstructured data analytics systems and methods | 
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202011265115.5A Division CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN107368506A (en) | 2017-11-21 | 
| CN107368506B (en) | 2020-11-06 | 
Family
ID=60312579
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202011265115.5A Pending CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method | 
| CN201610496280.9A Active CN107368506B (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method | 
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202011265115.5A Pending CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (2) | CN112732878A (en) | 
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN108170657A (en) * | 2018-01-04 | 2018-06-15 | 陆丽娜 | A kind of natural language long text generation method | 
| CN109299286A (en) * | 2018-09-28 | 2019-02-01 | 北京赛博贝斯数据科技有限责任公司 | The Knowledge Discovery Method and system of unstructured data | 
| CN110413782A (en) * | 2019-07-23 | 2019-11-05 | 杭州城市大数据运营有限公司 | A kind of table automatic theme classification method, device, computer equipment and storage medium | 
| CN112883186A (en) * | 2019-11-29 | 2021-06-01 | 智慧芽信息科技(苏州)有限公司 | Method, system, equipment and storage medium for generating information map | 
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN101308498A (en) * | 2008-07-03 | 2008-11-19 | 上海交通大学 | Text Collection Visualization System | 
| CN102750355A (en) * | 2012-06-11 | 2012-10-24 | 清华大学 | Visual management method for non-structured data management system | 
| CN102929894A (en) * | 2011-08-12 | 2013-02-13 | 中国人民解放军总参谋部第五十七研究所 | Online clustering visualization method of text | 
| US20140040275A1 (en) * | 2010-02-09 | 2014-02-06 | Siemens Corporation | Semantic search tool for document tagging, indexing and search | 
| US9135242B1 (en) * | 2011-10-10 | 2015-09-15 | The University Of North Carolina At Charlotte | Methods and systems for the analysis of large text corpora | 
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| WO2003005235A1 (en) * | 2001-07-04 | 2003-01-16 | Cogisum Intermedia Ag | Category based, extensible and interactive system for document retrieval | 
| US7849048B2 (en) * | 2005-07-05 | 2010-12-07 | Clarabridge, Inc. | System and method of making unstructured data available to structured data analysis tools | 
| KR101481253B1 (en) * | 2013-03-14 | 2015-01-13 | 한국과학기술원 | Method and system for providing summery of text document using word cloud | 
| CN103473369A (en) * | 2013-09-27 | 2013-12-25 | 清华大学 | Semantic-based information acquisition method and semantic-based information acquisition system | 
| US20160071212A1 (en) * | 2014-09-09 | 2016-03-10 | Perry H. Beaumont | Structured and unstructured data processing method to create and implement investment strategies | 
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN101308498A (en) * | 2008-07-03 | 2008-11-19 | 上海交通大学 | Text Collection Visualization System | 
| US20140040275A1 (en) * | 2010-02-09 | 2014-02-06 | Siemens Corporation | Semantic search tool for document tagging, indexing and search | 
| CN102929894A (en) * | 2011-08-12 | 2013-02-13 | 中国人民解放军总参谋部第五十七研究所 | Online clustering visualization method of text | 
| US9135242B1 (en) * | 2011-10-10 | 2015-09-15 | The University Of North Carolina At Charlotte | Methods and systems for the analysis of large text corpora | 
| CN102750355A (en) * | 2012-06-11 | 2012-10-24 | 清华大学 | Visual management method for non-structured data management system | 
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN108170657A (en) * | 2018-01-04 | 2018-06-15 | 陆丽娜 | A kind of natural language long text generation method | 
| CN109299286A (en) * | 2018-09-28 | 2019-02-01 | 北京赛博贝斯数据科技有限责任公司 | The Knowledge Discovery Method and system of unstructured data | 
| CN110413782A (en) * | 2019-07-23 | 2019-11-05 | 杭州城市大数据运营有限公司 | A kind of table automatic theme classification method, device, computer equipment and storage medium | 
| CN110413782B (en) * | 2019-07-23 | 2022-08-26 | 杭州城市大数据运营有限公司 | Automatic table theme classification method and device, computer equipment and storage medium | 
| CN112883186A (en) * | 2019-11-29 | 2021-06-01 | 智慧芽信息科技(苏州)有限公司 | Method, system, equipment and storage medium for generating information map | 
| CN112883186B (en) * | 2019-11-29 | 2024-04-12 | 智慧芽信息科技(苏州)有限公司 | Method, system, equipment and storage medium for generating information map | 
Also Published As
| Publication number | Publication date | 
|---|---|
| CN112732878A (en) | 2021-04-30 | 
| CN107368506B (en) | 2020-11-06 | 
Similar Documents
| Publication | Publication Date | Title | 
|---|---|---|
| US11003864B2 (en) | Artificial intelligence optimized unstructured data analytics systems and methods | |
| US10452698B2 (en) | Unstructured data analytics systems and methods | |
| US9135242B1 (en) | Methods and systems for the analysis of large text corpora | |
| Isenberg et al. | Visualization as seen through its research paper keywords | |
| Dou et al. | Paralleltopics: A probabilistic approach to exploring document collections | |
| Sun et al. | EvoRiver: Visual analysis of topic coopetition on social media | |
| Brehmer et al. | A multi-level typology of abstract visualization tasks | |
| US8296666B2 (en) | System and method for interactive visual representation of information content and relationships using layout and gestures | |
| Fu et al. | T-cal: Understanding team conversational data with calendar-based visualization | |
| Xu et al. | Chart Constellations: Effective Chart Summarization for Collaborative and Multi‐User Analyses | |
| Perry et al. | VizDeck: Streamlining exploratory visual analytics of scientific data | |
| Verbert et al. | Agents vs. users: visual recommendation of research talks with multiple dimension of relevance | |
| Kodagoda et al. | Using interactive visual reasoning to support sense-making: Implications for design | |
| CN107368506B (en) | Unstructured data analysis system and method | |
| Bier et al. | Principles and tools for collaborative entity-based intelligence analysis | |
| Nikiforova et al. | Mapping of Source and Target Data for Application to Machine Learning Driven Discovery of IS Usability Problems. | |
| Greitzer et al. | Cognitive foundations for visual analytics | |
| Kim et al. | Visualization support for multi-criteria decision making in software issue propagation | |
| Boumaiza | A survey on sentiment analysis and visualization | |
| Verspoor et al. | Commviz: Visualization of semantic patterns in large social communication networks | |
| Basole | Visual analytics for innovation and R&D intelligence | |
| Toic et al. | Analysis of selected business intelligence data visualization tools | |
| Lemieux | Using information visualization and visual analytics to achieve a more sustainable future for archives: A survey and critical analysis of some developments | |
| Ghahramani et al. | Visualisation for social media analytics: landscape of R packages | |
| Nguyen | Visualization of analytic provenance for sensemaking | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||