CN104536956A - A Microblog platform based event visualization method and system - Google Patents
- Publication number
- CN104536956A (application number CN201410354273.6A)
- Authority
- CN
- China
- Prior art keywords
- event
- word
- microblog
- microblogs
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical field
The invention relates to information extraction and visualization technology, and in particular to an event visualization method and system based on a microblog platform.
Background art
With the rapid development of the Internet, a variety of social media have emerged in recent years, such as Facebook, Twitter, Sina Weibo, and Renren. Microblog platforms, represented by Twitter and Sina Weibo, have become popular Internet applications thanks to their open information sharing and dissemination.
A microblog (short for micro-blog) lets users publish text of up to 140 characters, together with pictures, videos, and other content, anytime and anywhere on the platform. Microblogs are original, timely, fragmented, and repetitive. On a microblog platform, users can search for topics they are interested in, browse the related content, and join the discussion. However, the platform is flooded with microblogs about any given event, and the short-text nature of microblogs makes the published information fragmented and hard to understand. Uneven quality of published information is a prominent phenomenon on microblog platforms. For these reasons it is difficult for users to quickly follow how an event develops, which degrades the user experience.
Existing microblog event visualization techniques generally just sort the event-related microblogs by time and show the user the most recent ones, or sort them by popularity and show the most popular ones; other methods first select the microblogs within a given time range and then display them sorted by time or popularity. All of these methods display the original microblog content directly and have several shortcomings. First, because the volume of online information is growing explosively, directly displaying the original microblogs makes it hard for users to quickly obtain the information related to an event. Second, because microblogs are short texts of uneven quality written in colloquial language, users find it hard to understand them quickly, and mining important information about an event from microblog text is like looking for a needle in a haystack.
One class of event visualization methods extracts keywords from all the text of an event and displays them in a single word cloud. This lets microblog users learn the main topics of the event from the main keywords, but it does not give them an intuitive view of the individual sub-events or of how the event develops and evolves.
Other visualization methods extract the people, places, and event summary sentences from an event, use them as the nodes of the event's development, and use the relationships among them as edges to display the event visually. This approach has significant limitations for microblog events: unlike formal news reports, microblogs do not contain well-formed information about people, places, and organizations, so such information is difficult to extract from them.
The invention patent "Microblog word cloud generation method and access support system based on user interest mining" discloses a microblog word cloud generation method and a microblog message access support system based on user interest mining. The method comprises: given the set of microblog messages newly published by the users whom the currently logged-in user follows, extracting a keyword set from it; computing the logged-in user's interest in each keyword of the set based on user relationships and on keyword similarity respectively, and fusing the two interest values into a final interest value; selecting the k keywords with the highest interest from the keyword set; and displaying the selected k keywords in one area. The system comprises key modules such as a user information acquisition module and a word cloud generator. That invention enables users to obtain the information they are interested in from microblog messages more efficiently. The present invention differs from it in two respects. Research object: that invention takes microblog users as its research object, analyzes their microblog content, and displays a word cloud of the extracted keywords, whereas the present invention takes news events as its research object. Visualization: that invention only extracts keywords from microblogs and displays them as a word cloud, whereas the present invention extracts keywords from the sub-events of an event and displays them as a multi-dimensional combined word cloud.
The invention patent "Microblog-based event feature evolution mining method and system" discloses a microblog-based event feature evolution mining method, comprising: selecting an evolution starting document set from the microblog time series, and constructing a graph model of the documents over the microblog document collection based on word co-occurrence features, to obtain the knowledge network structure of the event; merging the microblog graph models according to the literal features of the words and the compatibility of word tendencies, to construct a micro-evolution graph of the event features; and pruning, segmenting, and transforming the micro-evolution graph to form a macro-evolution graph of the event features. In mining the evolution patterns of event features, that method uses graph mining over an event-based knowledge network, which improves knowledge inheritance across the whole mining method and makes the mining results more interpretable. The present invention differs in feature extraction: that invention extracts features mainly from word structure and shows the evolution of an event by constructing a knowledge network structure, whereas the present invention clusters the event and mines the feature information of its subtopics to show the evolution.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes an event visualization method and system based on a microblog platform to solve the above technical problems.
The present invention proposes an event visualization method based on a microblog platform, comprising:
Step 1: according to the keywords and time range of the event, retrieving the microblogs related to the event within the time range through the event search interface of the microblog platform;
Step 2: sorting the microblogs by time to generate a microblog set;
Step 3: applying a clustering algorithm to the microblog set to generate multiple cluster subsets;
Step 4: extracting keywords from the multiple cluster subsets to generate multiple word clouds, and giving each keyword that recurs across the word clouds the same color, position, and rotation;
Step 5: visualizing the event by displaying each cluster subset together with its corresponding word cloud.
In the described event visualization method based on a microblog platform, the following steps precede step 2:
Step 21: filtering out, from the microblogs within the time range, those whose character count is below a threshold;
Step 22: filtering out, from the microblogs within the time range, those whose heat (popularity) is below a threshold;
Step 23: filtering out the non-text content in the microblogs within the time range;
Step 24: filtering out the "user names" in the microblogs within the time range.
In the described event visualization method based on a microblog platform, the heat in step 22 is calculated by the following formula:
where retweets is the number of times the microblog has been reposted, comments is the number of comments on the microblog, and Heat is the heat of the microblog.
In the described event visualization method based on a microblog platform, the specific steps of extracting keywords from each cluster subset and generating the combined word cloud in step 4 comprise:
Step 41: performing word segmentation on each cluster subset to generate a word set;
Step 42: merging the words of the word set using Wikipedia entries and Internet buzzwords, to generate the combined word cloud.
The described event visualization method based on a microblog platform is characterized in that step 4 further comprises: giving the word high transparency according to its inverse document frequency.
The present invention also proposes an event visualization system based on a microblog platform, comprising:
a retrieval module, configured to retrieve, according to the keywords and time range of the event, the microblogs related to the event within the time range through the event search interface of the microblog platform;
a sorting module, configured to sort the microblogs by time to generate a microblog set;
a clustering module, configured to apply a clustering algorithm to the microblog set to generate multiple cluster subsets;
a combined word cloud generation module, configured to extract keywords from the multiple cluster subsets, generate multiple word clouds, and give each keyword that recurs across the word clouds the same color, position, and rotation;
a display module, configured to visualize the event by displaying each cluster subset together with its corresponding word cloud.
The described event visualization system based on a microblog platform further comprises a filtering module, configured to: filter out, from the microblogs within the time range, those whose character count is below a threshold; filter out those whose heat is below a threshold; filter out the non-text content in the microblogs within the time range; and filter out the "user names" in the microblogs within the time range.
In the described event visualization system based on a microblog platform, the heat in the filtering module is calculated by the following formula:
where retweets is the number of times the microblog has been reposted, comments is the number of comments on the microblog, and Heat is the heat of the microblog.
In the described event visualization system based on a microblog platform, the combined word cloud generation module extracts keywords from each cluster subset and generates the combined word cloud by: performing word segmentation on each cluster subset to generate a word set; and merging the words of the word set using Wikipedia entries and Internet buzzwords, to generate the combined word cloud.
In the described event visualization system based on a microblog platform, the display module is further configured to give the word high transparency according to its inverse document frequency.
As can be seen from the above solutions, the advantages of the present invention are:
Relying on the microblog platform and collecting the relevant microblogs by event keywords yields comprehensive microblog information about an event. Microblog filtering yields high-quality, meaningful microblog information. Clustering the event's microblog data set along the time dimension yields cluster subsets that carry time information: each subset can represent one topic of the event, and together the subsets show how the event develops and evolves. Keyword extraction picks representative keywords out of a group of microblogs, and the keywords of a group of events give microblog users an intuitive understanding of the microblog content. Controlling the color and position of identical words across multiple word clouds keeps them highly consistent in the combined word cloud, so that microblog users can easily see the main topic of the whole event and the topics of the individual sub-events, and can conveniently compare and analyze the sub-events.
Description of the drawings
Fig. 1 is a flow chart of the event visualization method based on a microblog platform;
Fig. 2 is a flow chart of the combined word cloud visualization;
Fig. 3 is an example of the visual display of an event.
The reference signs are:
Step 100: the specific steps of the event visualization method based on a microblog platform, comprising:
Steps 101/102/103/104/105/106/107.
Detailed description
Specific embodiments of the present invention are described in detail below with reference to the drawings and examples.
The specific flow of the present invention comprises the following steps, as shown in Fig. 1:
Step 101: simulating login to the microblog platform;
Because the present invention visualizes news events on a microblog platform, the process of a user logging in to the microblog website must be simulated before the event information can be obtained.
To simulate login, a batch of microblog accounts is first registered, and their account information forms a user information table for simulated login. At login time, a request for the login page is sent to the microblog site, and the local table of registered users supplies the parameters the site requires, such as user name, password, and encryption method, to complete the simulated login.
Because the microblog platform limits the number of accesses a user may make within a given period, and overly frequent access may get the account blocked, once a logged-in user has visited more than a certain number of pages, another user is selected from the local user information table and logged in instead. In this way the services of the microblog platform can be requested and the required news event data obtained.
Step 102: retrieving the relevant microblogs according to the event keywords;
An event usually consists of two parts, keywords and time. Filtering by a time range makes it possible to obtain the microblogs within the specified range from the microblog platform.
In this step, the event search interface provided by the microblog platform is used to obtain the relevant microblog pages from the event keywords and time range entered by the user.
Step 103: preprocessing the microblog information;
In this step the microblog information is preprocessed to obtain the data set to be analyzed. The processing comprises the following parts:
filtering out short texts from the data set, that is, microblogs whose character count is below a threshold;
filtering out the less influential, unpopular microblogs from the data set (that is, microblogs whose heat is below a threshold), where the heat of a microblog is calculated by the following formula:
where retweets is the number of times the microblog has been reposted and comments is the number of comments on the microblog;
filtering out non-text content in the microblogs, such as emoticons and web links;
filtering out the "user names" that are specific to microblogs;
sorting the microblogs by their time information to obtain a set of microblogs that is continuous in time.
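The preprocessing of step 103 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `Microblog` record, the regular expressions for links, bracketed emoticons, and `@` mentions, and the default thresholds are all assumptions, and because the heat formula is not reproduced in the text, the heat function is passed in by the caller rather than defined here.

```python
import re
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Callable, List


@dataclass
class Microblog:
    """Hypothetical record for one post; field names are assumptions."""
    text: str
    created_at: datetime
    retweets: int
    comments: int


# Assumed patterns for the non-text content named in the description.
URL_RE = re.compile(r"https?://\S+")             # web links
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")    # bracketed emoticon codes
MENTION_RE = re.compile(r"@[\w\-]+")             # "@username" mentions


def clean_text(text: str) -> str:
    """Strip links, emoticon codes, and user names, then collapse whitespace."""
    for pattern in (URL_RE, EMOTICON_RE, MENTION_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def preprocess(
    posts: List[Microblog],
    heat: Callable[[Microblog], float],  # supplied by the caller (formula not given)
    min_chars: int = 10,                 # assumed threshold
    min_heat: float = 1.0,               # assumed threshold
) -> List[Microblog]:
    kept = []
    for post in posts:
        if len(post.text) < min_chars:   # drop short texts
            continue
        if heat(post) < min_heat:        # drop unpopular posts
            continue
        text = clean_text(post.text)     # drop non-text content and user names
        if text:
            kept.append(replace(post, text=text))
    # Sort by time so that later clustering sees a time-ordered sequence.
    return sorted(kept, key=lambda p: p.created_at)
```

Passing `heat` as a parameter keeps the sketch independent of any particular combination of repost and comment counts.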
Step 104: clustering the microblog event;
In this step the sorted microblog data set is clustered to obtain cluster subsets that are continuous in time. So that each cluster subset represents one kind of topic, a hierarchical clustering algorithm or a single-pass clustering algorithm is used. So that the clusters remain continuous in time, the first microblog in the data set is taken as the initial cluster subset; at each subsequent step the next document is assigned to the cluster subset most similar to it, and if its similarity to all existing cluster subsets is below the set threshold, it becomes a new cluster subset. Document similarity is measured by the following formula:
where m is the number of documents in the time window preceding the time of document d, and i is the position, within that time window, of the document in cluster c that is closest in time to document d. Computed this way, the closer in time a document is to a cluster, the higher its similarity. To compute document similarity, a vector space model is built over the documents: each document is represented as a vector in the space, each component of the vector corresponds to a word in the document, and the weight of each component is computed by the normalized TF-IDF (term frequency-inverse document frequency), as follows:
where $n_{i,j}$ is the number of occurrences of the word in document $d_j$, and the normalizing term is the sum of the occurrences of all words in that document; $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$.
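The clustering of step 104 can be sketched as below, under stated assumptions. The symbol definitions above match the standard TF-IDF weight, term frequency $n_{i,j} / \sum_k n_{k,j}$ multiplied by $\log(|D| / |\{j : t_i \in d_j\}|)$, so the sketch uses that form with unit-length vectors. The time-window factor involving m and i is not reproduced in the text and is therefore left out: the sketch compares each document to a cluster centroid with plain cosine similarity, which is a swapped-in simplification. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Unit-length TF-IDF vectors for already-segmented, time-ordered documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vec = {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine_to_centroid(vec: Dict[str, float], centroid: Dict[str, float]) -> float:
    """Cosine similarity between a unit-length vector and an unnormalized centroid."""
    norm = math.sqrt(sum(w * w for w in centroid.values()))
    if not norm:
        return 0.0
    return sum(w * centroid.get(t, 0.0) for t, w in vec.items()) / norm


def single_pass(vectors: List[Dict[str, float]], threshold: float) -> List[List[int]]:
    """Single-pass clustering: each document joins its most similar cluster,
    or starts a new one when no similarity reaches the threshold."""
    clusters: List[List[int]] = []          # document indices per cluster
    centroids: List[Dict[str, float]] = []  # summed member vectors per cluster
    for idx, vec in enumerate(vectors):
        best, best_sim = -1, 0.0
        for c, centroid in enumerate(centroids):
            sim = cosine_to_centroid(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= threshold:
            clusters[best].append(idx)
            for t, w in vec.items():
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        else:
            clusters.append([idx])
            centroids.append(dict(vec))
    return clusters
```

Because the input is processed in time order, the first document always seeds the first cluster, as the description requires; a time-decay factor could be multiplied into `cosine_to_centroid` once its exact form is known.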
Step 105: extracting keywords from the sub-event data sets;
Clustering all the relevant microblogs yields sub-data sets that are continuous in time, each representing one subtopic of the event. Extracting microblog keywords from each sub-event yields the candidate keyword sets to be shown in the combined word cloud. Keywords are extracted as follows:
First, each document in the document collection is segmented into words. The present invention uses the ICTCLAS word segmentation tool (Institute of Computing Technology, Chinese Lexical Analysis System; its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, and it supports user dictionaries) to segment the documents and obtain the resulting word set. To enrich the semantic information of the words, two dictionaries, Wikipedia entries and Internet buzzwords, are used to merge words of the original word set into phrases, giving a word set with richer meaning. The phrase merging applies a maximum-matching algorithm to the original word set, and the weight of each word is measured as follows:
$w_i = tf_i \times df_i \times Heat \times |w_i|$
where $tf_i$ is the frequency with which word i occurs in the document, $df_i$ is the number of documents in the document collection that contain word i, Heat is the heat of the microblog, and $|w_i|$ is the length of word i, that is, its number of characters.
To highlight the words in popular microblogs, the present invention combines the weight of each word with the heat of the microblog, multiplying the heat by the word frequency to form the word weight, so that the selected words are more meaningful. Because long words carry richer semantic information than short words, the present invention also introduces the word length term, which increases the relative weight of long words.
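The phrase merging and word weighting of step 105 can be sketched as follows. The sketch assumes the words have already been segmented (ICTCLAS itself is not called here), uses a hypothetical in-memory `lexicon` set in place of the Wikipedia and buzzword dictionaries, and reads "maximum matching" as greedy forward longest match. The text does not say how $tf_i$ and Heat are aggregated over a cluster, so summing the heat-weighted term frequency over the microblogs of the cluster is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def merge_phrases(tokens: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching: join up to max_len adjacent tokens
    when their concatenation is an entry of the (assumed) lexicon."""
    merged = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])  # no separator, as in Chinese text
            if candidate in lexicon:
                merged.append(candidate)
                i += n
                break
        else:
            merged.append(tokens[i])  # no phrase starts here; keep the token
            i += 1
    return merged


def keyword_weights(cluster: Sequence[Tuple[List[str], float]]) -> Dict[str, float]:
    """Weight each word by tf x df x Heat x length over one cluster.

    `cluster` holds (tokens, heat) pairs, one per microblog.
    """
    heat_weighted_tf: Dict[str, float] = defaultdict(float)
    df: Dict[str, int] = defaultdict(int)
    for tokens, heat in cluster:
        if not tokens:
            continue
        for word in set(tokens):
            df[word] += 1
            # Assumption: sum tf * Heat over the microblogs containing the word.
            heat_weighted_tf[word] += tokens.count(word) / len(tokens) * heat
    return {w: heat_weighted_tf[w] * df[w] * len(w) for w in df}
```

The highest-weighted words of each cluster would then form the candidate keyword set for that sub-event's word cloud.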
Step 106: visualizing the event with a combined word cloud;
A simple way to generate a combined word cloud is to use tag cloud techniques and generate one word cloud for each sub-event of the event. The word clouds produced this way do not visualize well, however, because even when two topics discuss very similar content, their word clouds look very different. When displaying the combined word cloud, the present invention therefore optimizes the generated word clouds as follows, as shown in Fig. 2:
Words that appear in multiple word clouds are given the same color, position, and rotation, so that their attributes stay consistent in the visualization and readers can quickly scan for what the topics have in common.
The transparency of a word is controlled by its idf (inverse document frequency): words that occur in multiple word clouds are given higher transparency, and words with low document frequency are given lower transparency. This highlights the words unique to each word cloud and fades the high-frequency words that occur across documents, so that readers quickly grasp what each topic is about.
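One way to keep shared words consistent across word clouds is to compute a single style per word and reuse it in every cloud, as in the sketch below. The hash-derived color, position, and rotation, the 0 to 0.8 transparency range, and the function name are all assumptions made for illustration; the text only requires that the attributes be identical across clouds and that transparency follow the idf. A real layout would also have to resolve overlaps between words, which this sketch ignores.

```python
import hashlib
import math
from typing import Dict, List


def shared_styles(clouds: List[List[str]], max_transparency: float = 0.8) -> Dict[str, dict]:
    """Return one style per word, to be reused in every cloud containing it."""
    n = len(clouds)
    df: Dict[str, int] = {}
    for words in clouds:
        for word in set(words):
            df[word] = df.get(word, 0) + 1
    max_idf = math.log(n) if n > 1 else 0.0
    styles = {}
    for word, count in df.items():
        # A stable hash makes the style depend only on the word itself.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        idf = math.log(n / count)
        styles[word] = {
            "color": "#%02x%02x%02x" % (digest[0], digest[1], digest[2]),
            "position": (digest[3] / 255, digest[4] / 255),  # relative x, y in [0, 1]
            "rotation": 90 if digest[5] % 2 else 0,
            # Low idf (word shared by many clouds) gives high transparency.
            "transparency": max_transparency * (1 - idf / max_idf) if max_idf else 0.0,
        }
    return styles
```

A renderer would look up `styles[word]` when drawing each cloud, so a word such as a shared place name appears faded and in the same spot everywhere, while cluster-specific words stay opaque.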
Step 107: displaying the event visually;
With the time nodes on the vertical axis (the average time of all events in each cluster is taken as its time node), the cluster subsets are displayed as text together with the combined word cloud. This shows how the particular event evolved and lets readers quickly grasp the topic of the event while also seeing the details of each sub-event.
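Computing the time nodes of step 107 amounts to averaging timestamps per cluster and ordering the clusters by that average. The sketch below assumes each cluster is given as a list of the posting times of its microblogs; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def time_node(timestamps: Sequence[datetime]) -> datetime:
    """Average time of a cluster, computed as an offset from its earliest item."""
    base = min(timestamps)
    total = sum((t - base for t in timestamps), timedelta())
    return base + total / len(timestamps)


def order_clusters(clusters: List[List[datetime]]) -> List[int]:
    """Indices of the clusters sorted by time node, i.e. their order on the axis."""
    nodes = [time_node(c) for c in clusters]
    return sorted(range(len(clusters)), key=lambda k: nodes[k])
```

Averaging offsets from the earliest timestamp avoids converting to epoch seconds, so the result does not depend on the local time zone.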
Fig. 3 shows one embodiment of the event visualization. A time axis runs through the whole figure, and the dots on its left represent the time nodes. To the right of the time axis are two columns of display boxes: one shows the microblog clustering result of each sub-event, and the other shows the word cloud of each sub-event. The embodiment takes the event "Shenzhen rainstorm" as an example and visualizes three of its sub-events: first, the rainstorm flooded the roads and made travel inconvenient; second, the rainstorm could not stop Shenzhen residents from buying houses; third, the rainstorm warning was lowered from red to yellow. The figure first shows the overall trend of the event over a period of time. Looking at the three word clouds, words such as "Shenzhen" and "rainstorm" appear in all three, which indicates that the sub-events share common topic features; because these words have a high document frequency, they are given higher transparency.
The figure also shows the representative words of each word cloud, which usually occur frequently in their own data subset and rarely or never in the others: for example, "flooding" and "travel" in word cloud 1, "house buying" and "sales launch" in word cloud 2, and "citywide" and "warning" in word cloud 3. Because of their low document frequency, these words have lower transparency and stand out more in the word cloud. From these words readers can quickly learn what each topic is mainly about, and by checking whether a word appears at the same position in the other word clouds they can compare the differences between the topics of two documents.
This example reflects the characteristics of the event visualization method provided by the present invention: it helps readers understand quickly and comprehensively the main content of an event and how it evolves over time, and it lets readers quickly see the differences between sub-events by comparing their word clouds.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410354273.6A CN104536956A (en) | 2014-07-23 | 2014-07-23 | A Microblog platform based event visualization method and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN104536956A (en) | 2015-04-22 |
Family
ID=52852484
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201410354273.6A (Pending), CN104536956A (en) | A Microblog platform based event visualization method and system | 2014-07-23 | 2014-07-23 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104536956A (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103324665A (en) * | 2013-05-14 | 2013-09-25 | 亿赞普(北京)科技有限公司 | Hot spot information extraction method and device based on micro-blog |
| US20140019119A1 (en) * | 2012-07-13 | 2014-01-16 | International Business Machines Corporation | Temporal topic segmentation and keyword selection for text visualization |
| CN103631862A (en) * | 2012-11-02 | 2014-03-12 | 中国人民解放军国防科学技术大学 | Event characteristic evolution excavation method and system based on microblogs |
| CN103793481A (en) * | 2014-01-16 | 2014-05-14 | 中国科学院软件研究所 | Microblog word cloud generating method based on user interest mining and accessing supporting system |
Non-Patent Citations (3)
| Title |
|---|
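| 单月光 (Shan Yueguang): "Research and Implementation of Key Technologies for Microblog-Based Online Public Opinion", 《中国优秀硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |
| 邱云飞 (Qiu Yunfei) et al.: "Research on Bursty Topic Detection Methods for Microblogs", 《计算机工程》 (Computer Engineering) * |
| 黄珊珊 (Huang Shanshan): "Design and Implementation of a Microblog Information Aggregation and Visualization System Based on User Behavior", 《中国优秀硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |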
Cited By (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104933129B (en) * | 2015-06-12 | 2019-04-30 | 百度在线网络技术(北京)有限公司 | Event train of thought acquisition methods and system based on microblogging |
| CN104933129A (en) * | 2015-06-12 | 2015-09-23 | 百度在线网络技术(北京)有限公司 | Event context acquisition method and system based on micro-blogs |
| US10324989B2 (en) | 2015-06-12 | 2019-06-18 | Baidu Online Network Technology (Beijing) Co., Ltd | Microblog-based event context acquiring method and system |
| CN106484724A (en) * | 2015-08-31 | 2017-03-08 | 富士通株式会社 | Information processor and information processing method |
| CN105159998A (en) * | 2015-09-08 | 2015-12-16 | 海南大学 | Keyword calculation method based on document clustering |
| CN106528624A (en) * | 2016-09-30 | 2017-03-22 | 财付通支付科技有限公司 | Information display method and device |
| CN106874419A (en) * | 2017-01-22 | 2017-06-20 | 北京航空航天大学 | A kind of real-time focus polymerization of many granularities |
| CN106886576A (en) * | 2017-01-22 | 2017-06-23 | 广东广业开元科技有限公司 | It is a kind of based on the short text keyword extracting method presorted and system |
| CN106886576B (en) * | 2017-01-22 | 2018-04-03 | 广东广业开元科技有限公司 | It is a kind of based on the short text keyword extracting method presorted and system |
| CN106874419B (en) * | 2017-01-22 | 2019-09-10 | 北京航空航天大学 | A kind of real-time hot spot polymerization of more granularities |
| CN107741929A (en) * | 2017-10-18 | 2018-02-27 | 网智天元科技集团股份有限公司 | The analysis of public opinion method and device |
| CN107918644A (en) * | 2017-10-31 | 2018-04-17 | 北京锐思爱特咨询股份有限公司 | News subject under discussion analysis method and implementation system in reputation Governance framework |
| CN108170830B (en) * | 2018-01-10 | 2020-07-31 | 华控清交信息科技(北京)有限公司 | Group event data visualization method and system |
| CN108170830A (en) * | 2018-01-10 | 2018-06-15 | 清华大学 | Group event data visualization method and system |
| CN108415900A (en) * | 2018-02-05 | 2018-08-17 | 中国科学院信息工程研究所 | A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure |
| CN108376175A (en) * | 2018-03-02 | 2018-08-07 | 成都睿码科技有限责任公司 | Visualization method for displaying news events |
| CN108595388A (en) * | 2018-04-23 | 2018-09-28 | 乐山师范学院 | A kind of chronicle of events automatic generation method of network-oriented news report |
| CN108733791A (en) * | 2018-05-11 | 2018-11-02 | 北京科技大学 | network event detection method |
| CN109063198B (en) * | 2018-09-10 | 2022-02-11 | 浙江广播电视集团 | Multi-dimensional visual search recommendation system for fusing media resources |
| CN109063198A (en) * | 2018-09-10 | 2018-12-21 | 浙江广播电视集团 | Melt the multidimensional visual search recommender system of media resource |
| CN111858908A (en) * | 2020-03-03 | 2020-10-30 | 北京市计算中心 | Method and device for generating newspaper picking text, server and readable storage medium |
| CN112417026A (en) * | 2020-09-23 | 2021-02-26 | 郑州大学 | Urban waterlogging early warning rainstorm threshold dividing method based on crowd-sourcing waterlogging feedback |
| CN112417026B (en) * | 2020-09-23 | 2022-10-25 | 郑州大学 | Urban waterlogging early warning rainstorm threshold dividing method based on crowd-sourcing waterlogging feedback |
| CN113157908A (en) * | 2021-03-22 | 2021-07-23 | 北京邮电大学 | Text visualization method for displaying hot sub-topics of social media |
| CN114265962A (en) * | 2021-11-26 | 2022-04-01 | 航天信息股份有限公司 | Method and system for analyzing target event based on social topic |
| CN115221975A (en) * | 2022-08-10 | 2022-10-21 | 平安科技(深圳)有限公司 | Artificial intelligence-based word cloud generation method and related equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN104536956A (en) | A Microblog platform based event visualization method and system | |
| CN109829089B (en) | Social network user anomaly detection method and system based on associated graph | |
| Salloum et al. | Analysis and classification of Arabic newspapers’ Facebook pages using text mining techniques | |
| Salloum et al. | Mining social media text: extracting knowledge from Facebook | |
| Gokulakrishnan et al. | Opinion mining and sentiment analysis on a twitter data stream | |
| CN106960030B (en) | Information pushing method and device based on artificial intelligence | |
| CN103023714B (en) | The liveness of topic Network Based and cluster topology analytical system and method | |
| US20130263019A1 (en) | Analyzing social media | |
| CN104077417B (en) | People tag in social networks recommends method and system | |
| CN103793481B (en) | Microblog word cloud generating method based on user interest mining and accessing supporting system | |
| CN101593204A (en) | A Sentiment Analysis System Based on News Comment Webpage | |
| CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
| CN107544988B (en) | Method and device for acquiring public opinion data | |
| WO2014210184A2 (en) | Real-time and adaptive data mining | |
| CN104978332B (en) | User-generated content label data generation method, device and correlation technique and device | |
| CN111626050B (en) | Microblog emotion analysis method based on expression dictionary and emotion general knowledge | |
| CN109783815A (en) | A kind of various dimensions network public-opinion big data comparative analysis method | |
| CN107908749B (en) | Character retrieval system and method based on search engine | |
| CN101894129B (en) | Method of Video Topic Discovery Based on Online Video Sharing Website Structure and Video Description Text Information | |
| CN112395513A (en) | Public opinion transmission power analysis method | |
| Daouadi et al. | Organization vs. Individual: Twitter User Classification. | |
| Bok et al. | Efficient graph-based event detection scheme on social media | |
| Dastanwala et al. | A review on social audience identification on twitter using text mining methods | |
| Arafat et al. | Analyzing public emotion and predicting stock market using social media | |
| Nirmala et al. | Twitter data analysis for unemployment crisis |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150422 |