CN104463601A

CN104463601A - Method for detecting users who score maliciously in online social media system

Info

Publication number: CN104463601A
Application number: CN201410638173.6A
Authority: CN
Inventors: 尚明生; 蔡世民; 高见; 董宇蔚
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2014-11-13
Filing date: 2014-11-13
Publication date: 2015-03-25

Abstract

The invention discloses a method for detecting users who score maliciously in an online social media system. The method targets rating feedback. First, users are clustered according to their scores for products, and a normalized user confidence is calculated; second, the reliability of each user's scoring is calculated from the user confidence to obtain a candidate list of users who score maliciously; finally, the candidates are ranked by the deviation between their scores and product quality to obtain the final list of users who score maliciously. The method has advantages in both accuracy and computational efficiency and can be applied to large-scale online social media websites.

Description

A method for detecting malicious rating users in an online social media system

Technical field

The invention relates to methods for detecting malicious evaluation users in online social media systems, and in particular to a method for detecting malicious rating users in a social media system with rating feedback.

Background art

As a carrier of commerce, the Internet has become an indispensable tool for collecting, transmitting and exchanging information, and the arrival of the information age has injected new vitality into the Internet-based IT service industry. Social media in particular has attracted wide attention; it is recognized as a new economic model and catalyst for the 21st century and has been called a "sunrise industry" and a "green industry". Social media is a new, networked form of economic activity that is developing at unprecedented speed and has become an effective means for a country to strengthen its economic competitiveness and gain an advantage in the global allocation of resources. Through social media, people no longer trade face to face, inspecting physical goods and relying on paper documents (including cash); instead, transactions are carried out over the network, which presents a wide range of product information and is backed by mature logistics and distribution systems and convenient, secure payment and settlement systems. Social media hosts tens of thousands of online merchants and hundreds of millions of consumers, so establishing an effective reputation evaluation mechanism, fostering orderly competition and guiding consumers sensibly is particularly important.

Most current reputation evaluation systems rely on users' reviews and ratings of products. By commenting on a purchased product or giving it a satisfaction rating, a user expresses an opinion of the product and a degree of satisfaction with it. These reviews are a valuable information resource for manufacturers and potential consumers alike: manufacturers can analyze them to keep track of the market and of consumer feedback, and potential consumers can use them as an important reference when buying. For a potential consumer deciding whether to buy, the most direct and most important reference is usually the rating the product has received and the tenor of its reviews. On large social media trading platforms, the great majority of recommendation systems that suggest products to potential users are based on users' historical ratings and review content. If most reviews of a product are positive, a user is very likely to buy it; if most are negative, the product will hardly be bought at all. In practice, some unscrupulous merchants, seeking to increase their own profit, hire groups of people to post malicious reviews of certain products, either touting or disparaging them in ways that do not match the products' actual value. Malicious ratings and reviews reduce the reference value of review information, seriously mislead consumers' choices, and undermine the value of ratings and reviews from normal users, so that consumers gradually lose trust in the product evaluation systems of social media, which endangers and ultimately damages the whole social media industry. The authenticity and validity of the rating data and review information in a reputation evaluation system are therefore of great significance for healthy competition in social media, and the importance of identifying malicious rating users in such systems is self-evident.

There are currently two main approaches to detecting users who post fraudulent reviews or malicious ratings:

The first approach is manual labeling: a person inspects a user's ratings, review content and other reviewing behavior to judge whether the user is a fraudulent reviewer. This kind of detection is highly subjective, and because of the volume of data involved, manual methods are hard to apply to detecting malicious evaluation users in large-scale social media systems.

The second approach is automatic identification by computer. Typical fraudulent reviewers are labeled first, and a machine learning algorithm then classifies the unlabeled users. Two techniques are typical: one judges the similarity of users' review content in evaluations that include text, and the other calculates how far a user's ratings deviate from the inherent quality of the products.

For example, an article published in EPL in 2011 (A robust ranking algorithm to spamming. EPL, 94 (2011), 48002.) proposes a correlation-based user reputation ranking algorithm for detecting malicious rating users. The algorithm iteratively computes user reputation values and product mean scores at the same time, and finally detects malicious rating users from the reputation ranking. In essence it computes product quality as a reputation-weighted average of the product's ratings and detects malicious users by the deviation between a user's ratings and the inherent quality of the product: the larger the deviation, the more likely the user is a malicious rating user. Although the method is simple, the inherent quality of a product cannot be measured directly, and satisfaction with the same product varies from user to user. Representing product quality by the mean of all ratings the product has received therefore carries some error, which lowers detection accuracy. In addition, the algorithm is very robust when the proportion of malicious rating users is particularly large, but it performs poorly on real rating systems in which both the proportion of malicious rating users and the proportion of ratings they contribute are small.

As another example, a 2012 WWW conference paper (Spotting Fake Reviewer Groups in Consumer Reviews. WWW'12, 2012, pp. 191-200.) proposes a method for detecting malicious rating users based on the similarity of review content. The method detects fraudulent reviewers by analyzing the similarity of review texts: the more similar two reviews are, the more likely the users who posted them are fraudulent reviewers. Although this can detect fraudulent reviewers effectively, it requires text analysis of the review content of the whole social media system, which involves a large amount of data and is inefficient to process. Moreover, in many social media systems users do not actively write reviews, and those who do write only short texts, so analysis based on review content cannot be used properly in many systems. Rating-based feedback, by contrast, is available in the vast majority of current systems and attracts many participants because rating costs the user little, yet methods that rely on review text cannot be applied to such systems.

With the continuing development of social networks, US patent US8176057, granted on August 5, 2012, discloses a social-network-based user reputation detection method that propagates reputation values through the feedback of high-reputation users and thereby detects low-reputation users. Although the method can compute user reputation values effectively, it is mainly suited to identifying users with high reputation, and its accuracy in detecting malicious rating users is low.

In summary, existing methods cannot yet meet the practical needs of most social media websites: some are insufficiently accurate, some cannot be applied efficiently to real detection, and some are not applicable to certain evaluation systems.

Summary of the invention

The purpose of the invention is to provide an effective method for detecting malicious rating users in online social media systems. The invention targets social media systems with rating feedback and detects malicious rating users by analyzing users' rating values. This avoids the very large computational cost of analyzing and processing review text, so detection is efficient while remaining accurate.

The technical solution adopted by the invention is a method for detecting malicious rating users in an online social media system, comprising the following steps:

Step 1: Extract the user rating data from the system and preprocess it to obtain normalized user rating data consisting of the user ID, the product ID and the user's rating of the product, stored as triples (u, p, v);

Step 2: Cluster the user ratings and calculate the confidence vector of each user's ratings;

Step 2-1: For each product, cluster the users who gave it the same rating into one group;

Step 2-2: Calculate each user's confidence vector. Each component of the vector is the user's reputation value for one product, defined as the ratio of the size of the cluster group the user belongs to for that product to the total number of users who rated the product; this ratio is called the conformity ratio;

Step 3: From the user confidence vectors calculated in step 2, calculate the reliability of each user's ratings, treat the N least reliable users as suspected malicious rating users, and generate a candidate list of malicious rating users, where N is set according to factors such as the proportion of user ratings in the actual system and the required detection accuracy;

Step 4: Reorder the candidate list according to the deviation between each candidate's ratings and the inherent quality of the products, and select the M users with the largest deviation as the final malicious rating users, where M is set according to factors such as the proportion of user ratings in the actual system and the required detection accuracy.

The specific sub-steps of step 1 are:

Step 1-1: Remove users whose number of ratings is below a threshold K, where K can be adjusted according to the rating activity of the system and the required granularity of detection;

Step 1-2: Discretize non-integer ratings to integers by rounding to the nearest integer;

Step 1-3: Store the user ID, product ID and the user's rating of the product as triples (u, p, v).

In step 1, K is usually set to 8.

In step 2, the confidence vectors of different users have different dimensions and are stored in an XML file.

The specific sub-steps of step 3 are:

Step 3-1: Calculate the mean and variance of each user's confidence vector, then divide the mean by the variance to obtain the user's reliability;

Step 3-2: Sort all users in ascending order of reliability and select the first N users to generate the candidate list of malicious rating users.

The specific sub-steps of step 4 are:

Step 4-1: Calculate the mean rating of each product; this mean is taken as the inherent quality of the product;

Step 4-2: For each user in the candidate list obtained in step 3, calculate the inherent-quality deviation for each product, that is, the difference between the user's rating of the product and the product's inherent quality;

Step 4-3: Take the absolute value of each user's inherent-quality deviation for each product and average these values to obtain the user's rating deviation;

Step 4-4: Sort the users in descending order of rating deviation and select the first M users as the final malicious rating users, generating the list of malicious rating users.

The invention performs detection on user ratings. On the one hand, this removes the complex work of processing text, improves detection efficiency and makes the method applicable to almost all evaluation systems. On the other hand, the method first detects a candidate set of malicious rating users and then runs a second detection pass over the candidates, which substantially improves identification accuracy. The detection performance is especially good in realistic evaluation systems where the number of ratings per user is small relative to the total number of products and malicious rating users make up only a small share of all users.

Description of drawings

Fig. 1 is a flowchart of the method provided by the invention for detecting malicious rating users in a large-scale social media system.

Fig. 2 is a flowchart of the generation of user confidence vectors provided by the invention.

Fig. 3 is a flowchart of the user reliability calculation and the generation of the candidate list of malicious rating users provided by the invention.

Fig. 4 is a flowchart of reordering the candidate list of malicious rating users according to the deviation between the candidates' ratings and the inherent quality of the products to obtain the final malicious rating users, as provided by the invention.

Detailed description

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The invention is described in detail below with reference to the drawings. It should be noted that the described examples are intended only to aid understanding of the invention and do not limit it in any way.

The overall flow of the proposed method for detecting malicious rating users in social media, which is based on clustering of rating behavior, is shown in Fig. 1.

Step 1 is the data preprocessing module. It preprocesses the raw data input to the system, filters out noisy data and discretizes the ratings to integers; the preprocessed data is the input to the feature extraction in step S2.

Step 2 is the user confidence calculation module. It clusters the ratings in the data preprocessed in step S1 and calculates the user confidence vectors from the conformity ratio given by each cluster's size; the user confidence serves as the input to the second-level feature extraction in step S3.

Step 3 is the module that calculates the reliability of user ratings and generates the candidate list of malicious rating users. From each user's confidence vector it extracts the mean and variance of the confidence values, takes the ratio of the mean to the variance as the user's reliability, sorts users by reliability, and finally generates the candidate list of malicious rating users.

Step 4 is the module that reorders the candidate list and generates the final malicious rating users. It calculates the inherent quality of each product and, starting from the candidate list, uses the deviation between user ratings and inherent product quality to run a second detection stage on the preliminary result, producing the final detection result.

The main steps are described in detail below:

1. Input the system's raw evaluation data, preprocess it, and store the preprocessed result (step 1).

Preprocessing consists of two main parts: filtering noisy data and converting rating values to integers. First, the user rating data is separated from the raw input, and users with fewer than 8 ratings are filtered out together with their ratings. If a rating is not an integer, it is rounded to the nearest integer, with halves rounded up. The noisy data consists of users with few ratings and their rating records, so removing it has little effect on the system as a whole while effectively improving computational efficiency. Discretizing ratings to integers reduces the complexity of the clustering computation and makes the method easier to apply in practical systems.
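The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `preprocess`, the parameter name `min_ratings` (standing in for the threshold K) and the in-memory list of tuples are all hypothetical, and ratings are assumed to be positive numbers.

```python
import math
from collections import Counter

def preprocess(raw_ratings, min_ratings=8):
    """Return cleaned (u, p, v) triples.

    raw_ratings: iterable of (user_id, product_id, score) with numeric scores.
    min_ratings: stands in for the threshold K of step 1-1 (8 in the text).
    """
    raw_ratings = list(raw_ratings)
    # Count how many ratings each user has submitted.
    counts = Counter(u for u, _, _ in raw_ratings)
    triples = []
    for u, p, v in raw_ratings:
        if counts[u] < min_ratings:
            continue  # users with too few ratings are treated as noise
        # Round half up. Python's built-in round() rounds halves to the
        # nearest even integer, so floor(v + 0.5) is used instead.
        triples.append((u, p, math.floor(v + 0.5)))
    return triples

# Hypothetical input: u1 has 8 ratings of 3.5, u2 has a single rating.
raw = [("u1", f"p{k}", 3.5) for k in range(8)] + [("u2", "p0", 4.0)]
cleaned = preprocess(raw)
```

With this input, only the eight ratings of `u1` survive the filter, and each 3.5 becomes 4.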

2. Cluster the user ratings and calculate the confidence vector of each user's ratings.

Step 2 computes the user confidence vectors. Its workflow is shown in Fig. 2 and comprises clustering of rating behavior, calculation of the conformity ratio, and generation and storage of the user confidence vectors.

In step 2-1, rating behavior is clustered as follows: among the users who rated the same product, those who gave the same score are clustered into one group. This clustering is performed for every product in the system. If a user has rated N products, that user's confidence is an N-dimensional vector, each component being the reputation value the user obtained for one rating. Because ratings are discrete after preprocessing, clustering yields a fixed number of groups.

$G_j^{(r)} = \{ U_i \mid r_{i,j} = r,\ r \in Rate \}$

where $G_j^{(r)}$ is the group formed by clustering the ratings of product $O_j$, $r_{i,j}$ is the rating given by user $U_i$ to product $O_j$, and $Rate$ is the range of discrete integer ratings for the product.
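The group definition above amounts to bucketing users by (product, rating). A minimal Python sketch, assuming the preprocessed data is a list of (u, p, v) triples; the function name `cluster_by_rating` is hypothetical, and the sample triples are four of the ratings of product O1 from the worked example later in the text.

```python
from collections import defaultdict

def cluster_by_rating(triples):
    """Map each product j to {rating r: set of users who gave r}."""
    groups = defaultdict(lambda: defaultdict(set))
    for u, p, v in triples:
        groups[p][v].add(u)  # user u joins the group of product p, rating v
    return groups

triples = [("U1", "O1", 4), ("U3", "O1", 3), ("U5", "O1", 3), ("U4", "O1", 5)]
groups = cluster_by_rating(triples)
```

Here `groups["O1"][3]` is the set of users who gave product O1 a rating of 3.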

Step 2-2 calculates, for each user, the ratio of the size of the group the user belongs to, to the total number of users who rated the product. The larger the ratio, the stronger the conformity. The ratio reflects how far the user's rating behavior deviates from that of the majority. If the user belongs to a small group, the ratio is small and the user's rating deviates considerably from the general opinion. Conversely, if the user belongs to a large group, the rating agrees with that of most users, the deviation is small, and the rating is trustworthy. The system computes conformity strength by normalizing the cluster size, then generates and stores the user confidence vectors. User confidence is assigned according to the group the user belongs to and that group's conformity ratio, so the confidence value represents the conformity strength of the user's group. Users who rated the same product and were clustered into the same group receive the same confidence. This gives each user a confidence value for every product the user has rated, and these values form the user's confidence vector. Finally, the confidence vector of each user is stored.
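As a rough illustration of the conformity ratio, the sketch below computes per-user confidence values from (u, p, v) triples. The function name `confidence_vectors` and the dict-of-dicts result are assumptions made for illustration (the text stores the vectors in an XML file), and each user is assumed to rate a given product at most once. The sample data is the O2 column of Table 1 from the worked example.

```python
from collections import Counter, defaultdict

def confidence_vectors(triples):
    """Return {user: {product: group size / number of raters of product}}."""
    # Size of each (product, rating) group and number of raters per product.
    group_size = Counter((p, v) for _, p, v in triples)
    raters = Counter(p for _, p, _ in triples)
    conf = defaultdict(dict)
    for u, p, v in triples:
        conf[u][p] = group_size[(p, v)] / raters[p]
    return conf

# Ratings of product O2 taken from Table 1 (seven raters).
triples = [("U1", "O2", 5), ("U2", "O2", 4), ("U3", "O2", 4), ("U5", "O2", 4),
           ("U6", "O2", 4), ("U7", "O2", 3), ("U9", "O2", 2)]
conf = confidence_vectors(triples)
```

Four of the seven raters gave O2 a score of 4, so each of them receives a confidence of 4/7 for O2, while the single user who gave 5 receives 1/7. A nested dictionary handles the varying vector dimensions naturally, since each user only has entries for rated products.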

3. Calculate the reliability of user ratings and generate the candidate list of malicious rating users (step S3).

Step 3 calculates user reliability from the user confidence vectors generated in step S2, sorts users by reliability, and adds the top K percent of users in the ranking to the candidate list of malicious rating users. The flow of step S3 is shown in Fig. 3 and comprises calculating the mean and variance of the confidence values (step S31), calculating user reliability (step S32), and generating and storing the candidate set of malicious rating users.

In step 3-1, the mean and variance of all confidence values are extracted for each user. The mean reflects the average level of the user's reliability, and the variance reflects how much it fluctuates. User reliability is the final trustworthiness measure derived from the user confidence: it combines the user's average confidence with its degree of fluctuation, as given by the following formula:

$Score_i = \frac{Rs_i}{Ps_i}$, where $Rs_i = \frac{\sum_{j \in O_i} rp_{i,j}}{\dim \vec{rp}_i}$ and $Ps_i = \frac{\sum_{j \in O_i} (rp_{i,j} - Rs_i)^2}{\dim \vec{rp}_i}$;

Here $Score_i$ is the reliability of user $U_i$, that is, the user's final degree of trustworthiness; $Rs_i$ is the mean of the user's confidence values and represents the average level of the user's reliability; $Ps_i$ is the variance of the user's confidence values and represents how much the user's reliability fluctuates. The sums run over the products $O_i$ rated by user $U_i \in U$, $rp_{i,j}$ is the component of the confidence vector $\vec{rp}_i$ for product $O_j$, and $\dim \vec{rp}_i$ is the dimension of that vector. When the mean is large and the variance is small, the user consistently obtains a high reputation value for each rating; such users have high reliability and can be trusted.
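The reliability formula is a mean divided by a population variance, which the following sketch computes for one user. The function name `reliability` is hypothetical, and the handling of zero variance (returning infinity) is an assumption: the text does not say what happens when all of a user's confidence values are equal. The input list is assumed to be non-empty, which holds after the preprocessing filter.

```python
def reliability(confidences):
    """Score = mean / population variance of one user's confidence values."""
    n = len(confidences)
    mean = sum(confidences) / n
    variance = sum((c - mean) ** 2 for c in confidences) / n
    if variance == 0:
        # Assumption: a perfectly stable user is treated as maximally
        # reliable. The source does not specify this case.
        return float("inf")
    return mean / variance

# Hypothetical confidence vectors: one stable and high, one erratic.
stable = [0.6, 0.6, 0.5]
erratic = [0.9, 0.1, 0.5]
```

A user whose confidence values are high and close together, such as `stable`, scores far higher than one whose values swing widely, such as `erratic`, which matches the interpretation given in the text.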

In step 3-2, the candidate list of malicious rating users is generated from the user reliability calculated in step S32. Users are sorted in ascending order of reliability, and the first K percent of the list are added to the candidate set of malicious rating users, which completes the preliminary detection. Sorting uses the well-established quicksort algorithm, which is not a focus of the invention; when the data volume is large, the sort can readily be distributed to improve efficiency.
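Candidate selection can be sketched as an ascending sort followed by a cut at K percent. The function name `candidate_list` and the parameter `k_percent` are hypothetical, rounding the cut-off up to a whole user is an assumption, and Python's built-in `sorted` (Timsort) is used here in place of the quicksort the text mentions.

```python
import math

def candidate_list(scores, k_percent):
    """Return the least reliable k_percent of users, least reliable first.

    scores: dict mapping user -> reliability score.
    """
    ranked = sorted(scores, key=scores.get)  # ascending reliability
    # Assumption: round the cut-off up so a small K still selects someone.
    n = math.ceil(len(ranked) * k_percent / 100)
    return ranked[:n]

# Hypothetical reliability scores for four users.
scores = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 9.0}
candidates = candidate_list(scores, 50)
```

With four users and K = 50, the two lowest-scoring users, `b` and `c`, form the candidate list.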

4. Reorder the candidate list of malicious rating users according to the deviation between the candidates' ratings and the inherent quality of the products to obtain the final malicious rating users.

The overall flow of step 4 is shown in Fig. 4. It mainly comprises extracting the inherent quality of each product (step S41), calculating rating deviations (step S42), sorting users by the mean deviation between their ratings and the inherent product quality (step S43), and finally generating the detection result for malicious rating users.

In step 4-1, the inherent quality of a product is measured by the mean of all its ratings. Inherent quality cannot be measured directly and is usually estimated by some algorithm. The invention takes the arithmetic mean of the ratings each product has received as its inherent quality: the larger the mean, the better the quality of the product, and the smaller the mean, the worse the quality.

In step 4-2, the rating deviation can be calculated offline. The absolute value of the difference between a user's rating and the inherent quality of the product is taken as the user's rating deviation for that product.

In step 4-3, the same calculation is applied to every product the user has rated, giving the user's rating deviation vector; the mean of this vector is the user's final rating deviation. The same processing is applied to every user in the candidate set of malicious rating users generated in step 3. In what follows, "user rating deviation" always refers to this mean of the absolute deviations of the user's ratings.
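Steps 4-1 to 4-3 can be written compactly as follows. The symbols $Q_j$ (inherent quality of product $O_j$), $U_j$ (the set of users who rated $O_j$), $O_i$ (the set of products rated by user $U_i$) and $D_i$ (rating deviation of user $U_i$) are introduced here for illustration only and are assumptions about notation; $r_{i,j}$ is the rating of product $O_j$ by user $U_i$, as in step 2.

$Q_j = \frac{1}{|U_j|} \sum_{i \in U_j} r_{i,j}, \qquad D_i = \frac{1}{|O_i|} \sum_{j \in O_i} \left| r_{i,j} - Q_j \right|$

Under this notation, a user who consistently rates products far from their mean rating, in either direction, obtains a large $D_i$.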

In step 4-4, the users in the candidate set of malicious rating users are sorted in descending order of the user rating deviation computed in the preceding steps. The higher a user ranks, the larger the rating deviation and the more likely the user is a malicious rating user. The top K users in the ranking form the final list of malicious rating users. The value of K can be adjusted according to factors such as the proportion of user ratings in the actual system and the required detection accuracy. The resulting ranking is the detection result for malicious rating users.
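The second detection stage can be sketched end to end as below. This is an illustrative sketch only: the function name `rank_by_deviation` and the parameter `top_m` are hypothetical, and the inherent quality is computed over the ratings of all users, not just the candidates, which is how step 4-1 reads.

```python
from collections import defaultdict

def rank_by_deviation(triples, candidates, top_m):
    """Return the top_m candidates with the largest mean absolute deviation."""
    # Step 4-1: arithmetic mean rating of each product as inherent quality.
    total, count = defaultdict(float), defaultdict(int)
    for _, p, v in triples:
        total[p] += v
        count[p] += 1
    quality = {p: total[p] / count[p] for p in total}
    # Steps 4-2 and 4-3: mean absolute deviation per candidate user.
    deviations = defaultdict(list)
    wanted = set(candidates)
    for u, p, v in triples:
        if u in wanted:
            deviations[u].append(abs(v - quality[p]))
    mean_dev = {u: sum(d) / len(d) for u, d in deviations.items()}
    # Step 4-4: descending order, keep the first top_m users.
    ranked = sorted(mean_dev, key=mean_dev.get, reverse=True)
    return ranked[:top_m]

# Hypothetical data: user c rates both products far below the others.
triples = [("a", "x", 5), ("b", "x", 5), ("c", "x", 1),
           ("a", "y", 4), ("b", "y", 4), ("c", "y", 1)]
final = rank_by_deviation(triples, ["a", "c"], 1)
```

In this toy data set, user `c` deviates most from the mean ratings of both products and is therefore returned as the single final detection.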

The execution of the invention is illustrated below with a concrete example.

To simplify the description, the example uses a social media website in which 10 users have rated 5 products, with ratings from 1 to 5, giving 5 rating levels in total. In Table 1, rows are users (U) and columns are products (O); the value in each cell is the user's rating of the product, and an empty cell (-) means the user has not purchased the product and has given no rating. Together these form the user-product rating matrix R of Table 1.

       O1   O2   O3   O4   O5
U1      4    5    3    4    -
U2      -    4    4    2    5
U3      3    4    -    5    3
U4      5    -    -    4    3
U5      3    4    5    -    3
U6      2    4    3    5    3
U7      -    3    1    5    3
U8      1    -    3    3    4
U9      5    2    2    5    -
U10     5    -    2    1    4

Table 1

To simplify the description, only one specific implementation, clustering based on rating behavior, is used here as an example. The clustering is performed as described in step S21 and yields the clustered rating groups shown in Table 2. Rows correspond to the 5 products in Table 1, and columns correspond to the five rating levels 1 to 5. Each cell is a group formed by clustering on rating behavior: an empty cell (-) means that no user gave that score, and a number gives the size of the group after clustering.

Table 2

For each product, the user clusters are normalized by cluster size to obtain the user confidence; users who rated the same product and were clustered into the same group receive the same confidence. In Table 3, rows represent the 10 users, columns represent the 5 products, and each cell is the confidence of the user's rating of the product. An empty cell means that the user has not rated the product.

       O1      O2      O3      O4      O5
U1     0.125   0.143   0.375   0.222   -
U2     -       0.571   0.125   0.111   0.125
U3     0.250   0.571   -       0.444   0.625
U4     0.375   -       -       0.222   0.625
U5     0.250   0.571   0.125   -       0.625
U6     0.125   0.571   0.375   0.444   0.625
U7     -       0.143   0.125   0.444   0.625
U8     0.125   -       0.375   0.111   0.250
U9     0.375   0.143   0.250   0.444   -
U10    0.375   -       0.250   0.111   0.250

Table 3
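The confidence computation of this example can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the ratings of Table 1 are held in a dictionary, and the names `RATINGS`, `group_sizes` and `confidences` are hypothetical. It counts, for each product, how many users gave each score (the group sizes that Table 2 summarizes) and divides each user's group size by the number of users who rated that product.

```python
from collections import Counter

# Ratings from Table 1; None marks a product the user did not rate.
RATINGS = {
    "U1": [4, 5, 3, 4, None],
    "U2": [None, 4, 4, 2, 5],
    "U3": [3, 4, None, 5, 3],
    "U4": [5, None, None, 4, 3],
    "U5": [3, 4, 5, None, 3],
    "U6": [2, 4, 3, 5, 3],
    "U7": [None, 3, 1, 5, 3],
    "U8": [1, None, 3, 3, 4],
    "U9": [5, 2, 2, 5, None],
    "U10": [5, None, 2, 1, 4],
}
N_PRODUCTS = 5


def group_sizes(ratings):
    """For each product, count how many users gave each score."""
    sizes = [Counter() for _ in range(N_PRODUCTS)]
    for scores in ratings.values():
        for j, score in enumerate(scores):
            if score is not None:
                sizes[j][score] += 1
    return sizes


def confidences(ratings):
    """Confidence = size of the user's score group / raters of that product."""
    sizes = group_sizes(ratings)
    raters = [sum(counter.values()) for counter in sizes]
    return {
        user: [None if s is None else sizes[j][s] / raters[j]
               for j, s in enumerate(scores)]
        for user, scores in ratings.items()
    }


SIZES = group_sizes(RATINGS)
CONF = confidences(RATINGS)
for user, row in CONF.items():
    print(user, ["-" if c is None else f"{c:.3f}" for c in row])
```

Rounded to three decimals, the printed rows agree with Table 3; for instance, three of the eight users who rated O1 gave it 5 points, so each of them receives 3/8 = 0.375.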

According to step S31, the Gaussian distribution statistics of each user's confidence vector are calculated to obtain its mean and variance. The user's final reliability is the ratio of the mean to the variance. In Table 4, rows represent users, and columns give the mean and variance of the user's confidence and the resulting reliability.

       Mean     Variance   Reliability
u1     0.2163   0.1139     1.899
u2     0.2330   0.2254     1.034
u3     0.4725   0.1666     2.836
u4     0.4073   0.2034     2.002
u5     0.3927   0.2434     1.613
u6     0.4280   0.1963     2.180
u7     0.3342   0.2429     1.376
u8     0.2153   0.1235     1.743
u9     0.3030   0.1335     2.269
u10    0.2465   0.1079     2.285

Table 4

The users are sorted in ascending order of the reliability calculated in Table 4, giving: u2, u7, u5, u8, u1, u4, u6, u9, u10, u3. The top 40% of users in this list are added to the candidate set of malicious rating users, which is therefore {u2, u7, u5, u8}.
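The reliability ranking of this example can be reproduced with a short Python sketch. It is illustrative only: the names `CONFIDENCE`, `reliability` and `candidate_set` are hypothetical, and the confidence values are copied from Table 3 with unrated products omitted. One assumption is worth stating explicitly: the figures in the "Variance" column of Table 4 are matched by the sample standard deviation (`statistics.stdev`), so the sketch divides the mean by that quantity.

```python
from statistics import mean, stdev

# Confidence values copied from Table 3 (unrated products omitted).
CONFIDENCE = {
    "u1": [0.125, 0.143, 0.375, 0.222],
    "u2": [0.571, 0.125, 0.111, 0.125],
    "u3": [0.250, 0.571, 0.444, 0.625],
    "u4": [0.375, 0.222, 0.625],
    "u5": [0.250, 0.571, 0.125, 0.625],
    "u6": [0.125, 0.571, 0.375, 0.444, 0.625],
    "u7": [0.143, 0.125, 0.444, 0.625],
    "u8": [0.125, 0.375, 0.111, 0.250],
    "u9": [0.375, 0.143, 0.250, 0.444],
    "u10": [0.375, 0.250, 0.111, 0.250],
}


def reliability(conf):
    """Mean of the confidence vector divided by its sample standard deviation.

    Assumes at least two confidence values that are not all identical;
    otherwise stdev() raises an error or the division fails.
    """
    return mean(conf) / stdev(conf)


def candidate_set(confidence, fraction):
    """Sort users by ascending reliability and keep the least reliable fraction."""
    ranked = sorted(confidence, key=lambda u: reliability(confidence[u]))
    n = round(len(ranked) * fraction)
    return ranked, ranked[:n]


RANKED, CANDIDATES = candidate_set(CONFIDENCE, 0.40)
print(RANKED)
print(CANDIDATES)
```

With the 40% cut used in this example, the sketch returns the same ordering and the same four candidates as stated above. Users with fewer than two ratings would need separate handling, which the preprocessing threshold on the number of ratings already provides in practice.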

According to step S41, the mean of all ratings received by a product is calculated as the inherent quality of the product, and the candidate set of malicious rating users is re-sorted by the deviation between each user's ratings and the inherent quality of the products. In Table 5, rows represent the 4 candidate users, columns represent the 5 products, and each cell is the deviation between the user's rating and the inherent quality of the product.

Table 5

Sorting the mean deviations calculated in Table 5 in descending order gives the list of malicious rating users: U2, U7, U8, U5. Compared with the ordering in the candidate set, the ratings of U8 deviate more than those of U5, so U8 is more likely to be a malicious rating user. The detection is now complete: the list of malicious rating users has been obtained, and the higher a user ranks in it, the more likely that user is a malicious rating user.
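The re-sorting step of this example can be sketched in Python as follows. The sketch is a hedged illustration rather than the patented implementation: the names `RATINGS`, `CANDIDATES`, `product_quality` and `rating_deviation` are hypothetical, the ratings are those of Table 1, and the deviations are computed from those ratings rather than copied from Table 5.

```python
# Ratings from Table 1; None marks a product the user did not rate.
RATINGS = {
    "U1": [4, 5, 3, 4, None],
    "U2": [None, 4, 4, 2, 5],
    "U3": [3, 4, None, 5, 3],
    "U4": [5, None, None, 4, 3],
    "U5": [3, 4, 5, None, 3],
    "U6": [2, 4, 3, 5, 3],
    "U7": [None, 3, 1, 5, 3],
    "U8": [1, None, 3, 3, 4],
    "U9": [5, 2, 2, 5, None],
    "U10": [5, None, 2, 1, 4],
}
# Candidate set obtained from the reliability ranking of this example.
CANDIDATES = ["U2", "U7", "U5", "U8"]


def product_quality(ratings, n_products):
    """Inherent quality of each product = mean of all ratings it received."""
    quality = []
    for j in range(n_products):
        column = [row[j] for row in ratings.values() if row[j] is not None]
        quality.append(sum(column) / len(column))
    return quality


def rating_deviation(scores, quality):
    """Mean absolute difference between a user's ratings and product quality."""
    diffs = [abs(s - q) for s, q in zip(scores, quality) if s is not None]
    return sum(diffs) / len(diffs)


QUALITY = product_quality(RATINGS, 5)
DEVIATION = {u: rating_deviation(RATINGS[u], QUALITY) for u in CANDIDATES}
# Descending order: the largest deviation ranks first.
FINAL = sorted(CANDIDATES, key=DEVIATION.get, reverse=True)
print(FINAL)
```

Run on the example data, the sketch yields the order U2, U7, U8, U5, in agreement with the list above. Taking the first K entries of `FINAL` would give the final list for a chosen K.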

The illustrative embodiments of the invention have been described above so that those skilled in the art can understand the invention. It should be clear, however, that the invention is not limited to the scope of these specific embodiments. To those of ordinary skill in the art, variations are obvious as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A method for detecting malicious rating users in an online social media system, the method comprising:

Step 1: extracting the users' rating data from the system and preprocessing the data to obtain normalized user rating data comprising the user ID, the product ID, and the user's rating of the product, these three kinds of data being stored as triples (u, p, v);

Step 2: clustering the user ratings and calculating the confidence vector of each user's ratings;

Step 2-1: clustering users who give the same rating to the same product into one group;

Step 2-2: calculating the confidence vector of each user, each component of which represents the user's confidence value for one product, the confidence value being the conformity ratio, that is, the ratio of the size of the group to which the user belongs for that product to the number of all users who rated that product;

Step 3: calculating the reliability of each user's ratings from the user confidence vectors calculated in step 2, regarding the N least reliable users as malicious rating users, and generating a candidate list of malicious rating users, wherein N is set according to factors such as the user rating ratio of the actual system and the detection accuracy;

Step 4: re-sorting the candidate list of malicious rating users according to the degree of deviation between each candidate user's ratings and the inherent quality of the products, and selecting the M users with the largest deviation to obtain the final malicious rating users, wherein M is set according to factors such as the user rating ratio of the actual system and the detection accuracy.

2. The method for detecting malicious rating users in an online social media system as claimed in claim 1, characterized in that step 1 specifically comprises:

Step 1-1: removing users whose number of ratings is below a threshold K, wherein the threshold K can be adjusted according to the rating situation of the system and the required granularity of detection;

Step 1-2: discretizing non-integer ratings into integers by rounding to the nearest integer;

Step 1-3: storing the rating data, consisting of the user ID, the product ID, and the user's rating of the product, as triples (u, p, v).

3. The method for detecting malicious rating users in an online social media system as claimed in claim 2, characterized in that in step 1-1 the value of K is typically 8.

4. The method for detecting malicious rating users in an online social media system as claimed in claim 1, characterized in that in step 2 the confidence vectors of different users have different dimensions and are stored in an XML file.

5. The method for detecting malicious rating users in an online social media system as claimed in claim 1, characterized in that step 3 specifically comprises:

Step 3-1: calculating the mean and the variance of each user's confidence vector, and dividing the mean by the variance to obtain the user's reliability;

Step 3-2: sorting all users in ascending order of reliability and selecting the first N users to generate the candidate list of malicious rating users.

6. The method for detecting malicious rating users in an online social media system as claimed in claim 1, characterized in that step 4 specifically comprises:

Step 4-1: calculating the mean of the ratings of each product, this mean being regarded as the inherent quality of the product;

Step 4-2: for each user in the candidate list of malicious rating users obtained in step 3, calculating the deviation from the inherent quality for each product, that is, the difference between the user's rating of the product and the inherent quality of that product;

Step 4-3: taking the absolute value of each user's deviation from the inherent quality for each product, and averaging these absolute values to obtain the user's rating deviation;

Step 4-4: sorting the users in descending order of rating deviation and selecting the first M users as the final malicious rating users to generate the list of malicious rating users.