CN108182181A

CN108182181A - A Method for Detecting Duplicate Public Contribution Merge Requests Based on Hybrid Similarity

Info

Publication number: CN108182181A
Application number: CN201810100193.6A
Authority: CN
Inventors: 余跃; 李志星; 尹刚; 王涛; 王怀民; 范强; 於杰; 张迅晖; 胡东阳
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-02-01
Filing date: 2018-02-01
Publication date: 2018-06-19
Anticipated expiration: 2038-02-01
Also published as: CN108182181B

Abstract

The invention belongs to the field of collaborative software development and discloses a method for detecting duplicate public contribution merge requests based on hybrid similarity. For a newly submitted public contribution merge request, the method first calculates the text similarity between it and each historical public contribution merge request, then calculates the change similarity between them. A data set of historical duplicate contributions is collected from a popular collaborative development platform; trained on this data set, a weight calculation method based on a greedy search strategy combines the two similarities into a hybrid similarity between public contributions. Finally, according to the hybrid similarity values, the method returns a list of the historical public contribution merge requests that the given merge request is most likely to duplicate. The invention can detect duplicate public contributions promptly, avoid redundant manual code review, and improve the efficiency of public contribution review.

Description

A Method for Detecting Duplicate Public Contribution Merge Requests Based on Hybrid Similarity

Technical Field

The invention belongs to the field of collaborative software development and relates to a method for detecting duplicate public contribution merge requests based on hybrid similarity.

Background

In open source communities such as GitHub, the software development model based on large-scale crowd collaboration has greatly improved the efficiency of software innovation and has inspired more and more developers to take part in creating open source software. However, this development model is a parallel process without unified coordination. When multiple developers independently contribute code to the same open source project with the same goal in mind, they may submit duplicate contribution merge requests (pull requests). Popular projects, which attract large numbers of peripheral developers and receive a steady stream of community contributions, are especially prone to this problem. As shown in Figure 2, two developers, Bob and Alice, both fork the same main repository and then independently make changes in their own local clones. When both want to implement the same feature or fix the same defect, neither knows what the other is working on, so both may make the corresponding changes and submit a merge request to the main repository. The two merge requests then each go through contribution review and updates until some developer realizes that the two duplicate public contribution merge requests exist.

Duplicate public contribution merge requests waste platform resources and increase the platform's maintenance cost. They also cause the contribution review process to be carried out more than once for the same change, which costs reviewers additional time and effort. During the life cycle of a public contribution merge request (from its submission to the platform until it is accepted or rejected), a duplicate may be identified at any point, and the later it is identified, the more resources and effort are wasted. In addition, during review, contributors often update a merge request in response to reviewer feedback. If duplicates are not identified early, the two contributors may therefore also do redundant work and come to doubt the competence of the project's management team. The negative effect on a contributor is more serious still when their merge request is judged to be a duplicate of one submitted later and is closed by a reviewer.

At present, the mechanism for identifying duplicate contribution merge requests on the GitHub platform relies on reviewers discovering them manually. (GitHub is a hosting platform for open source and private software projects; it is so named because it supports only Git as its repository format.) For popular projects, however, crowd developers continuously submit code contributions to the main repository, and a large number of contribution merge requests require code review. It is unrealistic to expect a reviewer to remember all historical contribution merge requests, compare them with each newly submitted one, and then judge whether it is a duplicate. Under the current mechanism, a duplicate is discovered only when some developer happens to notice that two duplicate contribution merge requests exist, so most duplicates are not identified in time. A tool that automatically detects duplicates when a contribution merge request is submitted is therefore necessary. First, an automatic detection tool assists reviewers and spares them redundant work. Second, detecting duplicates as soon as they are submitted lets the contributors on both sides get in touch and collaborate early, rather than each continuing to do duplicate work.

Summary of the Invention

To solve the above technical problem, the invention proposes a hybrid-similarity-based method for detecting duplicate contributions that may exist on an open source software project hosting platform. The specific technical solution is as follows.

A method for detecting duplicate public contribution merge requests based on hybrid similarity, comprising the following steps:

S1. Calculate the text similarity between the newly submitted public contribution merge request and each historical public contribution merge request, where the text similarity includes title text similarity and description text similarity;

S2. Calculate the change similarity between the newly submitted public contribution merge request and each historical public contribution merge request, where the change similarity is the path similarity between the files modified by the two merge requests;

S3. Collect a data set of historical duplicate contribution merge requests from a collaborative development platform, train a greedy weight search algorithm on this data set to obtain the weights of the text similarity and the change similarity, and then calculate the hybrid similarity between public contribution merge requests according to these weights;

S4. Following steps S1 to S3, obtain one hybrid similarity for each historical public contribution merge request, sort the historical merge requests by hybrid similarity value, and obtain a list of historical public contribution merge requests that the newly submitted one is likely to duplicate.

Further, step S1 proceeds as follows:

S11. Extract the title text and description text from the newly submitted public contribution merge request and from the historical public contribution merge request, obtaining two title texts and two description texts;

S12. Preprocess the title texts and description texts;

S13. Convert the preprocessed title texts and description texts into multi-dimensional vectors, obtaining two title text vectors and two description text vectors;

S14. Use the cosine formula to calculate the similarity between the two title text vectors, which is the title text similarity between the newly submitted and the historical public contribution merge request; likewise, use the cosine formula to calculate the similarity between the two description text vectors, which is their description text similarity.

Further, step S2 proceeds as follows:

S21. Extract the files modified by the newly submitted public contribution merge request and by the historical public contribution merge request, obtaining two file sets;

S22. Calculate the path similarity between pairs of files drawn from the two file sets, and from it obtain the change similarity between the newly submitted and the historical public contribution merge request.

Further, the collaborative development platform is the GitHub platform.

Further, the preprocessing of the title text and description text in step S12 comprises tokenization, stemming, and stop word removal.

Compared with the prior art, the invention has the following beneficial effects: 1. The invention proposes a detection method based on hybrid similarity for duplicate contributions that may exist on an open source software project hosting platform. The method is a key link in improving code review efficiency: it spares reviewers duplicate review work, helps core developers organize the code review process more efficiently, and improves the efficiency with which public contributions are integrated. 2. The invention combines text similarity, covering the title and description of a public contribution merge request, with the change similarity derived from the modified files to calculate the similarity between public contribution merge requests, which better reveals duplication between them. 3. The invention collects a data set of historical duplicate public contribution merge requests from the GitHub platform through automatic identification and manual inspection. This data set can be used to optimize the automatic detection model for duplicate public contribution merge requests and improve its detection performance. 4. The invention uses a strategy based on greedy search to combine the two kinds of similarity effectively, yielding a hybrid similarity value that better reflects the similarity between public contribution merge requests.

Brief Description of the Drawings

Figure 1 is a schematic flow chart of the method of the invention;

Figure 2 is a flow chart of multiple developers contributing in parallel, as discussed in the Background;

Figure 3 shows the program code of the algorithm for calculating the change similarity of public contribution merge requests;

Figure 4 shows the program code of the algorithm for calculating the path similarity of two files;

Figure 5 shows the program code of the weight calculation algorithm based on the greedy search strategy.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

Figure 1 is a schematic flow chart of the method of the invention. The specific steps are as follows:

S1. Calculate the text similarity between the newly submitted public contribution merge request and each historical public contribution merge request, where the text similarity includes title text similarity and description text similarity.

For the text extracted from the titles and descriptions of public contribution merge requests, a standard preprocessing pipeline is applied first: tokenization, stemming, and stop word removal. Many existing strategies can split a sentence into tokens, and the right choice depends on the type of data and the application domain. Some text that would normally be split into several words should be treated as a single token in the context of public contribution merge requests. For example, text representing code paths or hyperlinks is usually long but refers to one complete concept, so it should not be split. We therefore use a regular expression tokenizer to parse the raw text. Some of the regular expressions, and examples of the text they match, are listed below.

Code path:

– \w+(?:\:\:\w+)*

– "ActionDispatch::Http::URL"

Number of a public contribution merge request on the GitHub platform:

– \#\d+

– "#10319"

After tokenization, each word is converted to its root form (for example, "was" becomes "be" and "errors" becomes "error"); this conversion is performed by the Porter stemming algorithm. Finally, stop words such as "the" and "a", which occur frequently but contribute little to distinguishing a sentence, are removed.
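The preprocessing steps described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the hyperlink pattern, the stop word subset, and the `naive_stem` helper are assumptions introduced here. In particular, `naive_stem` is only a stand-in for the Porter algorithm, and the code path pattern requires at least one `::` so that ordinary words fall through to the last alternative.

```python
import re

# Alternatives are tried left to right, so special tokens win over plain words.
TOKEN_PATTERN = re.compile(
    r"\w+(?:::\w+)+"    # code path, e.g. ActionDispatch::Http::URL
    r"|#\d+"            # merge request number, e.g. #10319
    r"|https?://\S+"    # hyperlink kept whole (pattern assumed for illustration)
    r"|\w+"             # ordinary word
)

# Illustrative subset only; a real stop word list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to"}


def naive_stem(token):
    """Hypothetical stand-in for the Porter stemmer: strips a plural 's' only."""
    if (token.isalpha() and len(token) > 3
            and token.endswith("s") and not token.endswith("ss")):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, lowercase, drop stop words, and stem ordinary words."""
    result = []
    for tok in TOKEN_PATTERN.findall(text):
        if "::" in tok or tok.startswith("#") or "://" in tok:
            result.append(tok)  # keep code paths, numbers, and links verbatim
            continue
        tok = tok.lower()
        if tok not in STOP_WORDS:
            result.append(naive_stem(tok))
    return result
```

For example, `preprocess("Fix errors in ActionDispatch::Http::URL, see #10319")` keeps the code path and the merge request number as single tokens while reducing "errors" to "error".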

The preprocessed text is then converted, using the TF-IDF (Term Frequency–Inverse Document Frequency) model, into a multi-dimensional vector that can be used in the Vector Space Model (VSM). The vectorized text i can be written as

$$TextVec_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,v})$$

where each dimension of the vector corresponds to one word and v is the total number of distinct words in the whole text corpus. The value $w_{i,k}$ is the weight of the k-th element of the vector for text i and is computed by the TF-IDF model:

$$w_{i,k} = tf_{i,k} \times idf_{i,k}$$

In the formula above, $tf_{i,k}$ is the term frequency, that is, the frequency with which the k-th word occurs in text i, and $idf_{i,k}$ is the inverse document frequency, which measures how well a word discriminates between documents.

After vectorization, we use the cosine formula to calculate the similarity $SimText(i,j)$ of two text vectors $TextVec_i$ and $TextVec_j$:

$$SimText(i,j) = \frac{TextVec_i \cdot TextVec_j}{|TextVec_i| \, |TextVec_j|}$$

Using the cosine formula, we obtain the title text similarity $SimText_{title}(i,j)$ and the description text similarity $SimText_{desc}(i,j)$ between two public contribution merge requests. Here i and j denote texts, and |·| denotes the norm of a vector.
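The TF-IDF weighting and cosine comparison described above can be sketched with the Python standard library. This is an illustrative sketch under assumptions: the text does not specify the exact TF-IDF variant, so the smoothed form `log(N / df) + 1` used here is a choice made for this example (it keeps words that occur in every document from getting zero weight). Vectors are stored sparsely as dictionaries rather than as v-dimensional arrays.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / total
            # Assumed smoothing: +1 so terms present in every document keep weight.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In use, the same two functions would be applied once to the title tokens and once to the description tokens, giving the title and description text similarities separately.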

Collaboration on the GitHub platform relies on Git. With Git's support, once a contributor submits a contribution merge request on GitHub, the changes it contains can be displayed as a diff. To calculate the similarity of two public contribution merge requests from the diff information, the raw diff data is first parsed into structured data, from which the modules and files modified by each merge request are extracted. The algorithm in Figure 3 calculates the change similarity between two public contribution merge requests. Its input is the two file sets files_i and files_j modified by the two merge requests (denoted PR in the algorithm). Line 1 of the code in Figure 3 initializes a list that holds the temporary results produced by the algorithm. Lines 2 to 5 calculate the file path similarity of every pair of files, one from each set, and store each pair together with its similarity in the list; the file path similarity of two files is calculated by the algorithm in Figure 4. Line 6 sorts the list elements by similarity value, and line 7 determines how many file pairs will finally be kept. Line 8 initializes a new list to hold the file pairs that are kept and their similarity values. Lines 9 to 13 repeatedly take the file pair with the highest similarity from the temporary list and add it to the final list. Because each file may appear only once in the final list, that is, each file is matched with exactly one other file with which it has maximum similarity, line 12 removes from the temporary list all other pairs that contain either file of the selected pair. Finally, the similarity values in the final list are summed and divided by the larger of the two file set sizes, which gives the change similarity of the two public contribution merge requests.

The algorithm in Figure 4 calculates the path similarity of two files. First, the function splits the two file paths at the path separator, obtaining two lists of directory names. Lines 3 to 7 then calculate the depth of the longest common subdirectory of the two paths. Finally, this depth divided by the larger of the two path depths is the path similarity of the two files.
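Because Figures 3 and 4 are not reproduced in the text, the following Python sketch reconstructs the two algorithms from the prose description alone. Several details are assumptions made for this example: the file name counts as the last path component (so identical paths score 1.0), the number of pairs kept is the size of the smaller file set, and `/` is the path separator.

```python
def path_similarity(path_a, path_b):
    """Depth of the common leading path divided by the deeper path's depth."""
    parts_a = path_a.split("/")
    parts_b = path_b.split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    return common / max(len(parts_a), len(parts_b))


def change_similarity(files_i, files_j):
    """Greedy one-to-one matching of files by path similarity."""
    if not files_i or not files_j:
        return 0.0
    # Every cross pair with its similarity, most similar first.
    pairs = sorted(
        ((path_similarity(a, b), a, b) for a in files_i for b in files_j),
        key=lambda p: p[0],
        reverse=True,
    )
    keep = min(len(files_i), len(files_j))  # assumed number of pairs to keep
    used_i, used_j, total = set(), set(), 0.0
    for sim, a, b in pairs:
        if a in used_i or b in used_j:
            continue  # each file may appear in only one kept pair
        used_i.add(a)
        used_j.add(b)
        total += sim
        if len(used_i) == keep:
            break
    # Normalize by the larger file set so unmatched files lower the score.
    return total / max(len(files_i), len(files_j))
```

Skipping pairs whose files are already used has the same effect as deleting those pairs from the temporary list, as the description of line 12 of Figure 3 suggests, without mutating the list during iteration.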

In this embodiment, the data set of historical duplicate public contribution merge requests is collected from the GitHub platform as follows:

(1) Random sampling: 26 popular projects on the GitHub platform are selected, and for each project a subset of its public contribution merge requests is chosen at random.

(2) Manual screening: For each selected public contribution merge request, every comment that references another public contribution merge request is inspected manually, and the comments that concern duplication are picked out. The invention calls such comments about the duplication of public contribution merge requests indicative comments.

(3) Rule extraction: From the set of indicative comments collected in the previous step, it is found that certain words and phrases are used frequently when a reviewer points out that one public contribution merge request duplicates another. For example, "dup of", "closed by", and "addressed in" in the comments below are phrases that reviewers often use to indicate duplication.

– "dup of #xxxx"

– "Closed by https://github.com/rails/rails/pull/13867"

– "This has been addressed in #27768."

Regular expressions are therefore extracted from these indicative comments and used as rules to match indicative comments automatically. Examples of such rules include:

clos(e|ed|ing)(\w+){,5}(by|of)(\w+:?){,5}#\d+

(4) Automatic identification: Using the regular expression rules above, indicative comments are identified automatically, which reveals pairs of public contribution merge requests that duplicate each other. When a comment is identified as indicative, the number of the referenced public contribution merge request is extracted from it and paired with the merge request to which the comment belongs, forming a pair of duplicate public contribution merge requests.

(5) Manual inspection: Automatic rule-based identification introduces erroneous data, that is, some identified pairs are not in fact duplicates of each other. The automatically identified data is therefore inspected manually, according to the following criteria:

1) The author of the duplicate public contribution merge request was unaware of the original public contribution merge request. Whether the author was aware is judged from the comments on the two merge requests.

2) The reviewers reached a consensus that the public contribution merge requests are duplicates. That is, after one reviewer stated that a merge request duplicates another, no other reviewer objected; instead, they agreed and closed one of the two.
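The automatic identification in step (4) can be sketched as follows. The pattern below is a simplified stand-in written for this example, not the rule set used in the invention: it covers only the three phrases quoted in step (3), and both the `INDICATIVE` pattern and the `find_duplicate_pairs` helper are hypothetical names. Its output would still need the manual inspection of step (5).

```python
import re

# Simplified, assumed rule: a duplication phrase followed (within a few words)
# by a "#123" reference or a GitHub pull request URL.
INDICATIVE = re.compile(
    r"(?:dup(?:licate)?\s+of"
    r"|clos(?:e|ed|ing)\s+(?:\w+\s+){0,5}?(?:by|of)"
    r"|addressed\s+in)"
    r"\s*(?:\w+:?\s+){0,5}?"
    r"(?:#|https://github\.com/[\w.-]+/[\w.-]+/pull/)(\d+)",
    re.IGNORECASE,
)


def find_duplicate_pairs(comments):
    """comments: iterable of (pr_number, comment_text).

    Returns a set of (pr_number, referenced_pr_number) candidate duplicate pairs.
    """
    pairs = set()
    for pr_number, text in comments:
        for match in INDICATIVE.finditer(text):
            referenced = int(match.group(1))
            if referenced != pr_number:  # ignore self-references
                pairs.add((pr_number, referenced))
    return pairs
```

A comment that merely mentions another merge request, such as "See #4 for context", is not matched because it contains none of the duplication phrases.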

After the various similarities between a given public contribution merge request and the historical ones have been calculated, they are used to find the list of public contribution merge requests most similar to the given one. To make full use of these similarities, the invention calculates the final similarity between two public contribution merge requests as a hybrid similarity. The final similarity Sim(i, j) is calculated as follows:

$$Sim(i,j) = a \times SimText_{title}(i,j) + b \times SimText_{desc}(i,j) + c \times SimDiff_{file}(i,j)$$

In the formula above, Sim(i, j) is the hybrid similarity obtained as a weighted combination of the individual similarities: the title text similarity SimText_title(i, j), the description text similarity SimText_desc(i, j), and the change similarity SimDiff_file(i, j), with weights a, b, and c respectively. To choose good weights, a greedy search algorithm, shown in Figure 5, determines their values automatically from the collected data set of duplicate public contribution merge requests on GitHub. The inputs of the algorithm are a set of duplicate public contribution merge requests, the maximum number of iterations, and the step size tried in each search. The algorithm returns a locally optimal set of weights. In the code in Figure 5, the first three lines (lines 1 to 3) initialize the three weights, combine them into a vector, and use this initial weight vector to obtain the initial value of the evaluation function. The evaluation function assesses how well a weight vector reflects the contribution of each type of similarity to the actual similarity of public contribution merge requests, that is, whether the weight vector yields similarity proportions that match reality. For a given public contribution merge request, the higher its duplicates rank in the returned list, the better. The evaluation function fitness is therefore defined in terms of the rank at which each known duplicate appears.

In the evaluation function, DupPR denotes the data set of historical duplicate public contribution merge requests, wts denotes the current weight vector, <pr_e, pr_l> denotes a pair of duplicate public contribution merge requests, SimPRs(pr_l) returns the list of public contribution merge requests most similar to pr_l, and Rank(pr_e, SimPRs(pr_l)) returns the position of pr_e in that list.

Lines 4 to 21 of Figure 5 iteratively search for better weights until the number of iterations reaches the maximum specified in the input. Line 5 first creates a list that holds the search history of the current iteration. In each iteration, the weight vector is varied in two directions: a forward search (lines 7 to 10) and a backward search (lines 11 to 14). At the start of each search, the current best weight vector is first saved (lines 7 and 11). In the forward search, the weight under examination is increased by one step (line 8); in the backward search, it is decreased by one step (line 12). The updated weight vector is used to compute a new evaluation function value (lines 9 and 13), and the new weight vector is also recorded in search_history (lines 10 and 14). Once all the weights, a, b, and c, have been examined, the highest evaluation value is taken from the search history (line 15). If this value is higher than the evaluation value of the current best weight vector, the current best weight vector and the best evaluation value are both updated (lines 16 to 19); otherwise no update is made. The next iteration then begins. Finally, the algorithm in Figure 5 outputs the best-performing weight vector (line 23).
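The search loop described above can be sketched in Python as follows. This is a reconstruction from the prose, since Figure 5 is not reproduced in the text. The `fitness` argument is a placeholder for the rank-based evaluation function, and the default initial weights, step size, and iteration count are assumptions chosen for illustration.

```python
def greedy_weight_search(fitness, max_iter=50, step=0.1, init=(1.0, 1.0, 1.0)):
    """Coordinate-wise greedy search for a locally optimal weight vector.

    fitness: callable taking a list of weights and returning a score
             (higher is better); stands in for the rank-based evaluation.
    """
    best = list(init)
    best_score = fitness(best)
    for _ in range(max_iter):
        search_history = []  # (score, weights) tried in this iteration
        for k in range(len(best)):
            for delta in (step, -step):   # forward search, then backward search
                candidate = list(best)    # start from the current best vector
                candidate[k] += delta
                search_history.append((fitness(candidate), candidate))
        top_score, top_weights = max(search_history, key=lambda h: h[0])
        if top_score > best_score:        # update only on strict improvement
            best, best_score = top_weights, top_score
    return best, best_score
```

As a quick check, a toy fitness such as `lambda w: -sum((x - t) ** 2 for x, t in zip(w, (0.5, 0.2, 0.3)))` with `init=(0.0, 0.0, 0.0)` drives the weights to approximately (0.5, 0.2, 0.3). Because only one weight changes per iteration and the search never accepts a worse score, the result is a local optimum, which matches the description of the algorithm's output.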

S4. Following steps S1 to S3, one hybrid similarity is obtained for each historical public contribution merge request. The historical merge requests are sorted by hybrid similarity value, giving a list of historical public contribution merge requests that the newly submitted one is likely to duplicate. In this embodiment, a top-k value can be set in advance, and the first k historical public contribution merge requests in the list are taken as those most similar to the newly submitted one.
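The weighted combination and the top-k ranking of step S4 can be sketched as follows. The function names, the dictionary layout of the input, and the example weights are assumptions made for this illustration; the three similarity values per historical merge request are taken as already computed.

```python
def hybrid_similarity(sim_title, sim_desc, sim_file, weights):
    """Sim = a * SimText_title + b * SimText_desc + c * SimDiff_file."""
    a, b, c = weights
    return a * sim_title + b * sim_desc + c * sim_file


def top_k_candidates(similarities, weights, k=5):
    """similarities: {pr_number: (sim_title, sim_desc, sim_file)} for each
    historical merge request, measured against the newly submitted one.

    Returns the k historical merge request numbers with the highest hybrid similarity.
    """
    scored = [
        (hybrid_similarity(*sims, weights), pr)
        for pr, sims in similarities.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pr for _, pr in scored[:k]]
```

With weights (0.4, 0.3, 0.3), a historical request scoring (0.9, 0.8, 1.0) gets a hybrid similarity of 0.9 and ranks ahead of one scoring (0.5, 0.5, 0.5), which gets 0.5.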

In summary, the public contribution merge request duplication detection method based on mixed similarity proposed by the present invention can detect duplicate public contributions in a timely manner, avoid redundant manual code review, and improve the efficiency of public contribution review.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A public contribution merge request duplication detection method based on mixed similarity, characterized in that it comprises the following steps:

S1. Calculate the text similarity between the newly submitted public contribution merge request and a historical public contribution merge request, wherein the text similarity includes title text similarity and description text similarity;

S2. Calculate the change similarity between the newly submitted public contribution merge request and the historical public contribution merge request, wherein the change similarity refers to the path similarity between the files modified by the public contribution merge requests;

S3. Collect a data set of historical duplicate contribution merge requests on a collaborative development platform, train on this data set using a weight search algorithm based on a greedy strategy to obtain weight values for the text similarity and the change similarity, and calculate the mixed similarity between public contribution merge requests according to the weight values;

S4. Following steps S1 to S3, obtain a mixed similarity for each historical public contribution merge request, and sort by mixed similarity value to obtain a list of historical public contribution merge requests that duplicate the newly submitted public contribution merge request.

2. The public contribution merge request duplication detection method based on mixed similarity as claimed in claim 1, characterized in that the specific process of step S1 is:

S11. Extract the title text and the description text from the newly submitted public contribution merge request and from the historical public contribution merge request respectively, to obtain two title texts and two description texts;

S12. Preprocess the title texts and description texts;

S13. Convert the preprocessed title texts and description texts into multi-dimensional vectors respectively, to obtain two title text vectors and two description text vectors;

S14. Use the cosine formula to calculate the similarity between the two title text vectors, which is the title text similarity between the newly submitted public contribution merge request and the historical public contribution merge request; use the cosine formula to calculate the similarity between the two description text vectors, which is the description text similarity between the newly submitted public contribution merge request and the historical public contribution merge request.

3. The public contribution merge request duplication detection method based on mixed similarity as claimed in claim 1, characterized in that the specific process of step S2 is:

S21. Extract the files modified by the newly submitted public contribution merge request and by the historical public contribution merge request, to obtain two file sets;

S22. Calculate the path similarity between the files in the two file sets, which is the change similarity between the newly submitted public contribution merge request and the historical public contribution merge request.

4. The public contribution merge request duplication detection method based on mixed similarity as claimed in claim 1, characterized in that the collaborative development platform is the GitHub platform.

5. The public contribution merge request duplication detection method based on mixed similarity as claimed in claim 2, characterized in that, in step S12, preprocessing the title text and description text specifically includes word segmentation, stemming, and stop-word removal.