
CN120012754A - Text detection method, device, electronic device and computer-readable storage medium - Google Patents

Info

Publication number
CN120012754A
Authority
CN
China
Prior art keywords
segment
detected
text
plagiarism
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510466943.1A
Other languages
Chinese (zh)
Inventor
郑雯雯
杨代庆
王美玲
高继平
宋扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Scientific And Technical Information Of China
Original Assignee
Institute Of Scientific And Technical Information Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Scientific And Technical Information Of China
Priority to CN202510466943.1A
Publication of CN120012754A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请实施例提供了一种文本检测方法、装置、电子设备及计算机可读存储介质,涉及自然语言处理领域。该方法包括:将待检测文本划分为至少一个待检测片段,确定待检测片段的向量表示,基于向量表示从检索树中确定待检测片段对应的参考片段;根据待检测片段和对应的参考片段之间的语义相似度和待检测片段的TF‑IDF值,确定待检测片段的目标相似度;从各待检测片段中,选择目标相似度大于预设相似度阈值的待检测片段作为目标检测片段;目标检测片段以及对应的参考片段输入类型预测模型,获得输出的目标检测片段对应的一类抄袭类型,并生成检测结果。本申请解决了无法检测到经过深度处理的文本的抄袭情况和无法精准区分文本的抄袭类型的问题。

The embodiments of the present application provide a text detection method, device, electronic device and computer-readable storage medium, which relate to the field of natural language processing. The method includes: dividing the text to be detected into at least one segment to be detected, determining the vector representation of the segment to be detected, and determining the reference segment corresponding to the segment to be detected from the retrieval tree based on the vector representation; determining the target similarity of the segment to be detected according to the semantic similarity between the segment to be detected and the corresponding reference segment and the TF-IDF value of the segment to be detected; from each segment to be detected, selecting the segment to be detected whose target similarity is greater than the preset similarity threshold as the target detection segment; the target detection segment and the corresponding reference segment are input into the type prediction model to obtain a type of plagiarism type corresponding to the output target detection segment, and generate a detection result. The present application solves the problem of being unable to detect plagiarism in deeply processed texts and being unable to accurately distinguish the types of plagiarism in texts.

Description

Text detection method, text detection device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of natural language processing technology, and in particular, to a text detection method, a device, an electronic apparatus, and a computer readable storage medium.
Background
Idea plagiarism is a relatively covert and complex form of academic misconduct. At its core, the plagiarist deeply reworks the original text with a semantic rewriting tool. Such behavior appears detached from direct text copying, but it retains the core ideas and logical structure of the original content and is therefore difficult for conventional text-matching detection systems to recognize.
With the rapid development of generative artificial intelligence, large language models represented by GPT-4 have acquired high-level semantic understanding and generation capabilities. These models can rewrite input text, not only replacing words and phrases but also reorganizing sentence structure and adjusting the logic of expression, thereby generating text with high fluency and coherence.
Online rewriting tools offer similar functionality to some extent, processing the input text through simple vocabulary replacement or syntactic adjustment. Although relatively primitive, such tools can already bypass some duplicate-checking systems based on string matching. In contrast, rewritten text generated by deep-learning-based large models is more semantically consistent and closer to human writing in form, so the concealment and camouflage of plagiarism are markedly improved.
Therefore, related text detection techniques cannot detect plagiarism in deeply processed text and cannot accurately distinguish the plagiarism type of the text.
Disclosure of Invention
The embodiment of the application provides a text detection method, a device, electronic equipment and a computer readable storage medium, which are used for solving the technical problems that the plagiarism condition of a deeply processed text cannot be detected and the plagiarism type of the text cannot be accurately distinguished.
According to a first aspect of the embodiment of the application, a text detection method is provided, which comprises: dividing a text to be detected into at least one segment to be detected; for each segment to be detected, determining a vector representation of the segment to be detected, and determining, based on the vector representation and a pre-constructed search tree, a reference segment corresponding to the segment to be detected from the search tree, wherein the reference segment is the segment in the search tree with the highest semantic similarity to the segment to be detected, and the search tree consists of vector representations of the text segments of all texts in a corpus;
For each fragment to be detected, determining target similarity of the fragment to be detected according to semantic similarity between the fragment to be detected and a corresponding reference fragment and TF-IDF value of the fragment to be detected, wherein the TF-IDF value is used for representing importance of words in the fragment to be detected in a reference text, and the target similarity is used for representing similarity between the fragment to be detected and the corresponding reference fragment;
selecting a fragment to be detected with target similarity larger than a preset similarity threshold value from all fragments to be detected as a target detection fragment;
For each target detection segment, inputting the target detection segment and the reference segment corresponding to the target detection segment into a type prediction model to obtain the plagiarism type corresponding to the target detection segment output by the type prediction model, and generating a detection result, wherein the detection result comprises the plagiarism type corresponding to each target detection segment;
The type prediction model is trained by taking a sample fragment and a plagiarism fragment corresponding to the sample fragment as training samples and taking the plagiarism type of the plagiarism fragment as a training label.
In one possible implementation manner, after the plagiarism type corresponding to the target detection segment output by the type prediction model is obtained, the method further comprises: for each target detection segment, determining a context matching degree between the context of the target detection segment and the context of the reference segment according to the target detection segment and the corresponding reference segment;
For each target detection segment, calculating the semantic similarity of each semantic unit in the target detection segment and each semantic unit in the corresponding reference segment, and constructing a contrast matrix corresponding to the target detection segment, wherein elements in the contrast matrix are the semantic similarity between the semantic units of the target detection segment and the semantic units of the reference text;
And for each target detection segment, generating a plagiarism report according to the plagiarism type corresponding to the target detection segment, the context matching degree between the target detection segment and the reference segment and the comparison matrix corresponding to the target detection segment.
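The contrast matrix described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the bag-of-words `embed` function is a toy stand-in for whatever sentence encoder is actually used, and the function names are hypothetical.

```python
import math
from collections import Counter


def embed(unit: str) -> Counter:
    """Toy stand-in for a sentence encoder: bag-of-words counts (assumption)."""
    return Counter(unit.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def contrast_matrix(target_units, reference_units):
    """Rows index semantic units of the target segment, columns index
    semantic units of the reference segment."""
    return [
        [cosine(embed(t), embed(r)) for r in reference_units]
        for t in target_units
    ]


target = ["deep models rewrite text", "the weather is nice"]
reference = ["deep models rewrite text", "models can rewrite deep text fluently"]
matrix = contrast_matrix(target, reference)
```

Each row of `matrix` shows how strongly one semantic unit of the target segment matches every semantic unit of the reference segment, which is the information the plagiarism report draws on.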
In another possible implementation manner, according to a predefined window size, dividing the target detection segment into a plurality of first windows in a sliding window manner, and converting words of each first window into vector representations;
For each first window of the target detection segment, determining semantic similarity between the first window and the corresponding reference segment according to vector representation of the first window;
And determining the context matching degree between the target detection fragment and the reference fragment according to the semantic similarity corresponding to each first window of the target detection fragment.
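A minimal sketch of the sliding-window context matching follows. The text does not state how the per-window similarities are aggregated, so taking their mean is an assumption here; the bag-of-words vectors and the names `sliding_windows` and `context_match` are likewise illustrative only.

```python
import math
from collections import Counter


def bow(words) -> Counter:
    """Toy word-count vector standing in for a learned embedding (assumption)."""
    return Counter(w.lower() for w in words)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sliding_windows(words, size, step=1):
    """Split a word list into overlapping windows of a predefined size."""
    if len(words) <= size:
        return [words]
    return [words[i:i + size] for i in range(0, len(words) - size + 1, step)]


def context_match(target: str, reference: str, size: int = 3) -> float:
    """Mean window-to-reference similarity (the aggregation is an assumption)."""
    ref_vec = bow(reference.split())
    sims = [cosine(bow(w), ref_vec) for w in sliding_windows(target.split(), size)]
    return sum(sims) / len(sims)


score = context_match("a b c d", "a b c d", size=2)
```

Other aggregations, such as the maximum or a length-weighted mean, would fit the same description; the choice depends on how strongly isolated matching windows should count.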
In yet another possible implementation manner, the plagiarism report includes an overall plagiarism condition and a local plagiarism condition of the text to be detected, where the local plagiarism condition refers to the proportion of plagiarized text in each target detection segment, the overall plagiarism condition refers to the proportion of plagiarized text in the whole text to be detected, and the plagiarized text refers to the text in the target detection segment obtained by plagiarizing the reference segment in the manner of the corresponding plagiarism type;
for each target detection segment, acquiring target elements larger than an element similarity threshold value from a contrast matrix corresponding to the target detection segment;
aiming at each target element, taking semantic units in a target detection fragment corresponding to the target element as plagiarism texts;
determining the first character number of the plagiarism text and the second character number of the target detection segment, and determining the local plagiarism condition according to the first character number and the second character number;
And determining the whole plagiarism condition according to the first character number of the plagiarism text of each target detection segment and the third character number of the text to be detected.
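The local and overall plagiarism ratios can be sketched as below. The sketch assumes that a semantic unit counts as plagiarized when any element in its row of the contrast matrix exceeds the element similarity threshold, and that character counts are taken over the units themselves (separators ignored). All names and numbers are hypothetical.

```python
def plagiarized_units(units, matrix, threshold):
    """Units whose contrast-matrix row has at least one element above the threshold."""
    return [u for u, row in zip(units, matrix) if any(s > threshold for s in row)]


def local_ratio(units, matrix, threshold):
    """Return (first character count, local ratio) for one target detection segment."""
    first = sum(len(u) for u in plagiarized_units(units, matrix, threshold))
    second = sum(len(u) for u in units)  # characters in the whole segment
    return first, (first / second if second else 0.0)


def overall_ratio(first_counts, total_chars):
    """Plagiarized characters over all segments divided by the length of the full text."""
    return sum(first_counts) / total_chars if total_chars else 0.0


units = ["abcd", "efghij"]
matrix = [[0.9, 0.1], [0.2, 0.3]]
first, local = local_ratio(units, matrix, threshold=0.8)
overall = overall_ratio([first], total_chars=40)
```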
In another possible implementation manner, the text to be detected is segmented, and at least one keyword contained in the text to be detected is obtained;
acquiring a reference text corresponding to the reference fragment;
for each keyword, determining the occurrence frequency of the keywords in the fragments to be detected, and determining the number of fragments containing the keywords in the reference text;
And determining the TF-IDF value of the keywords in the reference text according to the frequency, the total fragment number of the reference text and the fragment number of the keywords contained in the reference text.
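A minimal sketch of this TF-IDF computation is given below. The text names the inputs (keyword frequency in the segment to be detected, total number of segments in the reference text, number of reference segments containing the keyword) but not the exact formula, so the length-normalized term frequency and the smoothed IDF `log((1 + N) / (1 + df)) + 1` used here are assumptions. Inputs are assumed to be already word-segmented.

```python
import math


def tf_idf(keyword, fragment_tokens, reference_fragments):
    """TF-IDF of one keyword.

    fragment_tokens: tokens of the segment to be detected.
    reference_fragments: list of token lists, one per segment of the reference text.
    """
    # Term frequency: occurrences in the segment, normalized by its length (assumption).
    tf = fragment_tokens.count(keyword) / len(fragment_tokens) if fragment_tokens else 0.0
    n = len(reference_fragments)  # total segments in the reference text
    df = sum(1 for frag in reference_fragments if keyword in frag)
    # Smoothed IDF keeps the value finite and positive (assumed variant).
    idf = math.log((1 + n) / (1 + df)) + 1.0
    return tf * idf


fragment = ["retrieval", "tree", "retrieval", "index"]
reference_fragments = [["retrieval", "tree"], ["cluster", "node"], ["vector", "index"]]
value = tf_idf("retrieval", fragment, reference_fragments)
```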
In yet another possible implementation, for each text in the corpus, the text is segmented into a plurality of text segments, and for each text segment, a vector representation of the text segment is determined;
Constructing a leaf node of the search tree based on the vector representation of each text segment of each text;
performing multiple recursions according to all leaf nodes until any recursion stopping condition is met;
wherein each round of recursion includes:
obtaining each element of the current round of recursion, wherein the elements of the first round of recursion are the text fragments;
Clustering all elements through a clustering algorithm to obtain at least one cluster;
for each cluster, determining a vector representation of the cluster from the vector representations of the elements contained in the cluster;
Each cluster is used as an element of the next round of recursion, and the elements obtained in the current round serve as the nodes of a new layer in the search tree, located one layer above the nodes of the previous round.
In yet another possible implementation, the recursive stopping condition includes any one of:
the search tree reaches a preset number of layers;
The number of clusters is lower than a preset number threshold;
The similarity between any two clusters is lower than a preset cluster similarity threshold.
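The layer-by-layer construction and the three stopping conditions can be sketched as follows. The clustering algorithm is not specified in the text, so a simple greedy cosine-threshold clustering is swapped in here, and the cluster vector is assumed to be the mean of its members. The names `Node`, `cluster` and `build_tree` and all default parameters are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    vector: List[float]
    children: List["Node"] = field(default_factory=list)
    text: Optional[str] = None  # set only on leaf nodes


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(nodes):
    dim = len(nodes[0].vector)
    return [sum(n.vector[i] for n in nodes) / len(nodes) for i in range(dim)]


def cluster(nodes, threshold):
    """Greedy clustering (assumed): join the first group whose centroid is close enough."""
    groups = []
    for node in nodes:
        for group in groups:
            if cosine(node.vector, mean_vector(group)) >= threshold:
                group.append(node)
                break
        else:
            groups.append([node])
    return [Node(vector=mean_vector(g), children=g) for g in groups]


def build_tree(leaves, max_layers=4, min_clusters=2, threshold=0.8):
    """Return the layers of the tree, leaves first, top layer last."""
    layers = [leaves]
    while len(layers) < max_layers:          # stop: preset number of layers
        parents = cluster(layers[-1], threshold)
        if len(parents) == len(layers[-1]):  # guard: nothing merged this round
            break
        layers.append(parents)
        if len(parents) < min_clusters:      # stop: too few clusters
            break
        if all(cosine(a.vector, b.vector) < threshold
               for i, a in enumerate(parents) for b in parents[i + 1:]):
            break                            # stop: clusters are mutually dissimilar
    return layers


leaves = [Node([1.0, 0.0], text="a"), Node([0.9, 0.1], text="b"),
          Node([0.0, 1.0], text="c"), Node([0.1, 0.9], text="d")]
layers = build_tree(leaves)
```

At query time, a tree like this allows a search to compare the query vector against cluster vectors in the top layer first and descend only into the most similar clusters, rather than scanning every leaf.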
In yet another possible implementation, the type of plagiarism output by the type prediction model includes at least one of:
Copying;
Semantic replacement;
Structural reorganization;
a plagiarism fragment of the copying type is obtained by copying the sample fragment;
a plagiarism fragment of the semantic replacement type is obtained by rewriting the sample fragment through a synonym replacement function;
a plagiarism fragment of the structural reorganization type is obtained by reordering the sample fragment.
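As a rough illustration of how labeled training samples for the three plagiarism types might be produced, the sketch below derives one plagiarism fragment per type from a sample fragment. The synonym table, the clause-reversal rule and all function names are hypothetical; a real synonym replacement function and reordering strategy would be considerably richer.

```python
# Hypothetical synonym table used only for this illustration.
SYNONYMS = {"detect": "identify", "text": "passage"}


def make_copy(sample: str) -> str:
    """Copying: the plagiarism fragment is identical to the sample."""
    return sample


def make_semantic_replacement(sample: str, synonyms=SYNONYMS) -> str:
    """Semantic replacement: swap words for synonyms, keep the word order."""
    return " ".join(synonyms.get(w, w) for w in sample.split())


def make_structural_reorganization(sample: str) -> str:
    """Structural reorganization: reverse the order of comma-separated clauses."""
    clauses = [c.strip() for c in sample.split(",")]
    return ", ".join(reversed(clauses))


def training_samples(sample: str):
    """Return ((sample, plagiarism fragment), label) pairs for one sample fragment."""
    return [
        ((sample, make_copy(sample)), "copying"),
        ((sample, make_semantic_replacement(sample)), "semantic replacement"),
        ((sample, make_structural_reorganization(sample)), "structural reorganization"),
    ]


sample = "tools detect reused text and label it, then report it"
pairs = training_samples(sample)
```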
According to a second aspect of an embodiment of the present application, there is provided a text detection apparatus, including:
a dividing module, used for dividing a text to be detected into at least one segment to be detected, determining a vector representation of the segment to be detected for each segment to be detected, and determining, based on the vector representation and a pre-constructed search tree, a reference segment corresponding to the segment to be detected from the search tree;
the determining module is used for determining target similarity of the fragments to be detected according to semantic similarity between the fragments to be detected and the corresponding reference fragments and TF-IDF values of the fragments to be detected, wherein the TF-IDF values are used for representing importance of words in the fragments to be detected in the reference text, and the target similarity is used for representing similarity between the fragments to be detected and the corresponding reference fragments;
the selection module is used for selecting the fragments to be detected, the target similarity of which is greater than a preset similarity threshold value, from the fragments to be detected as target detection fragments;
the input module is used for inputting the target detection fragments and the reference fragments corresponding to the target detection fragments into the type prediction model to obtain the plagiarism type corresponding to the target detection fragments output by the type prediction model, and generating a detection result, wherein the detection result comprises the plagiarism type corresponding to each target detection fragment;
The type prediction model is trained by taking a sample fragment and a plagiarism fragment corresponding to the sample fragment as training samples and taking the plagiarism type of the plagiarism fragment as a training label.
According to a third aspect of embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the steps of the method as provided in the first aspect when the program is executed.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as provided by the first aspect.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions stored in a computer readable storage medium; when a processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, the computer device performs the steps of the method as provided by the first aspect.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
According to the text detection method provided by the embodiment of the application, the text to be detected is divided into at least one segment to be detected, the reference segment with the highest semantic similarity with each segment to be detected is determined from the search tree of the pre-built corpus according to the vector representation of each segment to be detected, the target similarity representing the similarity degree between the segment to be detected and the corresponding reference segment is determined according to the semantic similarity between the segment to be detected and the corresponding reference segment and the TF-IDF value of the segment to be detected, the segment to be detected with the target similarity being larger than the preset similarity threshold is selected as the target detection segment, the target detection segment is input into the pre-trained type prediction model for each target detection segment, and the plagiarism type corresponding to the target detection segment output by the type prediction model is obtained, so that the detection result containing the plagiarism type corresponding to each target detection segment is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of a system architecture for implementing a text detection method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a text detection method according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of generating a plagiarism report in the text detection method according to the embodiment of the present application;
Fig. 4 is a flow chart of a method for determining context matching degree in a text detection method according to an embodiment of the present application;
Fig. 5 is a flow chart of a method for acquiring TF-IDF values in a text detection method according to an embodiment of the present application;
Fig. 6 is a flow chart of a method for constructing a search tree in a text detection method according to an embodiment of the present application;
Fig. 7 is a schematic flow chart of each iteration in a method for constructing a search tree according to an embodiment of the present application;
Fig. 8 is a flow chart of a method for determining local plagiarism and global plagiarism in a text detection method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a text detection device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" may be implemented as "A", or as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Related techniques judge text similarity by comparing the overlap ratio between the text to be detected and existing documents in a database, or by comparing the text in units of sentences or fixed-length short texts. Such methods work well for directly copied or lightly rewritten plagiarism.
However, text can now be semantically rewritten and reconstructed by large models such as GPT-4. Even when the words or phrases of the original text are changed substantially, the semantics of sentences and paragraphs can remain consistent. Because related text detection methods rely on literal matching, they cannot effectively perceive semantic plagiarism once the sentence patterns and vocabulary of the text have been completely changed. As a result, they cannot detect plagiarism in deeply processed text and cannot accurately distinguish the plagiarism type of the text.
To address at least one of the above technical problems or areas for improvement in the related technology, the application provides a text detection method. The method divides a text to be detected into at least one segment to be detected and, according to the vector representation of each segment to be detected, determines from the search tree of a pre-constructed corpus the reference segment with the highest semantic similarity to that segment. According to the semantic similarity between the segment to be detected and the corresponding reference segment and the TF-IDF value of the segment to be detected, the method determines the target similarity, which represents the degree of similarity between the segment to be detected and the corresponding reference segment. Segments to be detected whose target similarity is greater than a preset similarity threshold are selected as target detection segments. Each target detection segment is input into a pre-trained type prediction model to obtain the plagiarism type corresponding to the target detection segment, thereby obtaining a detection result containing the plagiarism type corresponding to each target detection segment.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Fig. 1 is a schematic diagram of a system architecture for implementing a text detection method according to an embodiment of the present application, where the system architecture includes a terminal 120 and a server 140.
The terminal 120 installs and runs an application program of the text detection method, and the terminal 120 is used for determining a detection result of the text to be detected.
The terminal 120 is connected to the server 140 through a wireless network or a wired network.
Server 140 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. Illustratively, the server 140 includes a processor 144 and a memory 142, the memory 142 including a display module 1421, a control module 1422, and a receiving module 1423. The server 140 is used to provide background services for applications of the method. Optionally, the server 140 performs primary computing, the terminal 120 performs secondary computing, or the server 140 performs secondary computing, the terminal 120 performs primary computing, or a distributed computing architecture is used between the server 140 and the terminal 120 for collaborative computing.
Optionally, the device type of the terminal comprises at least one of a smart phone, a tablet computer, an electronic book reader, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a laptop portable computer and a desktop computer.
Those skilled in the art will recognize that the number of terminals may be greater or smaller. For example, there may be only one terminal, or there may be tens, hundreds, or more. The embodiment of the application does not limit the number or the device type of the terminals.
The embodiment of the application provides a text detection method, as shown in fig. 2, which comprises the following steps:
S101, dividing a text to be detected into at least one segment to be detected, determining a vector representation of the segment to be detected for each segment to be detected, and determining, based on the vector representation and a pre-constructed search tree, a reference segment corresponding to the segment to be detected from the search tree.
In the embodiment of the application, the text to be detected refers to the text on which plagiarism detection is to be performed. Because the text to be detected is usually a long text, it is divided into at least one segment to be detected so that plagiarism in each part can be detected accurately. The vector representation of each segment to be detected is then determined. The vector representation can be obtained through a pre-trained large-scale language model, which can be a bidirectional Transformer pre-trained language model, a Word2Vec model or a GPT series model.
In one example, the text to be detected $T$ is divided to obtain its set of segments to be detected $\{t_1, t_2, \dots, t_n\}$, and a model $M$ generates the vector representation $v_i$ of each segment. The specific formula is as follows:
$$v_i = M(t_i), \quad i = 1, \dots, n$$
Wherein $v_i$ is the embedded vector representation of segment $t_i$ in text $T$.
In the embodiment of the application, the reference segment is the segment with the highest semantic similarity with the segment to be detected in the search tree, the search tree is composed of vector representations of text segments of all texts in the corpus, that is, the search tree of the corpus is a tree for storing vector representations of each text segment of each text in the corpus, the search tree can remarkably accelerate the search process of the reference segment by organizing text vectors of each text in the corpus into a tree structure, and the reference segment with the highest semantic similarity with the segment to be detected can be searched from the search tree by determining the semantic similarity between the segment to be detected and the stored text segment in the search tree.
In the embodiment of the application, the semantic similarity between the segment to be detected and the text segment in the search tree can be determined by calculating the cosine similarity between the vector representation of the segment to be detected and the vector representation of the text segment.
In the embodiment of the application, after the vector representation of each segment to be detected is determined, for each segment to be detected, the semantic similarity between the segment to be detected and the text segments in the search tree is calculated from their vector representations, the semantic similarities are sorted from high to low, and the text segment with the highest semantic similarity to the segment to be detected is selected as the reference segment.
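The segmentation and reference lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application: `embed` is a toy bag-of-words stand-in for the language-model embedding, the fixed `words_per_segment` split is hypothetical, and the lookup scans a flat list of corpus segments instead of descending a search tree.

```python
import math
from collections import Counter

def split_into_segments(text, words_per_segment=8):
    """Split text into consecutive segments of at most words_per_segment words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_segment])
            for i in range(0, len(words), words_per_segment)]

def embed(segment):
    """Toy stand-in (assumption) for a language-model embedding: word counts."""
    return Counter(segment.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def find_reference(segment, corpus_segments):
    """Rank corpus segments by similarity (high to low) and return the best one."""
    query = embed(segment)
    ranked = sorted(((cosine(query, embed(c)), c) for c in corpus_segments),
                    reverse=True)
    best_score, best_segment = ranked[0]
    return best_segment, best_score

corpus = ["the search tree stores vectors of text segments",
          "cosine similarity compares two vectors"]
reference, score = find_reference("a search tree stores segment vectors", corpus)
```

In this toy run the first corpus segment shares four words with the query and is returned as the reference segment.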
S102, for each fragment to be detected, determining the target similarity of the fragment to be detected according to the semantic similarity between the fragment to be detected and the corresponding reference fragment and the TF-IDF value of the fragment to be detected.
In the embodiment of the application, the target similarity is used for representing the similarity degree between the fragment to be detected and the corresponding reference fragment, and the target similarity between the fragment to be detected and the corresponding reference fragment is calculated from multiple dimensions.
In the embodiment of the application, the TF-IDF value is used for representing the importance of words in the fragment to be detected in the reference text, and the target similarity is obtained by a weighted combination of the semantic similarity and the TF-IDF value. The specific formula is as follows:
$$S_{\text{target}} = \alpha \cdot S_{\text{sem}} + (1 - \alpha) \cdot S_{\text{TF-IDF}}$$
where $S_{\text{target}}$ denotes the target similarity, $S_{\text{sem}}$ denotes the semantic similarity between the fragment to be detected and the reference fragment, $S_{\text{TF-IDF}}$ denotes the TF-IDF value of the fragment to be detected, and $\alpha$ is the weight parameter used to adjust the relative importance of semantic matching and keyword matching.
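A weighted combination of this kind can be sketched as below. The convex form with a single weight `alpha` in [0, 1], the function name and the default value are assumptions made for illustration; the application only states that one weight parameter balances semantic matching against keyword matching.

```python
def target_similarity(semantic_sim, tfidf_value, alpha=0.7):
    """Combine semantic similarity and a TF-IDF score with one weight.

    alpha is a hypothetical weight in [0, 1]; the convex form is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic_sim + (1.0 - alpha) * tfidf_value

# Equal weighting of a high semantic score and a moderate keyword score.
score = target_similarity(0.9, 0.5, alpha=0.5)
```

Setting `alpha` close to 1 makes the score follow the semantic similarity almost exclusively, while a smaller value gives keyword overlap more influence.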
S103, selecting the to-be-detected fragments with target similarity larger than a preset similarity threshold from the to-be-detected fragments as target detection fragments.
In the embodiment of the application, after the target similarity between each segment to be detected and its corresponding reference segment is calculated, the segments to be detected whose target similarity is greater than the preset similarity threshold are selected as target detection segments. That is, the preset similarity threshold filters out the segments to be detected with a lower plagiarism probability, and the remaining segments are used as target detection segments for further analysis of the plagiarism condition of the text to be detected.
S104, inputting the target detection fragments and the reference fragments corresponding to the target detection fragments into a type prediction model to obtain the plagiarism type corresponding to each target detection fragment output by the type prediction model, and generating a detection result.
In the embodiment of the application, the type prediction model is used for determining the plagiarism type of a fragment to be detected according to the fragment to be detected and its reference fragment. The plagiarism type refers to the manner in which the fragment to be detected plagiarizes the corresponding reference fragment, for example directly copying the reference fragment as the fragment to be detected, semantically replacing part of the text of the reference fragment to obtain the fragment to be detected, or recombining the structure of the reference fragment to obtain the fragment to be detected.
In the embodiment of the application, the detection result comprises the plagiarism type corresponding to each target detection segment. Because the target detection segments input into the type prediction model are all segments to be detected with a high plagiarism probability, the type prediction model outputs one plagiarism type for each target detection segment based on that segment and its corresponding reference segment.
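The threshold screening of S103 amounts to a simple filter. The sketch below assumes each segment already carries its target similarity; the threshold value 0.6 and the function name are illustrative only.

```python
def select_target_segments(scored_segments, threshold=0.6):
    """Keep segments whose target similarity is strictly greater than threshold."""
    return [segment for segment, score in scored_segments if score > threshold]

scored = [("segment A", 0.82), ("segment B", 0.41), ("segment C", 0.60)]
targets = select_target_segments(scored)
```

Because the comparison is strict, a segment whose score equals the threshold is not selected.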
In the embodiment of the application, the type prediction model is trained by taking a sample fragment and a plagiarism fragment corresponding to the sample fragment as training samples and taking the plagiarism type of the plagiarism fragment as a training label.
In the embodiment of the application, a plagiarism type is first determined, and the sample fragment is then converted into a plagiarism fragment of that plagiarism type. The sample fragment and the plagiarism fragment are used as a training sample, and the plagiarism type is used as the training label. Training the type prediction model on plagiarism fragments generated for different plagiarism types enables it to identify multiple plagiarism types.
In the embodiment of the application, the type prediction model can be trained by few-shot learning, which refers to training a model efficiently and with strong generalization capability when only a small amount of annotated data is available. It aims to enable the model to understand a new task and make accurate predictions on the basis of a few training samples and training labels.
According to the text detection method provided by the embodiment of the application, the text to be detected is divided into at least one segment to be detected, and the reference segment with the highest semantic similarity to each segment to be detected is determined from the pre-built search tree of the corpus according to the vector representation of that segment. The target similarity, which represents the degree of similarity between the segment to be detected and its reference segment, is determined from the semantic similarity between the two and the TF-IDF value of the segment to be detected. The segments to be detected whose target similarity is greater than the preset similarity threshold are selected as target detection segments. Each target detection segment, together with its reference segment, is input into the pre-trained type prediction model, and the plagiarism type output by the model for that segment is obtained, so that a detection result containing the plagiarism type of each target detection segment is obtained.
Based on the foregoing embodiments, as an optional embodiment, after the plagiarism type corresponding to the target detection segment output by the type prediction model is obtained, a plagiarism report is generated. The method for generating the plagiarism report is shown in fig. 3, and the specific content is as follows:
S201, for each target detection segment, determining context matching degree between the context of the target detection segment and the context of the reference segment according to the target detection segment and the corresponding reference segment;
S202, for each target detection segment, calculating semantic similarity of each semantic unit in the target detection segment and each semantic unit in the corresponding reference segment, and constructing a contrast matrix corresponding to the target detection segment;
S203, for each target detection segment, generating a plagiarism report according to the plagiarism type corresponding to the target detection segment, the context matching degree between the target detection segment and the reference segment and the comparison matrix corresponding to the target detection segment.
In S201 of the embodiment of the present application, the context matching degree characterizes how well the context of the target detection segment matches the context of the reference segment. Comparing the two contexts makes it possible to determine the similarity of the target detection segment and the reference segment in expression. Analyzing the plagiarism condition of the text to be detected at this deeper level allows the originality of the text to be evaluated more accurately, improves the accuracy of plagiarism detection, and avoids misjudgment caused by simple surface matching.
In the embodiment of the application, the semantic unit of the fragment refers to a basic unit capable of expressing a complete meaning in the fragment, and the semantic unit can be a single word, a plurality of sub-elements, a phrase or a sentence.
In S202 of the embodiment of the present application, the elements of the generated contrast matrix are the semantic similarities between the semantic units of the target detection segment and the semantic units of the reference segment. Therefore, for each target detection segment, the semantic similarity between each of its semantic units and each semantic unit of the reference segment is calculated one by one, which yields the contrast matrix corresponding to the current target detection segment.
In one example, semantic extraction is performed on the target detection segment and on the corresponding reference segment to obtain at least one semantic unit of each. Each semantic unit of the target detection segment and each semantic unit of the reference segment is converted into a vector representation. For each semantic unit of the target detection segment, the semantic similarity between that unit and every semantic unit of the reference segment is determined, and the comparison matrix corresponding to the target detection segment is constructed from these semantic similarities.
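The construction of the matrix can be sketched as follows, with one row per semantic unit of the target detection segment and one column per semantic unit of the reference segment. The bag-of-words `embed` function is a toy stand-in (an assumption) for whatever vector representation is actually used, and the example units are invented.

```python
import math
from collections import Counter

def embed(unit):
    """Toy stand-in (assumption) for a semantic-unit embedding: word counts."""
    return Counter(unit.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def contrast_matrix(target_units, reference_units):
    """matrix[i][j] = similarity of target unit i and reference unit j."""
    reference_vectors = [embed(r) for r in reference_units]
    return [[cosine(embed(t), r) for r in reference_vectors]
            for t in target_units]

target_units = ["the tree stores vectors", "clusters are merged"]
reference_units = ["the tree stores vectors", "weights are tuned",
                   "nothing shared here"]
matrix = contrast_matrix(target_units, reference_units)
```

Here the first target unit matches the first reference unit exactly, while the second target unit only shares one word with the second reference unit.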
In S203 of the embodiment of the present application, for each target detection segment, the plagiarism type corresponding to the target detection segment, the context matching degree between the target detection segment and the reference segment, and the comparison matrix corresponding to the target detection segment are obtained and combined, so as to obtain a plagiarism report representing the plagiarism condition of the text to be detected.
In the scheme, the context matching degree between each target detection segment and its reference segment and the comparison matrix between each target detection segment and its reference segment are combined with the plagiarism type of the target detection segment, so that the plagiarism condition of the text to be detected is considered from multiple dimensions and a plagiarism report showing both the overall and the local plagiarism condition is generated, which improves the accuracy of plagiarism detection. By comparing the semantic units of the target detection segment and the reference segment one by one, the model can capture local similarity more accurately, which makes it possible to identify subtle differences that may be overlooked in a coarser comparison.
On the basis of the above embodiments, as an alternative embodiment, a method for determining a context matching degree is provided, as shown in fig. 4, and the specific contents include:
S301, dividing a target detection segment into a plurality of first windows in a sliding window mode according to a predefined window size, and converting words of each first window into vector representations;
S302, for each first window of the target detection segment, determining semantic similarity between the first window and the corresponding reference segment according to vector representation of the first window;
S303, determining the context matching degree between the target detection fragment and the reference fragment according to the semantic similarity corresponding to each first window of the target detection fragment.
In the embodiment of the application, the sliding window represents a text segment with a fixed length, and when the sliding window slides, the content in the window changes along with the change of the position.
In S301 of the embodiment of the present application, according to a predefined window size, the target detection segment is divided into a plurality of first windows, the reference segment corresponding to the target detection segment is divided into a plurality of second windows, and the words in each first window are converted into vector representations. That is, each target detection segment has a plurality of first windows, and the text content of each first window differs.
In one example, the target detection segment is "sliding window is a fixed length text segment" and the preset length of the sliding window is 5, which indicates that five words are analyzed each time. The plurality of first windows obtained by splitting include:
The first first window: [ "sliding", "window", "is", "a", "fixed" ]
The second first window: [ "window", "is", "a", "fixed", "length" ]
The third first window: [ "is", "a", "fixed", "length", "text" ]
The fourth first window: [ "a", "fixed", "length", "text", "segment" ]
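The window splitting in this example can be reproduced with a few lines. The sketch assumes the segment is already tokenized into words and that the window advances one word at a time (`stride=1`); both the function name and the stride parameter are illustrative.

```python
def sliding_windows(tokens, size=5, stride=1):
    """Return every window of `size` consecutive tokens, moving `stride` at a time."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(tokens) <= size:
        # A segment shorter than the window yields a single, shorter window.
        return [list(tokens)]
    return [list(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, stride)]

tokens = ["sliding", "window", "is", "a", "fixed", "length", "text", "segment"]
windows = sliding_windows(tokens, size=5)
```

With eight words and a window of five, the call produces four overlapping windows, each shifted by one word.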
In S302 of the embodiment of the present application, for each first window, a cosine similarity between the vector representation of the first window and the vector representation of the corresponding reference segment is calculated, thereby obtaining a semantic similarity between the first window and the reference segment.
In S303 of the embodiment of the present application, the context matching degree between the target detection segment and the reference segment may be obtained in any of the following ways: the average of the semantic similarities between the first windows and the reference segment may be used; the semantic similarities may be weighted, for example by assigning different weights according to the semantic importance of the different first windows and then taking the weighted average; or the largest of the calculated semantic similarities may be selected.
In the scheme, the target detection segment is divided into windows of a preset length by means of a sliding window, and the semantic similarity between each first window and the reference text is calculated, so that the matching degree of local contexts is obtained and the similarity between different parts of the target detection segment and the reference text is analyzed. In addition, the window size can be adjusted dynamically, so that a suitable comparison range can be found for contexts of different lengths and important context information is not omitted in long texts. Both improve the accuracy of the context matching degree calculation.
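The three aggregation options named for S303 can be sketched in one helper. The `mode` names and the way weights are normalized are assumptions for illustration; the per-window similarities are taken as given.

```python
def context_matching_degree(similarities, mode="mean", weights=None):
    """Aggregate per-window similarities into one context matching degree."""
    if not similarities:
        raise ValueError("at least one window similarity is required")
    if mode == "mean":
        return sum(similarities) / len(similarities)
    if mode == "weighted":
        if weights is None or len(weights) != len(similarities):
            raise ValueError("weights must match the number of windows")
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return sum(w * s for w, s in zip(weights, similarities)) / total
    if mode == "max":
        return max(similarities)
    raise ValueError("unknown mode: " + mode)

window_sims = [0.2, 0.8, 0.5]
mean_degree = context_matching_degree(window_sims)
weighted_degree = context_matching_degree(window_sims, "weighted", [1, 2, 1])
max_degree = context_matching_degree(window_sims, "max")
```

The maximum rewards a single strongly matching window, whereas the mean and weighted mean reflect how consistently the whole segment matches.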
On the basis of the above embodiments, as an optional embodiment, the embodiment of the present application further provides a method for obtaining TF-IDF values of fragments to be detected, as shown in fig. 5, where the content is as follows:
S401, segmenting a fragment to be detected to obtain at least one keyword contained in the fragment to be detected;
S402, acquiring a reference text corresponding to the reference fragment;
S403, for each keyword, determining the occurrence frequency of the keyword in the text to be detected, and determining the number of fragments containing the keyword in the reference text;
S404, determining the TF-IDF value of the keywords in the reference text according to the frequency, the total number of fragments of the reference text and the number of fragments containing the keywords in the reference text.
In S401 of the embodiment of the present application, keywords are words that represent the main content of the text to be detected and are highly related to the text subject. The segment to be detected is subjected to word segmentation, which splits it into independent words, and a plurality of keywords highly related to the text subject are obtained from these words.
In S402 of the embodiment of the present application, the full text to which the reference fragment belongs, i.e., the reference text, is determined according to the reference fragment corresponding to the fragment to be detected.
In S403 of the embodiment of the present application, for each keyword, the frequency of occurrence of the keyword in the text to be detected is determined, and the number of fragments containing the keyword in the reference text is determined.
In S404 of the embodiment of the present application, according to the frequency, the total number of segments of the reference text, and the number of segments of the reference text including the keyword, the TF-IDF value of the keyword in the reference text is determined, and the specific formula is as follows:
$$\text{TF-IDF}(w, t) = \text{TF}(w, t) \times \log \frac{N}{n_w}$$
where $\text{TF}(w, t)$ represents the frequency of occurrence of keyword $w$ in the fragment to be detected $t$, $n_w$ is the number of fragments of the reference text that contain keyword $w$, and $N$ is the total number of fragments of the reference text.
In the scheme, computing the TF-IDF values of the keywords reduces the influence of common words and improves the accuracy of the similarity calculation, and the approach is applicable to different text fields and text lengths.
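A small numeric sketch of this calculation follows. It assumes the common form "term frequency times the logarithm of total segments over segments containing the keyword", a relative term frequency, and natural logarithms; these choices, and the zero returned for keywords absent from the reference text, are assumptions rather than details confirmed by the text.

```python
import math

def tf_idf(keyword, segment_tokens, reference_segments):
    """TF-IDF of `keyword`: frequency in the segment to be detected times
    the log-scaled inverse of how many reference segments contain it.

    reference_segments is a list of token lists (one per reference segment).
    """
    if not segment_tokens or not reference_segments:
        return 0.0
    term_frequency = segment_tokens.count(keyword) / len(segment_tokens)
    containing = sum(1 for seg in reference_segments if keyword in seg)
    if containing == 0:
        # Assumption: a keyword absent from the reference text scores zero.
        return 0.0
    return term_frequency * math.log(len(reference_segments) / containing)

segment = ["search", "tree", "search", "index"]
reference = [["search", "tree"], ["index"], ["cluster"], ["vector"]]
value = tf_idf("search", segment, reference)
```

In this example "search" makes up half of the segment and appears in one of four reference segments, so the value is 0.5 times log 4.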
On the basis of the above embodiments, as an alternative embodiment, a method for constructing a search tree is provided, as shown in fig. 6, and the specific contents are as follows:
S501, for each text in a corpus, segmenting the text into a plurality of text segments, and for each text segment, determining a vector representation of the text segment;
S502, constructing a leaf node of a search tree based on the vector representation of each text segment of each text;
S503, performing multiple recursions according to all the leaf nodes until any recursion stopping condition is met.
In S501 of the embodiment of the present application, when a search tree of a corpus is constructed, each text in the corpus is first segmented, the text is segmented into a plurality of text segments of a predefined length, and for each text segment, a vector representation of the text segment is produced based on a large language model.
In S502 of the embodiment of the present application, in the process of constructing the search tree, the vector representation of each text segment of each text forms a leaf node of the search tree.
In S503 of the embodiment of the present application, performing multiple recursions according to all leaf nodes means repeatedly clustering text segments with high semantic similarity through a clustering algorithm, thereby obtaining the abstract semantic vector representations of a plurality of higher-level nodes, until any recursion stopping condition is met.
Referring to fig. 7, the flow of each round of recursion includes:
S601, obtaining each element of the round of recursion, wherein the element of the first round of recursion is a text fragment;
S602, clustering all elements through a clustering algorithm to obtain at least one cluster;
S603, for each cluster, determining the vector representation of the cluster according to the vector representation of the elements contained in the cluster;
S604, taking the clusters as the elements of the next round of recursion and as the nodes of a new layer in the search tree, wherein the nodes obtained in this round of recursion are located one layer above the corresponding nodes of the previous round in the search tree.
In S601 of the embodiment of the present application, in the first round of recursion, elements participating in recursion are text fragments, and in the non-first round of recursion, elements participating in recursion are clusters generated in the last round of recursion.
In S602 of the embodiment of the present application, all elements are clustered by a clustering algorithm to obtain at least one cluster, where the similarity between the vector representations of any two elements in each cluster is not less than a preset cluster similarity threshold. In the first round of recursion, the vector representation of each text segment is determined, the similarity between the vector representations of any two text segments is calculated, and the text segments are clustered by the clustering algorithm according to these similarities to obtain at least one cluster, each cluster containing at least one text segment. In a non-first round of recursion, the vector representation of each element is determined, the similarity between the vector representations of any two elements is calculated, and the elements are clustered in the same way to obtain at least one cluster, each cluster containing at least one element. The clustering algorithm may be a clustering algorithm based on a Gaussian mixture model.
In S603 of the embodiment of the present application, for each cluster, according to the vector representation of the element included in the cluster, the vector representation of the current cluster may be determined, with the following specific formula:
$$v_C = \frac{1}{|C|} \sum_{e \in C} v_e$$
where $C$ is the collection of elements contained in the cluster, $v_e$ is the vector representation of element $e$ of the set, and $v_C$ is the vector representation of the cluster.
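One simple way to derive a cluster vector from its elements is the component-wise mean, sketched below. Using the mean is an assumption made for illustration; the text only states that the cluster vector is determined from the vectors of the contained elements.

```python
def cluster_vector(element_vectors):
    """Component-wise mean of equal-length vectors (assumed aggregation)."""
    if not element_vectors:
        raise ValueError("a cluster must contain at least one element")
    dim = len(element_vectors[0])
    if any(len(v) != dim for v in element_vectors):
        raise ValueError("all vectors must have the same dimension")
    count = len(element_vectors)
    return [sum(v[i] for v in element_vectors) / count for i in range(dim)]

centroid = cluster_vector([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

A cluster with a single element simply inherits that element's vector.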
In the first round of recursion, for each cluster, determining the text fragment contained in the cluster, determining the vector representation of the cluster according to the vector representation of the contained text fragment, and in the non-first round of recursion, determining the element contained in the cluster, and determining the vector representation of the cluster according to the vector representation of the contained element.
In S604 of the embodiment of the present application, after determining the vector representation of the cluster, the cluster obtained by the present round of recursion is used as an element participating in the next round of recursion, the cluster obtained by the present round of recursion is used as a node of a new layer in the search tree, and the node corresponding to the element obtained by the present round of recursion is located at the previous layer of the corresponding node of the previous round in the search tree.
In one example, during the first round of recursion, clusters A, B and C are obtained, cluster A containing text segment 1 and text segment 2, cluster B containing text segment 3 and text segment 4, and cluster C containing text segment 5 and text segment 6. Then in the search tree, the leaf nodes corresponding to text segment 1 and text segment 2 are children of the node corresponding to cluster A, the leaf nodes corresponding to text segment 3 and text segment 4 are children of the node corresponding to cluster B, and the leaf nodes corresponding to text segment 5 and text segment 6 are children of the node corresponding to cluster C.
In the scheme, the text fragments are aggregated based on the similarity among the text fragments by constructing the retrieval tree of the corpus, and the text data of the corpus are efficiently organized and indexed, so that the whole corpus is not required to be traversed when the reference fragments are retrieved, and the time consumed by searching is greatly reduced.
In an embodiment of the present application, the recursive stopping condition includes any one of the following:
The search tree reaches a preset number of layers;
The number of clusters is lower than a preset number threshold;
The similarity between any two clusters is lower than a preset cluster similarity threshold.
In the embodiment of the application, the layer number of the search tree can be predefined, and when the search tree constructed in the recursion process reaches the predefined layer number, the recursion is stopped to obtain the search tree.
In the embodiment of the application, a number threshold for the clusters can be preset. When the number of clusters obtained by clustering in the recursion process is lower than the preset number threshold, the elements no longer need to be clustered, so clustering stops and the search tree is obtained based on the current recursion result.
In the embodiment of the application, a cluster similarity threshold can be predefined. When the similarity between any two clusters is lower than the preset cluster similarity threshold, the similarity between the current clusters is low and they are not suitable for further clustering, so recursion stops and the search tree is obtained.
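The layer-by-layer construction with its stopping conditions can be sketched as follows. This is a simplified illustration, not the described implementation: a greedy seed-based grouping is swapped in for the Gaussian-mixture clustering (it compares each element only with the first member of a cluster, so it does not guarantee pairwise similarity), cluster vectors are means, the "no similar clusters" condition is approximated by "a round merges nothing", and all names and default values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def greedy_cluster(nodes, threshold):
    """Stand-in for GMM clustering: join the first cluster whose seed is similar."""
    clusters = []
    for node in nodes:
        for cluster in clusters:
            if cosine(node["vector"], cluster[0]["vector"]) >= threshold:
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters

def build_search_tree(leaf_vectors, threshold=0.8, max_layers=4, min_clusters=2):
    """Return (top-layer nodes, number of layers built)."""
    level = [{"vector": v, "children": [], "leaf_id": i}
             for i, v in enumerate(leaf_vectors)]
    layers = 1
    while layers < max_layers:              # stop: preset number of layers
        clusters = greedy_cluster(level, threshold)
        if len(clusters) == len(level):     # stop: nothing similar enough to merge
            break
        level = [{"vector": mean_vector([n["vector"] for n in c]), "children": c}
                 for c in clusters]
        layers += 1
        if len(level) < min_clusters:       # stop: too few clusters remain
            break
    return level, layers

roots, layers = build_search_tree([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

For the four example vectors, one round groups them into two clusters and the second round finds nothing left to merge, so the tree has two layers.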
In the embodiment of the application, when the reference fragment of the fragment to be detected is searched in the search tree, a dynamic path selection strategy is adopted, and the node similarity is maximized through recursive search, wherein the specific formula is as follows:
$$c^{*} = \arg\max_{c \in C} \ \mathrm{sim}(v_q, v_c)$$
where $c^{*}$ is the output child node with the greatest similarity among the child nodes $C$ of the current cluster, $v_q$ is the vector representation of the fragment to be detected, and $v_c$ is the vector representation of cluster $c$.
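The path selection can be sketched as a greedy descent that, at every layer, follows the child most similar to the query. The nested-dictionary tree layout, the `text` field on leaves and the example vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(top_nodes, query):
    """Descend from the top layer, always taking the most similar child,
    and return the leaf node that is reached."""
    best = max(top_nodes, key=lambda n: cosine(query, n["vector"]))
    while best["children"]:
        best = max(best["children"], key=lambda n: cosine(query, n["vector"]))
    return best

tree = [
    {"vector": [0.95, 0.05], "children": [
        {"vector": [1.0, 0.0], "children": [], "text": "segment 1"},
        {"vector": [0.9, 0.1], "children": [], "text": "segment 2"},
    ]},
    {"vector": [0.05, 0.95], "children": [
        {"vector": [0.0, 1.0], "children": [], "text": "segment 3"},
    ]},
]
hit = search(tree, [0.8, 0.2])
```

Only one branch per layer is inspected, which is why the whole corpus does not have to be traversed; the trade-off is that a greedy descent can miss a closer leaf that sits under a different branch.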
Based on the above embodiments, as an alternative embodiment, the plagiarism type output by the type prediction model includes at least one of the following:
Copying;
Semantic replacement;
Structural reorganization;
The plagiarism fragment whose plagiarism type is copying is obtained by copying the sample fragment;
The plagiarism fragment whose plagiarism type is semantic replacement is obtained by rewriting the sample fragment through a synonym replacement function;
The plagiarism fragment whose plagiarism type is structural reorganization is obtained by reordering the sentences of the sample fragment.
In the embodiment of the application, the plagiarism fragment whose plagiarism type is copying is obtained by copying the sample fragment, and the formula is as follows:
$$p_{\text{copy}} = s$$
where $s$ is the sample fragment and $p_{\text{copy}}$ is the plagiarism fragment whose plagiarism type is copying.
In the embodiment of the application, the plagiarism fragment whose plagiarism type is semantic replacement is obtained by rewriting the sample fragment through a synonym replacement function, and the formula is as follows:
$$p_{\text{sem}} = f_{\text{syn}}(s, w)$$
where $p_{\text{sem}}$ is the plagiarism fragment whose plagiarism type is semantic replacement, $f_{\text{syn}}$ is the synonym replacement function, $s$ is the sample fragment, and $w$ denotes the words in the sample fragment that are semantically replaced.
In the embodiment of the application, the plagiarism fragment whose plagiarism type is structural reorganization is obtained by reordering the sentences of the sample fragment, and the specific formula is as follows:
$$p_{\text{struct}} = g(s)$$
where $g$ is the function that reorders the sentence sequence, $p_{\text{struct}}$ is the plagiarism fragment whose plagiarism type is structural reorganization, and $s$ is the sample fragment, whose sentence sequence set is $\{s_1, s_2, \dots, s_k\}$.
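The generation of labeled training pairs for the three plagiarism types can be sketched as below. The tiny synonym table, the period-based sentence splitting, the seeded shuffle and the label strings are all hypothetical simplifications chosen to keep the example short.

```python
import random

# Hypothetical synonym table; a real system would use a much larger resource.
SYNONYMS = {"quick": "fast", "method": "approach"}

def make_copy(sample):
    """Copying: the plagiarism fragment equals the sample fragment."""
    return sample

def make_semantic_replacement(sample, synonyms=SYNONYMS):
    """Semantic replacement: swap words that have an entry in the table."""
    return " ".join(synonyms.get(word, word) for word in sample.split())

def make_structural_reorganization(sample, seed=0):
    """Structural reorganization: reorder the sentences of the sample."""
    sentences = [s.strip() for s in sample.split(".") if s.strip()]
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    return ". ".join(shuffled) + "."

def build_training_samples(sample):
    """Return (sample fragment, plagiarism fragment, label) triples."""
    return [
        (sample, make_copy(sample), "copy"),
        (sample, make_semantic_replacement(sample), "semantic_replacement"),
        (sample, make_structural_reorganization(sample), "structural_reorganization"),
    ]

triples = build_training_samples("The quick method works. It needs few labels.")
```

Each triple pairs the original fragment with one synthetic variant, so the label is known by construction and no manual annotation is needed for these samples.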
In the embodiment of the application, the training samples generated in the above manner are combined with a small number of manually labeled training labels to construct a training set $D$, which serves as the training input for few-shot learning. Model training is optimized through a loss function, and the formula is as follows:
$$L = -\sum_{i} y_i \log \hat{y}_i$$
where $y_i$ is the true training label and $\hat{y}_i$ is the probability output by the model.
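A loss of this kind can be illustrated with categorical cross-entropy, sketched below. Treating the loss as cross-entropy averaged over samples is an assumption; the text only states that the loss relates the true training labels to the probabilities output by the model. The `eps` clamp is likewise an illustrative safeguard against taking the logarithm of zero.

```python
import math

def cross_entropy(true_labels, predicted_probs, eps=1e-12):
    """Mean categorical cross-entropy.

    true_labels: class indices; predicted_probs: one probability list per sample.
    """
    if not true_labels or len(true_labels) != len(predicted_probs):
        raise ValueError("labels and predictions must be non-empty and aligned")
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total -= math.log(max(probs[label], eps))
    return total / len(true_labels)

# Classes (hypothetical order): 0 = copying, 1 = semantic replacement,
# 2 = structural reorganization.
loss = cross_entropy([0, 2], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
```

The loss shrinks toward zero as the probability assigned to the true class approaches one.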
In the embodiment of the application, the F1 score is used as an index to evaluate the model in the training process, and the higher the F1 score is, the better the model performance is.
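For a multi-class type prediction model, the F1 score can be computed per class and then averaged. The sketch below uses macro averaging, which is an assumption, since the text does not say how the per-class scores are combined; the label strings are illustrative.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score of one class from true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

y_true = ["copy", "semantic", "structural", "copy"]
y_pred = ["copy", "semantic", "copy", "copy"]
score = macro_f1(y_true, y_pred)
```

In this example the model never predicts the structural class, so that class contributes an F1 of zero and pulls the macro average down.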
In the scheme, sample fragments are processed with the operations corresponding to the plagiarism types, so that a type prediction model capable of identifying plagiarism types such as copying, semantic replacement and structural reorganization is trained. Because the model is trained with only a small number of manually labeled cases, the time cost and labor cost required by model training are reduced.
On the basis of the above embodiments, as an alternative embodiment, a method for determining local plagiarism and global plagiarism is provided, as shown in fig. 8, and the specific contents are as follows:
S701, for each target detection segment, acquiring target elements larger than an element similarity threshold value from a contrast matrix corresponding to the target detection segment;
S702, regarding each target element, taking semantic units in a target detection segment corresponding to the target element as plagiarism texts;
S703, determining the first character number of the plagiarism text and the second character number of the target detection segment, and determining the local plagiarism condition according to the first character number and the second character number;
S704, determining the whole plagiarism condition according to the first character number of the plagiarism text of each target detection segment and the third character number of the text to be detected.
In the embodiment of the application, the plagiarism report is used for showing the whole plagiarism condition and the local plagiarism condition of the text to be detected. The local plagiarism condition refers to the proportion of the plagiarism text within each target detection segment, the whole plagiarism condition refers to the proportion of the plagiarism text in the whole text to be detected, and the plagiarism text refers to the text in the text to be detected that was obtained by plagiarizing a reference segment according to the corresponding plagiarism type.
In S701 of the embodiment of the present application, a target element greater than an element similarity threshold is obtained from a comparison matrix of a target detection segment, so as to screen semantic units with a large similarity to a reference segment from the target detection segment, and the elements may be sorted in order from high to low, and then elements greater than the element similarity threshold are obtained as target elements.
In S702 of the embodiment of the present application, each target element has a semantic unit of a corresponding target detection segment, and because the semantic similarity of the target element representation is large, the semantic unit of the target detection segment corresponding to the target element is used as a plagiarism text in the target detection segment.
In S703 of the embodiment of the present application, according to the number of words of the plagiarism text and the number of words of the target detection segment, the duty ratio of the plagiarism text in the target detection segment may be determined, and the first number of characters is used to represent the number of characters of the plagiarism text, and the second number of characters is used to represent the number of characters of the target detection segment, so that the first number of characters of the plagiarism text and the second number of characters of the target detection segment are determined, and the ratio of the first number of characters and the second number of characters is used as the duty ratio of the plagiarism text in the target detection segment, to obtain the local plagiarism situation.
In S704 of the embodiment of the present application, the third number of characters is used to represent the number of characters of the text to be detected, and after determining the number of words of all the plagiarism texts and the number of full text of the text to be detected, the duty ratio of all the plagiarism texts in the full text of the text to be detected may be determined, so that the duty ratio of the plagiarism texts in the full text is determined according to the first number of characters of the plagiarism texts of each target detection segment and the third number of characters of the text to be detected, and the overall plagiarism situation is obtained.
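Steps S701 to S704 can be illustrated with a minimal Python sketch. It is not the implementation of the application: the function names are hypothetical, and it assumes that each row of the contrast matrix corresponds to one semantic unit of the target detection segment and each column to one semantic unit of the reference segment.

```python
from typing import List, Sequence


def plagiarism_units(units: Sequence[str],
                     contrast_matrix: Sequence[Sequence[float]],
                     element_threshold: float) -> List[str]:
    """S701-S702: keep the semantic units whose row contains a target element.

    ASSUMPTION: row i of the matrix belongs to units[i].
    """
    return [unit for unit, row in zip(units, contrast_matrix)
            if any(value > element_threshold for value in row)]


def local_ratio(plagiarism_text: Sequence[str], segment: str) -> float:
    """S703: first number of characters / second number of characters."""
    first = sum(len(unit) for unit in plagiarism_text)
    second = len(segment)
    return first / second if second else 0.0


def overall_ratio(plagiarism_text_per_segment: Sequence[Sequence[str]],
                  full_text: str) -> float:
    """S704: summed first numbers of characters / third number of characters."""
    first_total = sum(len(unit)
                      for units in plagiarism_text_per_segment
                      for unit in units)
    third = len(full_text)
    return first_total / third if third else 0.0
```

For a segment `"abcdefghij"` split into the units `"abcd"`, `"efgh"` and `"ij"`, where only the first and last unit exceed the threshold, the local ratio is 6 / 10.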
In the embodiment of the application, the plagiarism report may include the plagiarism type of each target detection segment in the text to be detected, the local plagiarism situation and the overall plagiarism situation of the text to be detected, and the context matching degree between each target detection segment in the text to be detected and the reference text.
In embodiments of the present application, the plagiarism report may be presented using natural language generation, an artificial intelligence technique used by large models that converts data into natural language text that humans can understand. In the context of plagiarism reports, natural language generation is used to convert the analysis results into an easily understood report. In natural language generation, text generation is generally regarded as a sequence generation problem, i.e. constructing sentences word by word, and the output of the plagiarism report is optimized through the following formula:
$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$
wherein the sequence generation probability $P(w_t \mid w_1, \ldots, w_{t-1})$ represents the probability that the next word $w_t$ occurs on the basis of the word sequence $w_1, \ldots, w_{t-1}$ that has already been generated.
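As an illustration of word-by-word sequence generation, the following sketch scores a sequence from the conditional probability of each next word given the words already generated. The probabilities are made-up numbers, not the output of any model, and the function name is hypothetical. The sum is taken in log space, a common way to avoid numerical underflow on long sequences.

```python
import math
from typing import Sequence


def sequence_log_probability(conditional_probs: Sequence[float]) -> float:
    """Sum the log of each next-word probability (chain rule of probability).

    conditional_probs[t] is the assumed probability of word t given words 0..t-1.
    """
    return sum(math.log(p) for p in conditional_probs)


# Hypothetical next-word probabilities for a three-word sentence.
probs = [0.5, 0.4, 0.25]
sequence_probability = math.exp(sequence_log_probability(probs))
```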
In the embodiment of the application, noise and redundant data are removed by using a natural language processing tool before the text to be detected is processed.
In the embodiment of the application, when the text to be detected is divided, a long document is divided into logical paragraphs, and a rule-based algorithm is used to ensure that the paragraphs are complete and to avoid semantic loss.
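The application does not spell out the rules used for division. The following sketch shows one possible rule set under stated assumptions: paragraphs are separated by blank lines, and a piece shorter than a hypothetical `min_chars` threshold (for example a heading) is merged into its neighbor so that no fragment is left without context.

```python
import re
from typing import List


def split_into_paragraphs(text: str, min_chars: int = 40) -> List[str]:
    """Split on blank lines, then merge short pieces into the next paragraph.

    ASSUMPTION: both rules are illustrative, not taken from the application.
    """
    raw = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    paragraphs: List[str] = []
    carry = ""
    for piece in raw:
        candidate = f"{carry}\n{piece}" if carry else piece
        if len(candidate) < min_chars:
            carry = candidate          # too short to stand alone yet
        else:
            paragraphs.append(candidate)
            carry = ""
    if carry:
        # A trailing short piece joins the last paragraph if there is one.
        if paragraphs:
            paragraphs[-1] = f"{paragraphs[-1]}\n{carry}"
        else:
            paragraphs.append(carry)
    return paragraphs
```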
In the embodiment of the application, when the text to be detected is determined, it can be automatically classified by subject, so that the corpus related to that subject can be quickly matched in the retrieval stage.
The embodiment of the application provides a text detection device, as shown in fig. 9, the text detection device 90 may include a dividing module 901, a determining module 902, a selecting module 903 and an input module 904.
Specifically, a dividing module 901, configured to divide a text to be detected into at least one segment to be detected, determine, for each segment to be detected, a vector representation of the segment to be detected, and determine, based on the vector representation and a search tree constructed in advance, a reference segment corresponding to the segment to be detected from the search tree, where the reference segment is the segment in the search tree with the highest semantic similarity to the segment to be detected, and the search tree is composed of vector representations of text segments of all texts in a corpus;
A determining module 902, configured to determine, for each fragment to be detected, a target similarity of the fragment to be detected according to a semantic similarity between the fragment to be detected and a corresponding reference fragment and a TF-IDF value of the fragment to be detected, where the TF-IDF value is used to represent importance of a word in the fragment to be detected in a reference text, and the target similarity is used to represent a similarity between the fragment to be detected and the corresponding reference fragment;
The selecting module 903 is configured to select, from among the to-be-detected fragments, a to-be-detected fragment having a target similarity greater than a preset similarity threshold as a target detection fragment;
The input module 904 is configured to input, for each target detection segment, the target detection segment and the reference segment corresponding to the target detection segment into the type prediction model, obtain the plagiarism type corresponding to the target detection segment output by the type prediction model, and generate a detection result, where the detection result includes the plagiarism type corresponding to each target detection segment;
The type prediction model is trained by taking a sample fragment and a plagiarism fragment corresponding to the sample fragment as training samples and taking the plagiarism type of the plagiarism fragment as a training label.
The device of the embodiment of the present application may perform the method provided by the embodiment of the present application, and its implementation principle is similar. The actions performed by each module of the device correspond to the steps of the method of the embodiment of the present application. For a detailed functional description of each module, refer to the description of the corresponding method given above, which is not repeated here.
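The flow through the determining, selecting and input modules can be sketched as follows. This is an illustration only. The weighted sum in `target_similarity`, the `weight` and `threshold` values, and the `predict_type` callable standing in for the type prediction model are all assumptions, and the TF-IDF value is assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class Candidate:
    segment: str                 # segment to be detected
    reference: str               # reference segment found in the search tree
    semantic_similarity: float   # similarity between segment and reference
    tfidf: float                 # TF-IDF value, assumed normalized to [0, 1]


def target_similarity(c: Candidate, weight: float = 0.7) -> float:
    """ASSUMPTION: a weighted sum; the section does not fix the combination."""
    return weight * c.semantic_similarity + (1.0 - weight) * c.tfidf


def detect(candidates: Sequence[Candidate],
           predict_type: Callable[[str, str], str],
           threshold: float = 0.8,
           weight: float = 0.7) -> Dict[str, str]:
    """Select target detection segments and record a plagiarism type for each."""
    result: Dict[str, str] = {}
    for c in candidates:
        if target_similarity(c, weight) > threshold:
            # predict_type is a hypothetical stand-in for the trained model.
            result[c.segment] = predict_type(c.segment, c.reference)
    return result
```

Passing the model as a callable keeps the selection logic testable without a trained type prediction model.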
According to the text detection device provided by the embodiment of the application, the text to be detected is divided into at least one segment to be detected, the reference segment with the highest semantic similarity with each segment to be detected is determined from the search tree of the pre-built corpus according to the vector representation of each segment to be detected, the target similarity representing the similarity degree between the segment to be detected and the corresponding reference segment is determined according to the semantic similarity between the segment to be detected and the corresponding reference segment and the TF-IDF value of the segment to be detected, the segment to be detected with the target similarity being larger than the preset similarity threshold is selected as the target detection segment, the target detection segment is input into the pre-trained type prediction model for each target detection segment, and the plagiarism type corresponding to the target detection segment output by the type prediction model is obtained, so that the detection result containing the plagiarism type corresponding to each target detection segment is obtained.
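The TF-IDF value mentioned above can be sketched from the three quantities the method uses: the frequency of a keyword in the segment to be detected, the total number of segments in the reference text, and the number of reference segments that contain the keyword. The function name and the smoothing term are assumptions. The smoothed form `log((1 + N) / (1 + n)) + 1` is one common variant, chosen here so that the value stays defined when no reference segment contains the keyword.

```python
import math
from typing import Sequence


def tf_idf(keyword: str,
           segment_tokens: Sequence[str],
           reference_segments: Sequence[Sequence[str]]) -> float:
    """TF-IDF of one keyword of the segment to be detected.

    ASSUMPTION: tokens come from an upstream word segmentation step.
    """
    if not segment_tokens:
        return 0.0
    # Term frequency: share of the segment's tokens that equal the keyword.
    tf = list(segment_tokens).count(keyword) / len(segment_tokens)
    total = len(reference_segments)
    containing = sum(1 for seg in reference_segments if keyword in seg)
    # Smoothed inverse document frequency (an assumed variant).
    idf = math.log((1 + total) / (1 + containing)) + 1.0
    return tf * idf
```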
The embodiment of the application provides an electronic device (computer device/system) comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the text detection method. Compared with the related art, the following can be achieved: the text to be detected is divided into at least one segment to be detected; the reference segment with the highest semantic similarity to each segment to be detected is determined from the search tree of a pre-built corpus according to the vector representation of each segment to be detected; the target similarity, which represents the degree of similarity between the segment to be detected and the corresponding reference segment, is determined according to the semantic similarity between the segment to be detected and the corresponding reference segment and the TF-IDF value of the segment to be detected; the segments to be detected whose target similarity is greater than a preset similarity threshold are selected as target detection segments; and each target detection segment is input into a pre-trained type prediction model to obtain the plagiarism type corresponding to that target detection segment output by the type prediction model.
In an alternative embodiment, an electronic device is provided. As shown in FIG. 10, the electronic device 4000 comprises a processor 4001 and a memory 4003, wherein the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, e.g., a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this does not mean that there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, EEPROM (Electrically Erasable Programmable Read Only Memory), CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program for executing an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
The electronic device may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), and stationary terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 10 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content. The method comprises the steps of dividing a text to be detected into at least one segment to be detected, determining a reference segment with highest semantic similarity with each segment to be detected from a search tree of a pre-built corpus according to vector representation of each segment to be detected, determining target similarity representing similarity degree between the segment to be detected and the corresponding reference segment according to semantic similarity between the segment to be detected and TF-IDF value of the segment to be detected, selecting the segment to be detected with target similarity larger than a preset similarity threshold as a target detection segment, inputting the target detection segment into a pre-trained type prediction model for each target detection segment, and obtaining a plagiarism type corresponding to the target detection segment output by the type prediction model, thereby obtaining a detection result containing the plagiarism type corresponding to each target detection segment.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The embodiment of the application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps and corresponding content of the foregoing method embodiments. Compared with the prior art, the following can be achieved:
According to the method, a to-be-detected text is divided into at least one to-be-detected segment, a reference segment with highest semantic similarity to each to-be-detected segment is determined from a search tree of a pre-built corpus according to vector representation of each to-be-detected segment, and according to the semantic similarity between the to-be-detected segment and the corresponding reference segment and TF-IDF value of the to-be-detected segment, target similarity representing similarity degree between the to-be-detected segment and the corresponding reference segment is determined, the to-be-detected segment with target similarity being larger than a preset similarity threshold is selected as a target detection segment, the target detection segment is input into a pre-trained type prediction model for each target detection segment, and then a plagiarism type corresponding to the target detection segment output by the type prediction model is obtained.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, the order in which these steps are implemented is not limited to the order indicated by the arrows. In some implementations of embodiments of the application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, which is not limited by the embodiment of the present application.
The foregoing is only an optional implementation for some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the present application, adopted without departing from those technical ideas, also fall within the protection scope of the embodiments of the present application.

Claims (12)

1. A text detection method, comprising:
dividing a text to be detected into at least one segment to be detected, determining, for each segment to be detected, a vector representation of the segment to be detected, and determining a reference segment corresponding to the segment to be detected from a pre-constructed search tree based on the vector representation and the search tree, wherein the reference segment is the segment in the search tree with the highest semantic similarity to the segment to be detected, and the search tree consists of vector representations of text segments of all texts in a corpus;
for each fragment to be detected, determining target similarity of the fragment to be detected according to semantic similarity between the fragment to be detected and a corresponding reference fragment and TF-IDF value of the fragment to be detected, wherein the TF-IDF value is used for representing importance of words in the fragment to be detected in a reference text, and the target similarity is used for representing similarity degree between the fragment to be detected and the corresponding reference fragment;
Selecting a fragment to be detected, of which the target similarity is greater than a preset similarity threshold, from all fragments to be detected as a target detection fragment;
for each target detection fragment, inputting the target detection fragment and the reference fragment corresponding to the target detection fragment into a type prediction model, obtaining the plagiarism type corresponding to the target detection fragment output by the type prediction model, and generating a detection result, wherein the detection result comprises the plagiarism type corresponding to each target detection fragment;
the type prediction model is trained by taking a sample fragment and a plagiarism fragment corresponding to the sample fragment as training samples and taking the plagiarism type of the plagiarism fragment as a training label.
2. The method according to claim 1, wherein, after obtaining the plagiarism type corresponding to the target detection segment output by the type prediction model, the method further comprises:
For each target detection segment, determining the context matching degree between the context of the target detection segment and the context of the reference segment according to the target detection segment and the corresponding reference segment, wherein the context matching degree is used for representing the matching degree of the context between the target detection segment and the reference segment;
For each target detection segment, calculating the semantic similarity of each semantic unit in the target detection segment and each semantic unit in the corresponding reference segment, and constructing a contrast matrix corresponding to the target detection segment, wherein elements in the contrast matrix are the semantic similarity between the semantic units of the target detection segment and the semantic units of the reference text;
And for each target detection fragment, generating a plagiarism report according to the plagiarism type corresponding to the target detection fragment, the context matching degree between the target detection fragment and the reference fragment and the contrast matrix corresponding to the target detection fragment.
3. The method of claim 2, wherein determining a degree of context matching between the context of the target detection segment and the context of the reference segment comprises:
dividing the target detection segment into a plurality of first windows in a sliding window mode according to a predefined window size, and converting words of each first window into vector representations;
For each first window of the target detection segment, determining semantic similarity between the first window and a corresponding reference segment according to vector representation of the first window;
and determining the context matching degree between the target detection fragment and the reference fragment according to the semantic similarity corresponding to each first window of the target detection fragment.
4. The method according to claim 2, wherein the plagiarism report includes an overall plagiarism situation and a local plagiarism situation of the text to be detected, the local plagiarism situation being the proportion of plagiarism text in each target detection segment, the overall plagiarism situation being the proportion of plagiarism text in the full text of the text to be detected, and the plagiarism text being text in the target detection segment obtained by plagiarizing the reference segment according to the corresponding plagiarism type;
The step of obtaining the local plagiarism and the whole plagiarism comprises the following steps:
For each target detection segment, acquiring a target element larger than an element similarity threshold value from a contrast matrix corresponding to the target detection segment;
Aiming at each target element, taking semantic units in a target detection fragment corresponding to the target element as plagiarism texts;
Determining the first character number of the plagiarism text and the second character number of the target detection segment, and determining the local plagiarism condition according to the first character number and the second character number;
And determining the whole plagiarism condition according to the first character number of the plagiarism text of each target detection segment and the third character number of the text to be detected.
5. The method according to claim 1, wherein the step of obtaining TF-IDF values of the fragments to be detected comprises:
word segmentation is carried out on the text to be detected, and at least one keyword contained in the text to be detected is obtained;
acquiring a reference text corresponding to the reference fragment;
For each keyword, determining the occurrence frequency of the keyword in the fragments to be detected, and determining the number of fragments containing the keyword in the reference text;
And determining the TF-IDF value of the keyword in the reference text according to the frequency, the total fragment number of the reference text and the fragment number of the keyword contained in the reference text.
6. The method according to claim 1, wherein the search tree construction method comprises:
For each text in a corpus, segmenting the text into a plurality of text segments, and for each text segment, determining a vector representation of the text segment;
constructing a leaf node of the search tree based on the vector representation of each text segment of each text;
performing multiple recursions according to all leaf nodes until any recursion stopping condition is met;
wherein each round of recursion includes:
obtaining each element of the current round of recursion, wherein the elements of the first round of recursion are text fragments;
Clustering all elements through a clustering algorithm to obtain at least one cluster;
for each cluster, determining a vector representation of the cluster according to the vector representation of the element contained in the cluster;
And taking the cluster as an element of the next round of recursion, taking the element obtained by the round of recursion as a node of a new layer in the search tree, and locating the node corresponding to the element obtained by the round of recursion at the upper layer of the corresponding node of the previous round in the search tree.
7. The method of claim 6, wherein the recursive stopping condition comprises any one of:
the search tree reaches the preset layer number;
the number of the clusters is lower than a preset number threshold;
The similarity between any two clusters is lower than a preset cluster similarity threshold.
8. The method of claim 1, wherein the type of plagiarism output by the type prediction model comprises at least one of:
Copying;
Semantic replacement;
Structural reorganization;
wherein a plagiarism fragment whose plagiarism type is copying is obtained by copying the sample fragment;
a plagiarism fragment whose plagiarism type is semantic replacement is obtained by rewriting the sample fragment through a synonym replacement function;
a plagiarism fragment whose plagiarism type is structural reorganization is obtained by reorganizing the order of the sample fragment.
9. A text detection device, comprising:
a dividing module, configured to divide a text to be detected into at least one segment to be detected, determine, for each segment to be detected, a vector representation of the segment to be detected, and determine a reference segment corresponding to the segment to be detected from a pre-constructed search tree based on the vector representation and the search tree, wherein the reference segment is the segment in the search tree with the highest semantic similarity to the segment to be detected, and the search tree consists of vector representations of text segments of all texts in a corpus;
the determining module is used for determining target similarity of each fragment to be detected according to semantic similarity between the fragment to be detected and the corresponding reference fragment and TF-IDF value of the fragment to be detected, wherein the TF-IDF value is used for representing importance of words in the fragment to be detected in a reference text, and the target similarity is used for representing similarity between the fragment to be detected and the corresponding reference fragment;
The selection module is used for selecting the fragments to be detected, the target similarity of which is greater than a preset similarity threshold value, from the fragments to be detected as target detection fragments;
the input module is used for inputting, for each target detection fragment, the target detection fragment and the reference fragment corresponding to the target detection fragment into a type prediction model to obtain the plagiarism type corresponding to the target detection fragment output by the type prediction model, and generating a detection result, wherein the detection result comprises the plagiarism type corresponding to each target detection fragment;
the type prediction model is trained by taking a sample fragment and a plagiarism fragment corresponding to the sample fragment as training samples and taking the plagiarism type of the plagiarism fragment as a training label.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-8.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-8.
12. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-8.
CN202510466943.1A 2025-04-15 2025-04-15 Text detection method, device, electronic device and computer-readable storage medium Pending CN120012754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510466943.1A CN120012754A (en) 2025-04-15 2025-04-15 Text detection method, device, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN120012754A true CN120012754A (en) 2025-05-16

Family

ID=95662722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510466943.1A Pending CN120012754A (en) 2025-04-15 2025-04-15 Text detection method, device, electronic device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN120012754A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110252030A1 (en) * 2010-04-09 2011-10-13 International Business Machines Corporation Systems, methods and computer program products for a snippet based proximal search
CN103678528A (en) * 2013-12-03 2014-03-26 北京建筑大学 Electronic homework plagiarism preventing system and method based on paragraph plagiarism detection
CN112214984A (en) * 2020-10-10 2021-01-12 北京蚂蜂窝网络科技有限公司 Content plagiarism identification method, device, equipment and storage medium
CN113407738A (en) * 2021-07-12 2021-09-17 网易(杭州)网络有限公司 Similar text retrieval method and device, electronic equipment and storage medium
CN119203991A (en) * 2024-09-10 2024-12-27 深圳前海微众银行股份有限公司 Detection methods, devices, equipment, media and products

CN117217218B (en) Emotion dictionary construction method and device for science and technology risk event related public opinion
Xu et al. Is metadata of articles about COVID-19 enough for multilabel topic classification task?
CN120373306B (en) Feasibility study report intelligent analysis and information extraction method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination