
CN115204147A - Data feature fingerprint construction and similarity measurement method and index - Google Patents


Info

Publication number
CN115204147A
Authority
CN
China
Prior art keywords: data, feature, fingerprint, features, similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210741611.6A
Other languages
Chinese (zh)
Inventors: 周晓磊, 华悦琳, 范强, 张骁雄, 严浩, 王芳潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN202210741611.6A
Publication of CN115204147A


Classifications

    • G06F40/216: Handling natural language data; natural language analysis; parsing using statistical methods
    • G06F16/319: Information retrieval of unstructured textual data; indexing structures; inverted lists
    • G06F40/284: Handling natural language data; recognition of textual entities; lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a Bloom-filter-based method for constructing data feature fingerprints: a data feature set is mapped into a fixed-length bit-vector space, so that all features contained in the data are represented uniformly. This effectively reduces the space complexity of a data feature set to a constant level and lowers the storage cost of high-dimensional data features in current big data. Feature similarity is computed as the Hamming distance between two fixed-length bit vectors, which effectively characterizes how similar two data objects are in their features while reducing the order of computational complexity and improving efficiency. For similarity measurement over massive data, a data feature fingerprint index is built as an inverted index: each fingerprint only needs to be split into segments, and each segment is stored as an index key in an inverted-index database of data features, so that similar data can be retrieved directly, further improving the efficiency of fingerprint retrieval and matching.

Description

A Data Feature Fingerprint Construction and Similarity Measurement Method and Index

Technical Field

The present invention relates to data feature fingerprints, and in particular to a method for constructing data feature fingerprints and measuring their similarity, together with an index for them.

Background

Traditional methods for extracting and recognizing test-data features focus on processing test data of a single category or modality, using numerical computation, correlation analysis, and similar techniques to obtain the features of a test-data object, which are then used for association and retrieval.

According to statistics, global data traffic reached 180 and 230 exabytes per month in 2019 and 2020 respectively, and is expected to grow to 780 exabytes per month by 2026. Fixed data traffic accounted for nearly 75% of all data traffic in 2019; as the number of mobile terminals and IoT devices increases, mobile broadband traffic is expected to grow rapidly, reaching nearly one third of the total by 2026. As data volumes grow rapidly and data applications broaden, the scale of the corresponding features grows just as quickly, creating an urgent need for multidimensional feature representation and similarity measurement. Featurization aims to reduce data scale while preserving the original characteristics of the data as far as possible, but existing multidimensional feature representation and similarity measurement methods suffer from complex computation, high time overhead, and high maintenance cost.

At present there are many methods and implementation techniques for measuring data similarity, such as the Sorensen-Dice coefficient, Jaro-Winkler distance, Shingle set similarity, and cosine distance. Their main drawbacks are:

(1) High computational complexity: returning a result requires computing the similarity over all features contained in data U and data V, with complexity O(uv), where u and v are the numbers of feature values in U and V respectively;

(2) High storage cost for data features: all features of every data object must be stored. Because the number of feature values is not fixed, storage must be tailored to the chosen database, causing extra space overhead. With a relational database (RDBMS), for example, an additional table is often needed per data object to store its features;

(3) The computation depends on the original feature values, creating a risk of sensitive data leakage.

Data fingerprinting techniques such as MD5 hashing, locality-sensitive hashing, and similarity hashing (SimHash) build a fingerprint by mapping original data of arbitrary length to a fixed-length hash value through different hash functions. They effectively address problems of computational complexity and storage cost, and because the hash itself carries no semantic information about the features, they can protect sensitive data.

However, these methods target the similarity of data files and are difficult to apply directly to constructing data feature fingerprints. Their main drawbacks are:

(1) Poor scalability: when a data object has multidimensional features, each feature must be hashed separately into its own fingerprint, producing heavy redundancy in the feature fingerprint library;

(2) Poor maintainability: when analysis yields new features for a data object, the original fingerprint can hardly accommodate the insertion of the new features;

(3) Poor support for heterogeneous and cross-modal data: because these methods operate mainly on file content, they can hardly reflect feature-level similarity between heterogeneous or cross-modal data. For example, text data and image data describing the same object have completely different content yet similar features; existing fingerprinting techniques can hardly support such cross-modal similarity measurement.

Therefore, there is an urgent need for a multidimensional feature fingerprint construction method for test data, together with a fast feature-similarity measurement method, to meet application requirements such as duplicate or near-duplicate feature detection, feature-based data retrieval, and data association identification.

Summary of the Invention

To remedy the deficiencies of the prior art, the purpose of the present invention is to provide a method for constructing fixed-length data feature fingerprints that supports maintenance of newly generated features, so that data features are reflected accurately and complex applications such as heterogeneous and cross-modal data are supported; and to provide a similarity measurement method for fingerprints built this way, supporting needs such as data query and retrieval.

To achieve the above goal, the present invention adopts the following technical scheme:

Construction of a data feature fingerprint comprises the following steps:

S1. Extract features (words) from the data using the TF-IDF method;

S2. Based on the Bloom filter (BF) structure, map the features (words) of each data object into the corresponding BF bit-vector table via hash operations, and output the sequence of 0s and 1s in the final bit vector as the fingerprint of that data object's features.

The TF-IDF method in step S1 comprises the following steps:

A1. Compute the term frequency TF of the extracted data features by formula (1):

TF_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

where i is the index of a word, j is the text index, n_{i,j} is the number of times the i-th word appears in the j-th text, and the denominator is the total number of words in the j-th text; that is, the TF value is the ratio of a word's frequency in a text to the total number of words in that text;

A2. Compute the inverse document frequency IDF by formula (2):

IDF_i = log( M / (m_i + α) )    (2)

where M is the total number of texts, m_i is the number of texts containing the i-th word, and α is an empirical coefficient, generally taken as 0.01;

A3. Compute the TF-IDF weight of each feature item of each text vector by formula (3):

TFIDF_{i,j} = TF_{i,j} · IDF_i    (3)

A4. After computing the weight of each feature item with the TF-IDF weighting method, select the items with larger weights as the text's features.
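The extraction in steps A1 to A4 can be sketched as follows. This is an illustrative implementation, not the patent's own code: the tokenized input, the `top_k` cutoff, and the α = 0.01 default are assumptions.

```python
import math
from collections import Counter

def tfidf_features(texts, top_k=3, alpha=0.01):
    """Score each word with TF-IDF (formulas (1)-(3)) and keep the
    top-weighted words as the text's features (step A4)."""
    M = len(texts)                                        # total number of texts
    df = Counter(w for text in texts for w in set(text))  # m_i: texts containing word i
    features = []
    for text in texts:
        counts = Counter(text)
        total = len(text)                                 # words in this text
        # TF = n_ij / total;  IDF = log(M / (m_i + alpha))
        weights = {w: (n / total) * math.log(M / (df[w] + alpha))
                   for w, n in counts.items()}
        features.append(sorted(weights, key=weights.get, reverse=True)[:top_k])
    return features
```

A word that is frequent in one text but rare elsewhere receives a large weight, matching the observation made after formula (3) in the detailed description.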

Step S1 further comprises step S10, preprocessing the data to be extracted, including:

B1. Cleaning the data: taking text data as an example, this includes deleting HTML tags, special non-alphanumeric characters, accented characters, and stop words;

B2. Expanding the abbreviations in the data.

Step S1 further comprises step S12, organizing and standardizing the extracted features, comprising the following steps:

C1. Compute the similarity between different features using cosine similarity;

C2. Set a similarity threshold;

C3. Compare the features pairwise; if the similarity exceeds the threshold, delete the duplicate feature;

C4. Repeat step C3 until every feature (descriptor) is unique.

The cosine similarity computation in step C1 is specifically:

Compute the distance between text feature vectors to obtain the similarity between texts; generally, the closer two text vectors are, the more similar the two texts;

Cosine similarity is used to compute the similarity between different features, with the formula:

cos(A, B) = ( Σ_{i=1..n} x_i·y_i ) / ( √(Σ_{i=1..n} x_i²) · √(Σ_{i=1..n} y_i²) )    (4)

where A = (x_1, x_2, …, x_n) and B = (y_1, y_2, …, y_n) are text vectors.

In step S2, the data features are hashed and mapped into the BF's bit-vector table as follows:

Given a data set A = {a_1, a_2, …, a_n}, use k hash functions {h_1, h_2, …, h_k} to obtain index values h_i(a_j) for the features to be inserted, where i ∈ [1, k], j ∈ [1, n];

In the initial state, all values in the bit-vector table are set to 0; on insertion, the bit-vector values at the k positions given by the index values h_i(a_j) are set to 1;

The sequence of 0s and 1s in the final bit-vector table is output as the fingerprint of the data features.
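A minimal sketch of the bit-vector construction described above. The salted SHA-1 digests standing in for the k hash functions are an illustrative assumption, as is the default width m = 19 taken from the embodiment's example:

```python
import hashlib

def bloom_fingerprint(features, m=19, k=2):
    """Map a feature set into an m-bit vector with k hash functions (step S2)
    and return the 0/1 sequence as the data feature fingerprint."""
    bits = [0] * m
    for feature in features:
        for i in range(k):
            # i-th "hash function": SHA-1 salted with the function index
            digest = hashlib.sha1(f"{i}:{feature}".encode()).hexdigest()
            bits[int(digest, 16) % m] = 1   # a bit already set stays 1 (collision)
    return "".join(map(str, bits))
```

Because insertion only ever sets bits, the fingerprint of a subset of features is bitwise dominated by the fingerprint of the full set, which is what makes incremental feature addition possible.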

In the similarity measurement method applicable to the above data feature fingerprints, identical features of different data objects hash to the same BF positions, so the more feature values two data objects share, the shorter the Hamming distance between their feature fingerprints. Feature similarity can therefore be measured by the Hamming distance. The steps are:

D1. Select the fingerprints of any two data objects from the feature fingerprint set;

D2. Compute the Hamming distance between the two feature fingerprints;

D3. Measure the similarity of the two data objects' features based on the Hamming distance.
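Steps D1 to D3 reduce to a Hamming-distance computation over two equal-length 0/1 strings. The normalization into a [0, 1] score below is an illustrative choice, not fixed by the text:

```python
def hamming(fp_a, fp_b):
    """Hamming distance between two equal-length fingerprints (step D2)."""
    return sum(a != b for a, b in zip(fp_a, fp_b))

def feature_similarity(fp_a, fp_b):
    """Step D3: a shorter Hamming distance means more shared features;
    here scaled so identical fingerprints score 1.0."""
    return 1 - hamming(fp_a, fp_b) / len(fp_a)
```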

The index applicable to the above data feature fingerprints includes building an inverted index library, comprising the following steps:

F1. Number the data objects; with segment length c, split each feature fingerprint of length m into ⌈m/c⌉ segments;

F2. Take the fingerprint value of each segment as the segment value and use it as an index key;

F3. Use the numbers of the data objects that share the same segment value at the same segment position as the index values;

F4. Store each data feature fingerprint in the feature fingerprint index table as a <segment value, data number> key-value pair, forming the inverted index library.
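The index construction in F1 to F4 can be sketched as follows; keying on the pair (segment position, segment value) is one illustrative way to realize "the same segment value at the same segment position":

```python
from collections import defaultdict

def build_index(fingerprints, c=4):
    """Split each fingerprint into segments of length c (F1) and store
    <(segment position, segment value) -> data numbers> pairs (F2-F4)."""
    index = defaultdict(set)
    for data_id, fp in enumerate(fingerprints):
        for pos in range(0, len(fp), c):
            index[(pos // c, fp[pos:pos + c])].add(data_id)
    return index
```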

Retrieval comprises the following steps:

G1. Split the feature fingerprint of the data to be retrieved into ⌈m/c⌉ segments with segment length c;

G2. Look up each segment value in the index and obtain the data numbers for each segment;

G3. Add the data numbers to a set;

G4. Find the data numbers that occur more than ⌈m/c⌉ − h times, where h is the Hamming distance;

G5. Use formula (5) to compute the similarity between the features of the data to be retrieved and the features already in the index library:

sim = ( Σ_b num(b) ) / ⌈m/c⌉    (5)

where the segment length c is assumed to be 4, and num marks whether the corresponding fingerprint number appears at each extraction: num(b) is 1 if it appears and 0 otherwise.
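Steps G1 to G4 can be sketched as below. By the pigeonhole principle, a fingerprint within Hamming distance h of the query must match it exactly on at least ⌈m/c⌉ − h segments, so segment hits are counted and thresholded. The index is assumed to map (segment position, segment value) pairs to sets of data numbers:

```python
import math
from collections import Counter

def query(index, fp, c=4, h=1):
    """Return data numbers that may lie within Hamming distance h of fp:
    count segment hits (G2-G3) and keep ids with at least
    ceil(m/c) - h matching segments (G4)."""
    n_seg = math.ceil(len(fp) / c)
    hits = Counter()
    for pos in range(0, len(fp), c):
        for data_id in index.get((pos // c, fp[pos:pos + c]), ()):
            hits[data_id] += 1
    return {d for d, n in hits.items() if n >= n_seg - h}
```

The candidates returned here would then be scored with formula (5) or an exact Hamming-distance computation.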

The advantages of the present invention are:

(1) High space utilization. The Bloom-filter-based feature fingerprint construction method of the present invention maps a data feature set into a fixed-length bit-vector space, representing all features of the data uniformly; it effectively reduces the space complexity of a data feature set from O(N) to a constant level, lowering the storage cost of high-dimensional features in current big data. Moreover, when features are added, they are still mapped into the fixed-length bit vector, requiring no additional storage space.

(2) Low computational complexity. The data similarity measurement method of the present invention uses the Hamming distance to judge the similarity between features. For multidimensional features, a feature fingerprint library is built and fingerprints are stored in segments, avoiding the heavy fingerprint redundancy that traditional methods incur by hashing multidimensional features one by one, and reducing the number of pairwise feature comparisons. For the case where the same object has different content but similar features across modalities, Bloom-filter-based fingerprint construction maps the data into the same dimension, supporting cross-modal similarity measurement. Computing the Hamming distance of two fixed-length bit vectors not only effectively characterizes how similar the data are in their features but also reduces the computational complexity to O(1), greatly improving the efficiency of similarity measurement over massive data.

(3) Scalability and maintainability. Because the fingerprints are constructed with a Bloom filter, when analysis or labeling produces new data features, they only need to be added into the existing fingerprint; there is no need to recompute all feature fingerprints.

(4) Sensitive data security. Because any fingerprint is hashed with different hash functions and the hash itself carries no semantic information about the features, sensitive data is protected against leakage.

(5) Fingerprint indexing efficiency. Building the feature fingerprint index as an inverted index only requires segmenting each fingerprint and storing each segment as an index key in the inverted index library, after which similar data can be retrieved directly; compared with the traditional pairwise comparison approach, this saves substantial time and further improves the efficiency of fingerprint retrieval and matching. In addition, the present invention constructs fingerprints from data features, so it can reduce the dimensionality of multimodal data features into fingerprints of a unified dimension, effectively addressing the difficulty of heterogeneous and cross-modal retrieval.

Brief Description of the Drawings

Figure 1 shows the initial state of the Bloom filter.

Figure 2 is a schematic diagram of inserting index values into the corresponding bit vector.

Figure 3 is a schematic diagram of inserting an added feature into the corresponding bit vector.

Figure 4 is a schematic diagram of constructing the inverted index library.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

Text data is taken as an example.

Text data usually consists of documents containing a large number of words, sentences, and passages. Because text documents have no fixed structure and, unlike structured data, their sentence lengths vary, the data must first be cleaned before feature extraction: content of little value to the document, such as HTML tags, special non-alphanumeric characters, and accented characters, is removed. Words of little meaning, generally called stop words, are also identified and deleted; these are usually the words with the highest frequency in a document, such as a, an, and the. The abbreviations in the text are then expanded to help normalize it (S10).

Data feature fingerprint construction:

S1. Extract features (words) from the data using the TF-IDF method:

A1. The term frequency (TF) is computed by:

TF_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

where i is the index of a word, j is the text index, n_{i,j} is the number of times the i-th word appears in the j-th text, and the denominator is the total number of words in the j-th text; that is, the TF value is the ratio of a word's frequency in a text to the total number of words in that text;

A2. The inverse document frequency (IDF) is computed by:

IDF_i = log( M / (m_i + α) )    (2)

where M is the total number of texts, m_i is the number of texts containing the i-th word, and α is an empirical coefficient, generally taken as 0.01, whose purpose is to prevent the denominator from being 0;

A3. The TF-IDF weight of each feature item of each text vector is computed by formula (3):

TFIDF_{i,j} = TF_{i,j} · IDF_i    (3)

As the formula shows, when a word appears frequently in a single text but rarely in other texts, its TF-IDF value is large;

A4. After computing the weight of each feature item with the TF-IDF weighting method, select the items with larger weights as the text's features.

Next, the obtained data features are preliminarily organized and standardized. Judging feature similarity and deleting duplicate or near-duplicate feature descriptors makes the features clearer and saves computation in the subsequent fingerprint similarity comparison. Standardization mainly consists of deleting duplicate or similar descriptors of the same data. The steps are as follows:

S12. Compute the distance between text feature vectors to obtain the similarity between texts. Generally, the closer two text feature vectors are, the more similar the two texts. Cosine similarity is used to judge the similarity between different features (step C1), with the formula:

cos(A, B) = ( Σ_{i=1..n} x_i·y_i ) / ( √(Σ_{i=1..n} x_i²) · √(Σ_{i=1..n} y_i²) )    (4)

where the two text vectors are A = (x_1, x_2, …, x_n) and B = (y_1, y_2, …, y_n).

Then a similarity threshold is set (step C2); the features are compared pairwise, and if the similarity exceeds the threshold, the other similar features are deleted and only one unique feature is kept (step C3); the steps are repeated until every feature (descriptor) is unique (step C4).

Algorithm 1: data feature standardization

Input: test data DSn, feature-extraction threshold σ_1, similarity threshold σ_2, feature set S_t

Output: data features

(The body of Algorithm 1 is given as a figure in the original document.)
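Since the body of Algorithm 1 survives only as an image in the original, here is an illustrative sketch of the standardization loop it corresponds to (steps C1 to C4), assuming features are already numeric vectors; the 0.9 threshold stands in for σ_2:

```python
import math

def cosine(a, b):
    """Formula (4): cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def standardize(features, threshold=0.9):
    """Steps C1-C4: keep a feature only if it is not too similar to any
    feature already kept; duplicates and near-duplicates are dropped."""
    kept = []
    for f in features:
        if all(cosine(f, k) <= threshold for k in kept):
            kept.append(f)
    return kept
```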

S2. Obtain index values for the data's different features (words) via hash operations, insert the index values into the corresponding bit-vector table with a Bloom filter, and take the sequence of 0s and 1s in the final bit vector as the fingerprint of the data's features; specifically:

As a probabilistic data structure, the BF works as follows: given a data set A = {a_1, a_2, …, a_n}, k hash functions {h_1, h_2, …, h_k} yield the index values h_i(a_j) of the data to be inserted, where i ∈ [1, k], j ∈ [1, n]. In the initial state, all values in the bit-vector table are 0; on insertion, the bit-vector values at the k positions given by h_i(a_j) are set to 1. If a position is already 1, a hash collision has occurred, meaning an earlier element was mapped there, and the value at that position is not modified again. The sequence of 0s and 1s in the final bit-vector table is output as the fingerprint of the data features.

In actual operation:

Bloom filter parameter setting and initialization:

If the Bloom filter's bit vector is too short, inserted data features are very likely to map to the same bits, i.e. the false-positive probability is high; as the data volume grows, the false-positive probability may approach 1.

To reduce the density of large feature sets and reduce hash collisions when representing many data features, the Bloom filter's bit vector should be sufficiently long. A longer Bloom filter also ensures the accuracy of similarity detection in the subsequent hash mapping. On initialization, for a bit array of length m, all bit-vector values are set to 0, where the size of m depends on the size of the data features. The initial state of the Bloom filter is shown in Figure 1; here m is chosen to be 19.

Computing the data feature hash values:

The number k of independent hash functions used in the mapping also requires a trade-off: more hash functions fill the Bloom filter faster but reduce execution efficiency, while too few hash functions increase the false-positive rate.

From the Bloom filter's false-positive probability, the false-positive rate is lowest when k = (m/n) ln 2; that is, the optimal number of hash functions is ln 2 times the ratio of the bit-vector length m to the number of feature elements n. If the hash functions themselves perform poorly, computational efficiency also suffers; commonly used hash functions include MurmurHash, FNV, and MD5.
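The rule k = (m/n) ln 2 can be evaluated directly; rounding to the nearest whole hash function, with a floor of one, is an illustrative choice:

```python
import math

def optimal_k(m, n):
    """Number of hash functions minimizing the false-positive rate
    for an m-bit filter holding n feature elements: k = (m/n) ln 2."""
    return max(1, round(m / n * math.log(2)))
```

For the m = 19 bit vector holding 3 features used in the example, this gives k = 4.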

After the first-stage feature standardization, each data object has a feature set, e.g. data A = {feature 1, feature 2, feature 3}. The hash functions {h_1, h_2, …, h_k} then yield the index values of the data to be inserted, which correspond to the array subscripts in the bit vector that must be set to 1. As shown in Figure 2, feature 1 of data A is mapped by hash functions h_1(x) and h_2(x) to bits 4 and 9 of the bit vector, feature 2 is mapped to bits 2 and 14, and feature 3 is mapped to bits 10 and 17.

Generating the data feature fingerprint:

For each data item, mapping its features into the fixed-length bit vector yields a string of 0s and 1s. As shown in Figure 2, the bit vector of data A becomes "0100100001100010010" after the mapping; this string is taken as the data feature fingerprint.

If analysis of the data yields new features, they are simply mapped into the bit vector with the same k hash functions as the existing features; no additional storage space is needed. Feature addition is shown in Figure 3.

Algorithm 2: data feature fingerprint construction

Input: bit-array length m, number of hash functions k, data feature set S

Output: data feature fingerprint

(The pseudocode of Algorithm 2 is reproduced only as an image in the original.)
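Since Algorithm 2 survives only as an image, the construction it describes can be sketched as follows (an illustrative sketch: the k hash functions are simulated by salting MD5 with the function index, an assumption, since the text only names MurmurHash, FNV, and MD5 as candidates):

```python
import hashlib

def make_hashes(k: int, m: int):
    """Build k independent hash functions over [0, m) by salting MD5 with the index."""
    def h(i):
        return lambda feature: int(hashlib.md5(f"{i}:{feature}".encode()).hexdigest(), 16) % m
    return [h(i) for i in range(k)]

def build_fingerprint(features, m: int, k: int) -> str:
    """Map each feature through k hash functions into an m-bit vector (Algorithm 2)."""
    bits = [0] * m                      # initial state: all positions 0 (Figure 1)
    hashes = make_hashes(k, m)
    for f in features:
        for h in hashes:
            bits[h(f)] = 1              # set the addressed array subscripts to 1
    return "".join(map(str, bits))      # the 0/1 string is the feature fingerprint

fp = build_fingerprint({"feature1", "feature2", "feature3"}, m=19, k=2)
```

Adding a new feature later only sets further bits, which matches the incremental-insertion property described below: positions already set to 1 are never cleared.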

A Bloom filter can quickly decide whether an item belongs to a set without accessing the data content, which effectively protects data privacy. Moreover, membership queries on a Bloom filter never produce false negatives, so it suits scenarios that need a fast membership test but do not require high precision.

In addition, a Bloom filter answers membership queries in constant time; its space usage and query time are far smaller than those of other data structures, remedying the shortcomings of traditional hash-table lookup.

Meanwhile, in most existing applications data features grow incrementally and are almost never deleted, so fingerprint maintenance mainly means inserting newly generated features. Building data feature fingerprints with Bloom filters exploits this incremental nature to the fullest: it retains the Bloom filter's advantages in space utilization and query efficiency while avoiding the maintenance cost of deletion operations.

Data feature similarity measurement:

The Hamming distance, often used in error-control coding for data transmission, is the number of positions at which two equal-length codewords differ; for two binary strings a and b, it equals the number of 1s in the result of a XOR b.

For example, for two length-n sequences a = (a1, a2, …, an) and b = (b1, b2, …, bn), the Hamming distance is computed as:

D(a, b) = Σi=1..n (ai ⊕ bi)

The similarity of data features can be measured by the Hamming distance between a pair of BF bit vectors. Because identical features of different data items are hashed to the same BF positions, the more feature values two items share, the shorter the Hamming distance between their feature fingerprints.

For example, with bit-vector strings "0100100001100010010" for data A, "0001100001101010010" for data B, and "0001100111000101001" for data C, the Hamming distance between the characterized A and B is 3, while that between the characterized A and C is 10. Data A and data B therefore have the shorter Hamming distance, i.e. the data features of A and B are the more similar.
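The distances in this example can be recomputed directly; a quick sketch that XORs each position and counts the differences gives 3 for A vs B and 10 for A vs C:

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance of two equal-length bit strings: count of differing positions."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

A = "0100100001100010010"
B = "0001100001101010010"
C = "0001100111000101001"

d_ab = hamming(A, B)   # 3
d_ac = hamming(A, C)   # 10
```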

Therefore, the Hamming distance measures data feature similarity effectively; the steps are:

D1. Select the feature fingerprints of any two data items from the data feature fingerprint set;

D2. Compute the Hamming distance between the two data feature fingerprints;

D3. Measure the similarity of the two data features based on the Hamming distance.

Algorithm 3: data feature similarity measurement

Input: data feature fingerprint set Fn

Output: data feature similarity

for (i = 0; i < n; i++) do

(The loop body of Algorithm 3 is reproduced only as an image in the original.)

Output Result
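Since the body of Algorithm 3 is an image, the pairwise loop it describes can be sketched as follows (the normalization 1 − d/m, so that identical fingerprints score 1.0, is an illustrative assumption, not specified by the patent):

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Hamming distance of two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_similarity(fingerprints):
    """Algorithm 3 sketch: score every pair in the fingerprint set.

    Similarity is expressed here as 1 - d/m (an assumed normalization),
    so a shorter Hamming distance d yields a higher score.
    """
    m = len(fingerprints[0])
    result = {}
    for (i, fi), (j, fj) in combinations(enumerate(fingerprints), 2):
        result[(i, j)] = 1.0 - hamming(fi, fj) / m
    return result

sims = pairwise_similarity(["0100", "0101", "1010"])
```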

Existing data similarity detection techniques such as Shingle, Min-Wise, and Mins are simple to implement and widely applicable, but they still have drawbacks. For example, Shingle measures the similarity and containment of two documents through the intersection of vocabularies formed from consecutive words, which is computationally expensive. Using a Bloom filter for similar-data detection avoids the high computation and storage cost caused by Shingle's feature-set intersection, striking a balance between performance and similarity-matching accuracy.

Furthermore, when heterogeneous data share similar features across modalities, pairwise comparison of large volumes of data features incurs enormous computational cost; combining an inverted-index retrieval method makes similarity discrimination much more efficient.

The index library of data feature fingerprints is built to avoid the heavy computation of one-by-one comparison against the fingerprint library and thus to speed up similarity measurement; it comprises the construction of the inverted index library and quick retrieval based on it.

The construction of the inverted index library, shown in Figure 4, includes the following steps:

F1. First, segment the fingerprint value of each existing data feature in every modality: assume the characterized fingerprint has m bits (m depends on the feature size; 32 bits here) and take c = 4 bits per group, giving m/c = 8 segments. If the Hamming distance between two characterized fingerprints is at most n, then at least 8 − n of the segments have identical values after segmentation;

F2. Take the fingerprint value of each segment as the fingerprint segment value and use it as the index key;

F3. Take the numbers of the data items that have the same fingerprint segment value at the same segment position as the index value;

F4. Store the data feature fingerprints as <fingerprint segment value, data number> key-value pairs in the feature fingerprint index table, forming the inverted index library.

Algorithm 4: building the inverted index library

Input: data feature fingerprint set Fn, fingerprint length m, bits per block c

Output: data feature index table <key, value>

(The pseudocode of Algorithm 4 is reproduced only as an image in the original.)
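Steps F1-F4 can be sketched as follows (a minimal sketch; keying the index by the pair (segment position, segment value) is an assumption made because step F3 compares segments at the same position):

```python
from collections import defaultdict

def build_inverted_index(fingerprints, c: int):
    """F1-F4: split each m-bit fingerprint into m/c segments and index the
    data numbers under (segment position, segment value) keys."""
    index = defaultdict(set)
    for data_no, fp in enumerate(fingerprints):
        segments = [fp[i:i + c] for i in range(0, len(fp), c)]   # F1: segment
        for pos, seg in enumerate(segments):                     # F2: segment value as key
            index[(pos, seg)].add(data_no)                       # F3/F4: number -> index table
    return index

idx = build_inverted_index(["01001000", "01001111"], c=4)
```

Two fingerprints that agree on any segment now land in the same posting set, which is what makes the quick retrieval below avoid full pairwise comparison.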

For quick retrieval, the fingerprint value of a newly generated data feature, or of a data feature from another modality, is segmented in the same way into m/c = 8 segments (step G1). The fingerprint value of each segment (the fingerprint segment value) is then used as the index key to look up the corresponding index value in the inverted index library (step G2); each index value is a set of data numbers (step G3). After the set for every block has been looked up, only the fingerprint numbers extracted at least m/c − n times need be retained; the ratio of the number of times each fingerprint number appears during extraction to the total number of extractions is taken as the similarity between the retrieved feature and the existing features in the fingerprint library, computed as:

Similarity = Σb num(b) / (m/c)    (5)

Here num marks whether the corresponding fingerprint number appears in each extraction: num(b) is 1 if it appears, and 0 otherwise.

Algorithm 5: data feature lookup (quick retrieval)

Input: data feature fingerprint f, fingerprint length m, bits per block c, Hamming distance h

Output: data feature numbers

(The pseudocode of Algorithm 5 is reproduced only as an image in the original.)
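The retrieval steps G1-G5 can be sketched end to end as follows (a self-contained sketch: the index layout and the "at least m/c − h matching segments" threshold follow the pigeonhole argument of step F1; keying segments by position is an assumption, as above):

```python
from collections import defaultdict, Counter

def segment(fp: str, c: int):
    """G1: split a fingerprint into m/c segments of c bits."""
    return [fp[i:i + c] for i in range(0, len(fp), c)]

def quick_search(query_fp: str, fingerprints, c: int, h: int):
    """G1-G5: look up each query segment, count how often each data number
    is extracted, keep numbers seen at least m/c - h times, and score them
    by occurrences / (m/c) as in formula (5)."""
    index = defaultdict(set)
    for data_no, fp in enumerate(fingerprints):
        for pos, seg in enumerate(segment(fp, c)):
            index[(pos, seg)].add(data_no)

    n_seg = len(query_fp) // c                           # m/c segments
    hits = Counter()
    for pos, seg in enumerate(segment(query_fp, c)):     # G2: index lookup
        for data_no in index[(pos, seg)]:                # G3: collect numbers
            hits[data_no] += 1
    threshold = n_seg - h                                # G4: pigeonhole bound
    return {d: cnt / n_seg for d, cnt in hits.items() if cnt >= threshold}  # G5

db = ["01001000", "01001111", "11110000"]
res = quick_search("01001000", db, c=4, h=1)             # {0: 1.0, 1: 0.5}
```

Only candidates sharing at least one segment with the query are touched at all, which is how the method sidesteps pairwise comparison against the whole fingerprint library.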

The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the above embodiments do not limit the invention in any form; all technical solutions obtained by equivalent replacement or equivalent transformation fall within the protection scope of the invention.

Claims (9)

1. A data feature fingerprint construction method, characterized by comprising the following steps:
S1. performing feature extraction on the data using the TF-IDF method;
S2. based on the Bloom filter structure, mapping the features of different data into the bit vector table of the corresponding BF through hash operations, and outputting the sequence of 0s and 1s in the final bit vector as the fingerprint of the data features.
2. The data feature fingerprint construction according to claim 1, characterized in that the TF-IDF method in step S1 comprises the following steps:
A1. computing the term frequency TF of the extracted data features by formula (1):

TFi,j = ni,j / Σk nk,j    (1)
wherein i is the word index, j is the text index of the data, ni,j is the number of times the i-th word appears in the j-th text, and the denominator is the total number of words in the j-th text; that is, the TF value is the ratio of the frequency of a word in a text to the total number of words in that text;
A2. computing the inverse document frequency IDF by formula (2):

IDFi = log(M / mi + α)    (2)
wherein M is the total number of texts, mi is the number of texts containing the i-th word, and α is an empirical coefficient, generally taken as 0.01;
A3. computing the TF-IDF weight of each feature item of the text vector by formula (3):

TFIDFi,j = TFi,j · IDFi    (3)

A4. after the weight of each feature item has been computed with the TF-IDF weighting method, selecting the data with larger weights as the features of the text.
3. The data feature fingerprint construction according to claim 1, characterized in that step S1 further comprises a step S10 of preprocessing the data to be extracted, including:
B1. cleaning the data: taking text data as an example, deleting HTML tags, special non-alphanumeric characters, accented characters, and stop words;
B2. expanding abbreviations in the data.
4. The data feature fingerprint construction according to claim 1, characterized in that step S1 further comprises a step S12 of organizing the extracted features to standardize them, comprising the following steps:
C1. computing the similarity between different features using cosine similarity;
C2. setting a similarity threshold;
C3. comparing the features pairwise, and deleting a duplicate feature if the similarity exceeds the threshold;
C4. repeating step C3 until every feature is unique.
5. The data feature fingerprint construction according to claim 4, characterized in that the cosine similarity computation in step C1 is specifically:
computing the distance between text feature vectors to obtain the similarity between texts, where in general the closer two text vectors are, the more similar the two texts;
computing the similarity between different features with the cosine similarity formula:

cos(A, B) = Σi=1..n xi·yi / (√(Σi=1..n xi²) · √(Σi=1..n yi²))    (4)
wherein A = (x1, x2, …, xn) and B = (y1, y2, …, yn) are text vectors.
6. The data feature fingerprint construction according to claim 1, characterized in that in step S2 the data features are hashed and mapped into the corresponding bit vector table of the BF, specifically:
given a data set A = {a1, a2, …, an}, using k hash functions {h1, h2, …, hk} to obtain the index values hi(aj) of the data features to be inserted, where i ∈ [1, k] and j ∈ [1, n];
in the initial state, setting all values in the bit vector table to 0; on insertion, setting the bit vector values at the k positions corresponding to the index values hi(aj) to 1;
outputting the sequence of 0s and 1s in the final bit vector table as the fingerprint of the data features.
7. A similarity measurement method applicable to the data feature fingerprint of any one of claims 1-6, characterized in that:
since identical features of different data can be hashed to the same positions of the BF, the more identical feature values the data contain, the shorter the Hamming distance between the feature fingerprints of the two data items, so data feature similarity can be measured by the Hamming distance, with steps including:
D1. selecting the fingerprints of any two data features from the data feature fingerprint set;
D2. computing the Hamming distance between the two data feature fingerprints;
D3. measuring the similarity of the two data features based on the Hamming distance.
8. An index applicable to the data feature fingerprint of any one of claims 1-6, characterized by further comprising the establishment of an inverted index library, comprising the following steps:
F1. numbering the data items, and dividing the feature fingerprint of each length-m data item into ⌈m/c⌉ segments with segment length c;
F2. taking the fingerprint value of each segment as the fingerprint segment value and using it as the index key;
F3. taking the numbers of the data items that have the same fingerprint segment value at the same segment position as the index value;
F4. storing the data feature fingerprints as <fingerprint segment value, data number> key-value pairs in the feature fingerprint index table, forming the inverted index library.
9. A retrieval method, characterized by being applicable to the inverted index library of claim 8 and comprising the following steps:
G1. dividing the data feature fingerprint to be retrieved, with segment length c, into ⌈m/c⌉ segments;
G2. indexing by fingerprint segment value and obtaining the data numbers of each segment;
G3. adding the data numbers to a set;
G4. finding the data whose numbers appear at least ⌈m/c⌉ − h times, where h is the Hamming distance;
G5. computing the similarity between the features of the data to be retrieved and the existing features in the index library by formula (5):

Similarity = Σb num(b) / (m/c)    (5)

wherein the segment length c is assumed to be 4, and num marks whether the corresponding fingerprint number appears in each extraction: num(b) is 1 if it appears, and 0 otherwise.
CN202210741611.6A 2022-06-28 2022-06-28 Data feature fingerprint construction and similarity measurement method and index Pending CN115204147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210741611.6A CN115204147A (en) 2022-06-28 2022-06-28 Data feature fingerprint construction and similarity measurement method and index


Publications (1)

Publication Number Publication Date
CN115204147A true CN115204147A (en) 2022-10-18

Family

ID=83577816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210741611.6A Pending CN115204147A (en) 2022-06-28 2022-06-28 Data feature fingerprint construction and similarity measurement method and index

Country Status (1)

Country Link
CN (1) CN115204147A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118227684A (en) * 2024-05-23 2024-06-21 中国电子科技集团公司第三十研究所 A data behavior tracing method based on multi-dimensional discrete data fingerprint
WO2025038334A1 (en) * 2023-08-14 2025-02-20 Benchling, Inc. Searching of chemical structures
WO2025180074A1 (en) * 2024-02-29 2025-09-04 华为技术有限公司 Data transmission system and method, and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179893A1 (en) * 2014-12-22 2016-06-23 Blackberry Limited Method and system for efficient feature matching
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A Similarity Retrieval Method of Alignment Matrix Based on Multi-stage Fingerprint
CN109445834A (en) * 2018-10-30 2019-03-08 北京计算机技术及应用研究所 A Quick Comparison Method for Program Code Similarity Based on Abstract Syntax Tree
CN110704645A (en) * 2019-08-22 2020-01-17 中国人民解放军军事科学院评估论证研究中心 Corpus full-text retrieval method and system based on fingerprints
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN114281989A (en) * 2021-12-06 2022-04-05 重庆邮电大学 Data deduplication method and device based on text similarity, storage medium and server



Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN108573045B (en) Comparison matrix similarity retrieval method based on multi-order fingerprints
CN115204147A (en) Data feature fingerprint construction and similarity measurement method and index
US7433869B2 (en) Method and apparatus for document clustering and document sketching
CN107784110B (en) A kind of index establishment method and apparatus
CN101084499B (en) Systems and methods for searching and storing data
WO2023071118A1 (en) Method and system for calculating text similarity, device, and storage medium
US8244767B2 (en) Composite locality sensitive hash based processing of documents
CN108280130A (en) A method of finding sensitive data in text big data
Sood et al. Probabilistic near-duplicate detection using simhash
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN106980656B (en) A kind of searching method based on two-value code dictionary tree
CN106503223B (en) An online housing search method and device combining location and keyword information
CN102110123B (en) Method for establishing inverted index
CN101976253A (en) Chinese variation text matching recognition method
US10146817B2 (en) Inverted index and inverted list process for storing and retrieving information
US10545960B1 (en) System and method for set overlap searching of data lakes
WO2020037794A1 (en) Index building method for english geographical name, and query method and apparatus therefor
US9600578B1 (en) Inverted index and inverted list process for storing and retrieving information
CN116451675A (en) A detection and optimization method for similar duplicate records based on the density clustering algorithm DBSCAN algorithm
CN112417082B (en) A Disambiguation Archiving and Storage Method of Scientific Research Achievement Data
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN115309978A (en) Webpage processing method based on key long sentence and text length pre-classification
CN114330336A (en) New word discovery method and device based on left-right information entropy and mutual information
CN116227477A (en) An Extraction-Based Method for Automatically Generating News Headlines of Topic Clusters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination