
CN113868297A - Sensitive data analysis method and device, terminal equipment and storage medium - Google Patents


Info

Publication number: CN113868297A
Application number: CN202111137129.3A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 彭龙
Current Assignee: Ping An Life Insurance Company of China Ltd (the listed assignees may be inaccurate)
Original Assignee: Ping An Life Insurance Company of China Ltd
Prior art keywords: data, sampled, sensitive, updated, word

Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202111137129.3A
Publication of CN113868297A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24568 - Data stream processing; Continuous queries
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the present application are applicable to the technical field of artificial intelligence and provide a sensitive data analysis method, apparatus, terminal device and storage medium. The method includes: copying the data stream returned by a server to obtain a copied data stream; for any current data item in the copied data stream, updating the N determined sampled data items with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items; and performing sensitive word analysis on the updated N sampled data items to obtain a sensitive word result. With this method, a terminal device that downsamples an unknown amount of data can give every data item the same probability of being sampled.

Description

Sensitive data analysis method, apparatus, terminal device and storage medium

Technical Field

The present application belongs to the technical field of artificial intelligence, and in particular relates to a sensitive data analysis method, apparatus, terminal device and storage medium.

Background

With the arrival of the data era, the great value contained in data can be mined, but protecting the sensitive data within it has also become more difficult. Common sensitive data includes names, ID card numbers, addresses, telephone numbers and bank account numbers, all of which are personal private information.

At present, a data stream returned from a server to a client usually contains a large amount of data, so the data is usually downsampled with an existing sampling method to reduce the amount of data that must be processed. However, the number of data items in each data stream is unknown and differs from stream to stream. When an existing downsampling method is applied, the total number of data items in the stream must first be counted, and the reciprocal of that total is then taken as the probability with which each data item is sampled. Existing downsampling methods are therefore limited: they cannot give every data item the same sampling probability when the amount of data is unknown.

Summary of the Invention

The embodiments of the present application provide a sensitive data analysis method, apparatus, terminal device and storage medium, which address the limitation of existing downsampling methods that they cannot give every data item the same sampling probability when the amount of data is unknown.

In a first aspect, an embodiment of the present application provides a sensitive data analysis method, the method comprising:

copying the data stream returned by a server to obtain a copied data stream;

for any current data item in the copied data stream, updating the N determined sampled data items with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items;

performing sensitive word analysis on the updated N sampled data items to obtain a sensitive word result.

In one embodiment, copying the data stream returned by the server to obtain the copied data stream includes:

for any data item in the data stream, identifying the data structure of the data item;

copying the data that belongs to a target data structure to obtain the copied data corresponding to the data item.

In one embodiment, before the N determined sampled data items are updated with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items, the method further includes:

if M ≤ N, determining the current data item as sampled data, until the N determined sampled data items are obtained.

In one embodiment, updating the N determined sampled data items with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items, includes:

if the draw made with probability $\frac{N}{M}$ determines that the current data item is not sampled data, keeping the N determined sampled data items unchanged;

if the draw made with probability $\frac{N}{M}$ determines that the current data item is sampled data, selecting, with probability $\frac{1}{N}$, one replacement sampled data item from the N determined sampled data items to be updated by the current data item;

replacing the replacement sampled data item with the current data item to obtain the updated N sampled data items.

In one embodiment, performing sensitive word analysis on the updated N sampled data items to obtain the sensitive word result includes:

for any sampled data item, recognizing the sampled data item and generating text information;

segmenting the text information into words to obtain multiple text segments of the text information;

determining the segment vectors of the multiple text segments according to the position of each text segment in a preset word vector library;

recognizing each of the multiple text segments according to its segment vector to obtain the sensitive word result of the sampled data item.

In one embodiment, after sensitive word analysis is performed on the updated N sampled data items to obtain the sensitive word result, the method further includes:

counting, according to the sensitive word result, the number of sampled data items that contain sensitive words;

if the ratio of the number of sampled data items containing sensitive words to N is greater than a preset value, performing sensitive word analysis on all data in the data stream returned by the server.

In one embodiment, performing sensitive word analysis on all data in the data stream returned by the server, when the ratio of the number of sampled data items containing sensitive words to N is greater than the preset value, includes:

desensitizing the sensitive words contained in the data stream returned by the server to obtain desensitized data;

sending the desensitized data to the client.
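The escalation rule in the two embodiments above (count the sampled items that contain sensitive words, then compare their share of N with a preset value) can be sketched as follows. This is only an illustrative sketch: the function name `should_analyze_full_stream`, the boolean-flag input and the handling of an empty sample set are assumptions, not details given in the application.

```python
from typing import Sequence


def should_analyze_full_stream(sample_has_sensitive: Sequence[bool],
                               preset_value: float) -> bool:
    """Decide whether the whole returned data stream needs sensitive word analysis.

    sample_has_sensitive holds one flag per sampled data item (True if the
    item's sensitive word result reported at least one sensitive word).
    """
    n = len(sample_has_sensitive)
    if n == 0:
        # Assumption: with no samples there is nothing to escalate.
        return False
    hits = sum(1 for flag in sample_has_sensitive if flag)
    # The text requires the ratio to be strictly greater than the preset value.
    return hits / n > preset_value
```

For example, with four sampled items of which two contain sensitive words, the ratio is 0.5, so a preset value of 0.4 triggers full analysis while a preset value of 0.5 does not.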

In a second aspect, an embodiment of the present application provides a sensitive data analysis apparatus, the apparatus comprising:

a copying module, configured to copy the data stream returned by a server to obtain a copied data stream;

a sampling module, configured to, for any current data item in the copied data stream, update the N determined sampled data items with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items;

an analysis module, configured to perform sensitive word analysis on the updated N sampled data items to obtain a sensitive word result.

In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program implements the method of any one of the first aspect when executed by a processor.

In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method of any one of the first aspect.

Compared with the prior art, the embodiments of the present application have the following beneficial effects. The data stream returned by the server to the client is first copied, and the copied data stream is processed instead of the returned data stream itself, so sensitive word analysis can be performed on the data stream without delaying the data that the server returns to the client. Then, based on the processing order M in which the current data item in the copied data stream is sampled, the N determined sampled data items are updated with probability $\frac{N}{M}$ to obtain the updated N sampled data items. In this way, the terminal device can perform sampling as soon as it receives the current data item, rather than after it has received all the data in the data stream. This also avoids the limitation of existing downsampling methods, so that the terminal device can give every data item the same sampling probability when downsampling an unknown amount of data. As a result, when the terminal device performs sensitive word analysis on this more representative sampled data, the sensitive word result it obtains is closer to the sensitive word result of the data in the data stream as a whole.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an implementation flowchart of a sensitive data analysis method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an implementation of S102 of a sensitive data analysis method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation of S103 of a sensitive data analysis method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an implementation of S101 of a sensitive data analysis method provided by an embodiment of the present application;

FIG. 5 is an implementation flowchart of a sensitive data analysis method provided by another embodiment of the present application;

FIG. 6 is a schematic diagram of an implementation of S132 of a sensitive data analysis method provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a sensitive data analysis apparatus provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed Description

In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present application.

It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.

In addition, in the description of this specification and the appended claims, the terms "first", "second", "third" and so on are used only to distinguish between descriptions and should not be construed as indicating or implying relative importance.

The sensitive data analysis method provided by the embodiments of the present application can be applied to terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPCs) and netbooks. The embodiments of the present application place no restriction on the specific type of terminal device.

Referring to FIG. 1, FIG. 1 shows an implementation flowchart of a sensitive data analysis method provided by an embodiment of the present application. The method includes the following steps:

S101. The terminal device copies the data stream returned by the server to obtain a copied data stream.

In one embodiment, the terminal device may copy the data in the data stream by means of a network traffic bypass technique to obtain the copied data stream. Specifically, the network traffic bypass technique may use the "port mirroring" function of a switch network device to monitor and copy the data stream returned by the server in real time without delaying that data stream. In other words, the data stream originally returned by the server can still be returned directly to the client.

It should be noted that the "port mirroring" function may be implemented by the switch network device, or a software application that implements the "port mirroring" function may be installed in advance on the client that receives the data stream, so as to copy the data stream that the server returns to that client. It can be understood that, after copying the data stream with such a software application, the client may transmit the copied data stream to the terminal device for subsequent processing, which is not limited here.

In this embodiment, a software application that implements the "port mirroring" function is installed on the client that receives the data stream in order to copy the data stream. This reduces the hardware that must be installed for the server and lowers the hardware cost of the server.
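As a loose software analogy for the client-side copying described above, the sketch below forwards every item of a stream unchanged while placing a copy on a queue for later analysis. It is not port mirroring itself and is not taken from the application: the name `mirror_stream`, the use of an in-process `queue.Queue`, and the choice to drop copies when the queue is full (so the forwarded stream is never blocked) are all assumptions made for illustration.

```python
import queue
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def mirror_stream(stream: Iterable[T], copy_queue: "queue.Queue[T]") -> Iterator[T]:
    """Yield each item of the stream unchanged and enqueue a copy for analysis."""
    for item in stream:
        try:
            # Non-blocking put: the forwarded stream never waits for the analyzer.
            copy_queue.put_nowait(item)
        except queue.Full:
            # Assumption: if the analysis side falls behind, this copy is skipped.
            pass
        # The original item continues to the client untouched.
        yield item
```

Note that the queue receives a reference to each item rather than a deep copy; for immutable payloads such as `bytes` or `str` that distinction does not matter.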

In one embodiment, the server may serve the data requests of multiple clients at the same time. The data stream returned by the server therefore includes, but is not limited to, the data stream received by a single client; it may also include the data streams that multiple clients each receive from the server, which is not limited here.

S102. For any current data item in the copied data stream, the terminal device updates the N determined sampled data items with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items.

In one embodiment, the current data item is the data item that the terminal device is processing at the current moment. A data stream usually contains a large amount of data, and so does the corresponding copied data stream. If the terminal device performed sensitive word analysis on all the data in the copied data stream, it would consume a great deal of processing time. The terminal device can therefore randomly downsample the data to obtain sampled data and reduce the amount of data that must be analyzed.

However, when the data in the copied data stream is randomly downsampled, every data item usually needs to have the same probability of being sampled. Conventionally, the terminal device first counts the total number of data items in the data stream and takes the reciprocal of that total as the sampling probability. It then decides, item by item, whether each data item is sampled according to that probability. This approach requires the terminal device to know the total number of data items in the data stream in advance. Because the number of data items differs from stream to stream, the terminal device can count the total number of data items in the copied data stream only after it has received the copied data stream completely.

To avoid the limitation that the terminal device can give every data item the same sampling probability only after it has completely received the copied data stream, the terminal device may determine the processing order M of a current data item as soon as it receives that data item from the data stream. It then updates the N determined sampled data items with probability $\frac{N}{M}$ to obtain the updated N sampled data items, so that the terminal device can give every data item the same sampling probability without knowing the total number of data items.

Specifically, the terminal device may preset the number N of sampled data items to be taken from each copied data stream. If it determines that M ≤ N, the current data item is taken as sampled data, giving one of the N sampled data items. The check M ≤ N is then repeated for the processing order M of each subsequent current data item until N sampled data items have been obtained.

It can be understood that N may be a number of sampled data items preset by an operator and can be set according to the actual situation, which is not limited here. In this embodiment, for the data in the copied data stream, the terminal device first takes the data items whose processing order falls within the first N sequence numbers as sampled data. It follows that, if the copied data stream contains fewer than N data items, every data item in the copied data stream is determined as sampled data with probability 100%.

In another embodiment, referring to FIG. 2, when the copied data stream contains more than N data items, for any current data item the terminal device may update the N determined sampled data items with probability $\frac{N}{M}$, according to the processing order M in which the current data item is sampled, to obtain the updated N sampled data items. Specifically, this may include the following sub-steps S1021 to S1023, detailed as follows:

If the processing order of the current data item satisfies M > N, the terminal device determines with probability $\frac{N}{M}$ whether the current data item is sampled data. For example:

S1021. If the draw made with probability $\frac{N}{M}$ determines that the current data item is not sampled data, the terminal device keeps the N determined sampled data items unchanged.

S1022. If the draw made with probability $\frac{N}{M}$ determines that the current data item is sampled data, the terminal device selects, with probability $\frac{1}{N}$, one replacement sampled data item from the N determined sampled data items to be updated by the current data item.

S1023. The terminal device replaces the replacement sampled data item with the current data item to obtain the updated N sampled data items.
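The fill phase (M ≤ N) and sub-steps S1021 to S1023 together match the classic reservoir sampling scheme. The sketch below is one minimal way to express it, assuming the stream is any Python iterable; the function name `reservoir_sample` and the optional `rng` parameter are illustrative choices rather than names from the application.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], n: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep n sampled items from a stream whose length is not known in advance."""
    rng = rng or random.Random()
    samples: List[T] = []
    # m plays the role of the processing order M (1-based).
    for m, item in enumerate(stream, start=1):
        if m <= n:
            # M <= N: the first N items are always taken as sampled data.
            samples.append(item)
        elif rng.random() < n / m:
            # S1022: with probability N/M the current item is sampled, and one of
            # the N existing samples is chosen uniformly (probability 1/N).
            # S1023: the chosen sample is replaced by the current item.
            samples[rng.randrange(n)] = item
        # S1021: otherwise the N sampled items are left unchanged.
    return samples
```

Because each item is examined once as it arrives and only N items are retained, the sampler needs neither the total stream length nor a second pass over the data.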

Specifically, if the copied data stream contains more than N data items, then starting from the M-th data item (M ≥ N+1), the terminal device determines in turn, with probability $\frac{N}{M}$, whether the current data item is sampled. If the current data item is sampled, one replacement sampled data item to be updated by the current data item must be chosen at random from the N previously determined sampled data items. It can be understood that the probability of a given sampled data item being determined as the replacement among the M data items is $\frac{N}{M}\times\frac{1}{N}=\frac{1}{M}$, where $\frac{N}{M}$ is the probability of selecting the N determined sampled data items from the M data items, and $\frac{1}{N}$ is the probability of a sampled data item being randomly chosen as the replacement from the N determined sampled data items.

Based on the above description, it can be deduced that as the current data item changes, that is, as M changes, the probability that the current data item is sampled is equal to the probability that any one of the N previously determined sampled data items is sampled.
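The equal-probability claim can be checked with a short induction on M. The derivation below is a sketch that uses only the quantities already defined in the text ($N$, $M$, the acceptance probability $\frac{N}{M}$ and the replacement probability $\frac{1}{N}$); the event notation $P_M(\cdot)$, meaning the probability of being among the N sampled items after the M-th item has been processed, is introduced here for convenience.

```latex
% Base case: after the first N items (M = N), every item is in the sample set.
P_N(x_i) = 1 = \frac{N}{N}, \qquad 1 \le i \le N.

% Inductive step for M > N, assuming P_{M-1}(x_i) = \frac{N}{M-1} for all i \le M-1.
% An item already in the sample set is replaced at step M with probability
% \frac{N}{M}\cdot\frac{1}{N} = \frac{1}{M}, so it survives with probability 1 - \frac{1}{M}:
P_M(x_i) = \frac{N}{M-1}\left(1 - \frac{1}{M}\right)
         = \frac{N}{M-1}\cdot\frac{M-1}{M}
         = \frac{N}{M}, \qquad i \le M-1.

% The current item x_M enters the sample set with probability
P_M(x_M) = \frac{N}{M}.

% Hence, after M items, every item seen so far is sampled with the same probability \frac{N}{M}.
```

In particular, when the stream ends after T items, every item has been retained with probability $\frac{N}{T}$, even though T was never known during sampling.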

It should be added that each current data item can be sampled at the moment it is received, which avoids the limitation that the terminal device can give every data item the same sampling probability only after it has completely received the copied data stream.

S103. The terminal device performs sensitive word analysis on the updated N sampled data items to obtain a sensitive word result.

In one embodiment, the terminal device may perform sensitive word analysis on the sampled data with a pre-trained sensitive word recognition model to obtain the sensitive word result. The sensitive word recognition model may use an existing text recognition model or text classification model to convert the data in the data stream into text information. The model then segments the text information into words and extracts features to obtain segment vectors. Finally, based on the segment vectors, it outputs a recognition result for each segment. This recognition result can be regarded as the sensitive word result.

Specifically, referring to FIG. 3, performing sensitive word analysis on the updated N sampled data items in S103 to obtain the sensitive word result includes the following sub-steps S1031 to S1034, detailed as follows:

S1031. For any sampled data item, the terminal device recognizes the sampled data item and generates text information.

S1032. The terminal device segments the text information into words to obtain multiple text segments of the text information.

S1033. The terminal device determines the segment vectors of the multiple text segments according to the position of each text segment in a preset word vector library.

S1034. The terminal device recognizes each of the multiple text segments according to its segment vector to obtain the sensitive word result of the sampled data item.

In an embodiment, the process of recognizing data and converting it into text information is existing technology and is not described here.

In an embodiment, the terminal device may segment and vectorize the text information as follows. The terminal device builds the preset word vector library in advance and assigns an ordinal to each segment in the library. For a piece of text information, the terminal device first treats the whole text as one candidate text segment. If the preset word vector library contains a segment identical to the candidate, the ordinal of that segment in the library is used as the segment vector in subsequent processing. If the library contains no segment identical to the whole text, the last character of the text is deleted, and the remaining text is treated as a candidate and matched against the library again. If the match succeeds, the remaining text is taken as one text segment. If the match fails, the last character of the remaining text is deleted again. These steps are repeated until multiple text segments, and the segment vector corresponding to each of them, are obtained.
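The trimming procedure above resembles forward maximum matching. The sketch below illustrates it under stated assumptions: the preset word vector library is modelled as a plain dictionary from segment to ordinal, the function name `segment` is hypothetical, and a single character that is not in the library is emitted with the placeholder ordinal `-1`, a case the source does not specify.

```python
from typing import Dict, List, Tuple


def segment(text: str, lexicon: Dict[str, int],
            unknown_id: int = -1) -> List[Tuple[str, int]]:
    """Split text into (segment, ordinal) pairs by trimming from the right."""
    tokens: List[Tuple[str, int]] = []
    start = 0
    while start < len(text):
        end = len(text)
        # Drop trailing characters until the candidate is in the library
        # or only one character is left.
        while end - start > 1 and text[start:end] not in lexicon:
            end -= 1
        word = text[start:end]
        # The ordinal in the library serves as the segment vector.
        tokens.append((word, lexicon.get(word, unknown_id)))
        # Continue with the text that follows the matched segment.
        start = end
    return tokens
```

With a library containing 请 ("please"), 确认 ("confirm"), 您的 ("your") and 身份证号 ("ID card number"), the text 请确认您的身份证号 is split into those four segments in order. A real system would likely use a trie or a cap on segment length instead of repeated slicing.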

In an embodiment, the sensitive word result includes, but is not limited to, a result indicating that sensitive words are present and a result indicating that none are present. A result indicating presence means that a text segment in the sampled data is a sensitive word, and the result may also include the sensitive word type of that word. For example, take the text information "Please confirm whether your ID card number is 'XXXX'". Its text segments may be "please", "confirm", "your", "ID card number", "whether", "is", and "XXXX". The sensitive word recognition model may determine "ID card number" to be a sensitive word type and, based on that type, determine the corresponding segment "XXXX" to be a sensitive word.

After determining the sensitive word type, the terminal device may also match sensitive words among the multiple text segments using the matching rule that corresponds to that type. For example, the sensitive word "XXXX" may be recognized based on the format of an ID card number (an 18-digit number).

In an embodiment, for the multiple text segments of any text information, if the sensitive word recognition model can identify only a sensitive word type among the segments but cannot identify any specific sensitive word, the model may treat the text information as yielding a result that contains no sensitive word.

For example, if the text information is "Please provide your ID card number", the sensitive word recognition model can identify only the corresponding sensitive word type, namely "ID card number", and cannot identify a specific sensitive word: no text segment matches the ID card number format (an 18-digit number), so no sensitive word "XXXX" is found. The model may therefore output a result indicating that the text information contains no sensitive word.
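The two cases above (type plus matching word, and type only) can be sketched with a rule table keyed by sensitive word type. This is an illustration only, not the recognition model itself: `TYPE_RULES` and `find_sensitive` are hypothetical names, and the 18-digit rule follows the format stated in the text rather than any official ID card specification.

```python
import re
from typing import Dict, List

# Hypothetical rule table: sensitive word type -> format of the sensitive word.
# 身份证号 means "ID card number"; the format is the 18-digit one from the text.
TYPE_RULES = {"身份证号": re.compile(r"\d{18}")}


def find_sensitive(tokens: List[str]) -> Dict[str, object]:
    """Report a sensitive word only if a type and a matching word both occur."""
    for sensitive_type, rule in TYPE_RULES.items():
        if sensitive_type not in tokens:
            continue
        # The type was recognized; look for a segment in the expected format.
        hits = [t for t in tokens if rule.fullmatch(t)]
        if hits:
            return {"contains_sensitive": True,
                    "type": sensitive_type, "words": hits}
    # A type without a concrete word counts as "no sensitive word".
    return {"contains_sensitive": False, "type": None, "words": []}
```

A segment list that contains both the type and an 18-digit segment is reported as containing a sensitive word; a list that contains only the type is not.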

In this embodiment, the data stream returned by the server to the client is first copied, and the copied data stream is processed instead of the returned data stream itself. Sensitive word analysis can thus be performed on the data stream without delaying the data that the server returns to the client. Then, based on the processing order M in which the current data in the copied data stream is sampled, the N determined sampled data are updated with probability N/M to obtain the updated N sampled data. In this way, the terminal device can sample each current data item when it is received rather than after all data in the data stream has been received. This also avoids the limitation of existing down-sampling methods, so that the terminal device can down-sample a stream containing an unknown number of data items while giving every item the same probability of being sampled. Because the sampled data is more representative, the sensitive word result that the terminal device obtains from it is closer to the sensitive word result of all data in the whole data stream.

In an embodiment, referring to FIG. 4, step S101, in which the data stream returned by the server is copied to obtain the copied data stream, may be implemented through the following sub-steps S1011 and S1012, described in detail below:

S1011. For any data in the data stream, the terminal device identifies the data structure of the data.

S1012. The terminal device copies the part of the data that belongs to a target data structure to obtain the copied data corresponding to the data.

In an embodiment, the data is usually sent in the form of a message, so the data structure of the data can also be regarded as a message structure. A message usually has the following structure: status line <status-line>; message headers <headers>; response body [<response-body>]. The status line contains information such as the HTTP protocol version of the server and the response status. The message headers contain additional information about the response message, for example the type of the response body, its length, and the encoding used. The response body contains the specific content that the server sends to the client, that is, the content the client requested.

Based on this, the terminal device may regard the content carried in the response message, namely its response body, as the substantive data that is needed. The terminal device can therefore identify the message structure of the message and determine the structure of the response body to be the target data structure, so as to copy the data contained in the response body and obtain the copied data.

It can be understood that when the terminal device copies only part of the data in a message by means of traffic bypass technology, the copied data still contains the substantive data returned by the server, while the amount of data that the terminal device needs to copy is reduced.
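Copying only the response body can be sketched as follows. The sketch assumes a complete HTTP/1.1 response held in memory as bytes, with the header section and the body separated by an empty line; the function name `copy_response_body` is hypothetical, and chunked transfer encoding and compressed bodies are deliberately not handled.

```python
def copy_response_body(raw: bytes) -> bytes:
    """Return a copy of the response body of a raw HTTP/1.1 response."""
    # The status line and headers end at the first empty line (CRLF CRLF).
    _head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        # No header/body boundary found: treat the message as having no body.
        return b""
    # bytes() makes the copy explicit; the original message is left untouched.
    return bytes(body)
```

For a message consisting of a status line, two headers and a JSON body, only the JSON body is returned, so the status line and headers are never copied.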

In an embodiment, referring to FIG. 5, after step S103, in which sensitive word analysis is performed on the updated N sampled data to obtain the sensitive word result, the method further includes the following steps S131 and S132, described in detail below:

S131. The terminal device counts, according to the sensitive word result, the number of sampled data that contain sensitive words.

S132. If the ratio of the number of sampled data containing sensitive words to N is greater than a preset value, the terminal device performs sensitive word analysis on all of the data stream returned by the server.

In an embodiment, the preset value may be set in advance by staff according to the actual situation, which is not limited here. It should be added that if the copied data stream contains fewer than N data items, then, as explained for S103, all data in the copied data stream is used as sampled data. In that case the ratio is calculated between the number of sampled data containing sensitive words and the total number of sampled data.

In an embodiment, as explained for S103 above, the terminal device may analyze each sampled data item to obtain its sensitive word result. Based on these results, the terminal device can count the number of sampled data that contain sensitive words. It then calculates the ratio of that number to the number N of sampled data. When the ratio is greater than the preset value, the terminal device determines that the data stream returned by the server to the client contains a large number of sensitive words and that there may be a risk of sensitive data being leaked.
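The ratio check in S131 and S132 can be sketched in a few lines. The function name `should_analyze_full_stream` and the representation of the per-sample results as a list of booleans are assumptions made for illustration; the denominator is the number of sampled data actually held, which equals N once at least N data items have been seen and is smaller otherwise.

```python
from typing import Sequence


def should_analyze_full_stream(contains_sensitive: Sequence[bool],
                               preset_value: float) -> bool:
    """Decide whether the whole returned data stream should be analyzed."""
    total = len(contains_sensitive)  # N, or fewer for a short stream
    if total == 0:
        # Nothing was sampled, so there is no ratio to compare.
        return False
    hits = sum(contains_sensitive)   # sampled data containing sensitive words
    return hits / total > preset_value
```

With two hits among four samples and a preset value of 0.4, the ratio 0.5 exceeds the preset value and full analysis is triggered; the preset value itself is left to the operator, as the text states.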

Based on this, the terminal device may directly perform sensitive word analysis on all of the data stream that the server returns the next time (in this case the returned data stream need not be copied), and then desensitize the sensitive words it contains according to the sensitive word result to prevent information leakage. Specifically, referring to FIG. 6, performing sensitive word analysis on all of the data stream that the server returns the next time includes the following sub-steps S1321 and S1322, described in detail below:

S1321. The terminal device desensitizes the sensitive words contained in the data stream returned by the server to obtain desensitized data.

S1322. The terminal device sends the desensitized data to the client.

Data desensitization means transforming sensitive words according to desensitization rules to obtain desensitized data, so that the sensitive words are reliably protected. The desensitization rules may be set in advance by staff in both the terminal device and the client, so that the client can restore the desensitized data using the same rules and obtain the original data returned by the server.
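The idea of a rule shared by the terminal device and the client can be illustrated with a toy reversible transformation. The digit-shift rule below is purely hypothetical and is not taken from the source; it offers no real protection and only shows that the same pre-agreed parameter (`offset`) lets the client undo what the terminal device applied. A real deployment would presumably use a proper encryption or tokenization scheme.

```python
DIGITS = "0123456789"


def desensitize(word: str, offset: int) -> str:
    """Shift every ASCII digit by offset (mod 10); leave other characters."""
    return "".join(
        str((int(ch) + offset) % 10) if ch in DIGITS else ch
        for ch in word
    )


def restore(masked: str, offset: int) -> str:
    """Undo desensitize() using the same pre-agreed offset."""
    return desensitize(masked, -offset)
```

Applying `desensitize` on the terminal device and `restore` on the client with the same `offset` returns the original sensitive word.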

Please refer to FIG. 7, which is a structural block diagram of a sensitive data analysis apparatus provided by an embodiment of the present application. The modules included in the sensitive data analysis apparatus of this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6; for details, see FIG. 1 to FIG. 6 and the related descriptions. For ease of explanation, only the parts related to this embodiment are shown. Referring to FIG. 7, the sensitive data analysis apparatus 700 may include a copying module 710, a sampling module 720, and an analysis module 730, wherein:

The copying module 710 is configured to copy the data stream returned by the server to obtain the copied data stream.

The sampling module 720 is configured to, for any current data in the copied data stream, update the N determined sampled data with probability N/M according to the processing order M in which the current data is sampled, to obtain the updated N sampled data.

The analysis module 730 is configured to perform sensitive word analysis on the updated N sampled data to obtain a sensitive word result.

In an embodiment, the copying module 710 is further configured to:

For any data in the data stream, identify the data structure of the data; and copy the part of the data that belongs to a target data structure to obtain the copied data corresponding to the data.

In an embodiment, the sensitive data analysis apparatus 700 further includes:

A determining module, configured to, if M ≤ N, determine the current data to be sampled data, until the N determined sampled data are obtained.

In an embodiment, the sampling module 720 is further configured to:

If the determination made with probability N/M finds that the current data is not sampled data, keep the N determined sampled data unchanged; if the determination made with probability N/M finds that the current data is sampled data, select from among the N determined sampled data, each with probability 1/N, one replacement sampled data item to be updated by the current data; and replace the replacement sampled data with the current data to obtain the updated N sampled data.

In an embodiment, the analysis module 730 is further configured to:

For any sampled data, recognize the sampled data and generate text information; segment the text information to obtain multiple text segments of the text information; determine the segment vectors of the multiple text segments according to the position of each text segment in a preset word vector library; and recognize each of the multiple text segments according to its segment vector to obtain the sensitive word result of the sampled data.

In an embodiment, the sensitive data analysis apparatus 700 further includes:

A statistics module, configured to count, according to the sensitive word result, the number of sampled data that contain sensitive words.

A processing module, configured to perform sensitive word analysis on all of the data stream returned by the server if the ratio of the number of sampled data containing sensitive words to N is greater than a preset value.

In an embodiment, the processing module is further configured to:

Desensitize the sensitive words contained in the data stream returned by the server to obtain desensitized data; and send the desensitized data to the client.

It should be understood that, in the structural block diagram of the sensitive data analysis apparatus shown in FIG. 7, each module is used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6. Those steps have been explained in detail in the foregoing embodiments; for details, see FIG. 1 to FIG. 6 and the related descriptions, which are not repeated here.

FIG. 8 is a structural block diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 8, the terminal device 800 of this embodiment includes a processor 810, a memory 820, and a computer program 830 that is stored in the memory 820 and can run on the processor 810, for example a program for the sensitive data analysis method. When the processor 810 executes the computer program 830, it implements the steps in the embodiments of the sensitive data analysis method described above, for example S101 to S103 shown in FIG. 1. Alternatively, when the processor 810 executes the computer program 830, it implements the functions of the modules in the embodiment corresponding to FIG. 7, for example the functions of modules 710 to 730 shown in FIG. 7; for details, see the related description of the embodiment corresponding to FIG. 7.

For example, the computer program 830 may be divided into one or more modules, which are stored in the memory 820 and executed by the processor 810 to implement the sensitive data analysis method provided by the embodiments of the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program 830 in the terminal device 800. For example, the computer program 830 may implement the sensitive data analysis method provided by the embodiments of the present application.

The terminal device 800 may include, but is not limited to, the processor 810 and the memory 820. Those skilled in the art will understand that FIG. 8 is only an example of the terminal device 800 and does not limit it. The terminal device 800 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input and output devices, network access devices, a bus, and the like.

The processor 810 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

The memory 820 may be an internal storage unit of the terminal device 800, for example a hard disk or main memory of the terminal device 800. The memory 820 may also be an external storage device of the terminal device 800, for example a plug-in hard disk, a smart memory card, or a flash memory card fitted to the terminal device 800. Further, the memory 820 may include both an internal storage unit of the terminal device 800 and an external storage device.

An embodiment of the present application provides a terminal device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the sensitive data analysis method of the foregoing embodiments.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the sensitive data analysis method of the foregoing embodiments.

In a fifth aspect, an embodiment of the present application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the sensitive data analysis method of the foregoing embodiments.

The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (10)

1. A sensitive data analysis method, characterized by comprising: copying a data stream returned by a server to obtain a copied data stream; for any current data in the copied data stream, updating N determined sampled data with probability N/M according to the processing order M in which the current data is sampled, to obtain updated N sampled data; and performing sensitive word analysis on the updated N sampled data to obtain a sensitive word result.
2. The sensitive data analysis method according to claim 1, characterized in that copying the data stream returned by the server to obtain the copied data stream comprises: for any data in the data stream, identifying the data structure of the data; and copying the part of the data that belongs to a target data structure to obtain the copied data corresponding to the data.
3. The sensitive data analysis method according to claim 1, characterized in that, before updating the N determined sampled data with probability N/M according to the processing order M in which the current data is sampled to obtain the updated N sampled data, the method further comprises: if M ≤ N, determining the current data to be sampled data, until the N determined sampled data are obtained.
4. The sensitive data analysis method according to any one of claims 1 to 3, characterized in that updating the N determined sampled data with probability N/M according to the processing order M in which the current data is sampled to obtain the updated N sampled data comprises: if the determination made with probability N/M finds that the current data is not sampled data, keeping the N determined sampled data unchanged; if the determination made with probability N/M finds that the current data is sampled data, selecting from among the N determined sampled data, each with probability 1/N, one replacement sampled data item to be updated by the current data; and updating the replacement sampled data with the current data to obtain the updated N sampled data.
5. The sensitive data analysis method according to claim 1, characterized in that performing sensitive word analysis on the updated N sampled data to obtain the sensitive word result comprises: for any sampled data, recognizing the sampled data and generating text information; segmenting the text information to obtain multiple text segments of the text information; determining the segment vectors of the multiple text segments according to the position of each text segment in a preset word vector library; and recognizing each of the multiple text segments according to its segment vector to obtain the sensitive word result of the sampled data.
6. The sensitive data analysis method according to claim 1, characterized in that, after performing sensitive word analysis on the updated N sampled data to obtain the sensitive word result, the method further comprises: counting, according to the sensitive word result, the number of sampled data that contain sensitive words; and if the ratio of the number of sampled data containing sensitive words to N is greater than a preset value, performing sensitive word analysis on all of the data stream returned by the server.
7. The sensitive data analysis method according to claim 6, characterized in that performing sensitive word analysis on all of the data stream returned by the server, if the ratio of the number of sampled data containing sensitive words to N is greater than the preset value, comprises: desensitizing the sensitive words contained in the data stream returned by the server to obtain desensitized data; and sending the desensitized data to a client.
8. A sensitive data analysis apparatus, characterized by comprising: a copying module, configured to copy a data stream returned by a server to obtain a copied data stream; a sampling module, configured to, for any current data in the copied data stream, update N determined sampled data with probability N/M according to the processing order M in which the current data is sampled, to obtain updated N sampled data; and an analysis module, configured to perform sensitive word analysis on the updated N sampled data to obtain a sensitive word result.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202111137129.3A 2021-09-27 2021-09-27 Sensitive data analysis method and device, terminal equipment and storage medium Pending CN113868297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111137129.3A CN113868297A (en) 2021-09-27 2021-09-27 Sensitive data analysis method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113868297A true CN113868297A (en) 2021-12-31

Family

ID=78991294

Country Status (1)

Country Link
CN (1) CN113868297A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110313147A (en) * 2017-03-09 2019-10-08 西门子公司 Data processing method, device and system
CN110891047A (en) * 2019-10-08 2020-03-17 中国信息通信研究院 Smart speaker data stream processing method and system
CN112000467A (en) * 2020-07-24 2020-11-27 广东技术师范大学 Data tilt processing method and device, terminal equipment and storage medium
CN112668052A (en) * 2020-12-30 2021-04-16 北京天融信网络安全技术有限公司 Data desensitization method and device, storage medium and electronic equipment
CN112800250A (en) * 2021-02-05 2021-05-14 联想(北京)有限公司 Multimedia data stream processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
WO2020119053A1 (en) Picture clustering method and apparatus, storage medium and terminal device
CN112507704B (en) Multi-intention recognition method, device, equipment and storage medium
WO2023272852A1 (en) Method and apparatus for classifying user by using decision tree model, device and storage medium
CN110399489B (en) Chat data segmentation method, device and storage medium
CN113268453A (en) Log information compression storage method and device
CN111507090A (en) Abstract extraction method, apparatus, device, and computer-readable storage medium
CN114692085B (en) Feature extraction method and device, storage medium and electronic equipment
CN111091182A (en) Data processing method, electronic device and storage medium
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium
CN113901817A (en) Document classification method and device, computer equipment and storage medium
CN110022343B (en) Adaptive Event Aggregation
CN112784596A (en) Method and device for identifying sensitive words
WO2025031380A1 (en) Abnormal behavior test method, abnormal behavior test device, electronic device, non-transitory computer-readable storage medium, and computer program product
CN113868297A (en) Sensitive data analysis method and device, terminal equipment and storage medium
WO2022142025A1 (en) Text classification method and apparatus, and terminal device and storage medium
CN113221035A (en) Method, apparatus, device, medium, and program product for determining an abnormal web page
CN117311719A (en) Code generation method, device, equipment and storage medium
CN111309850A (en) A data feature extraction method, device, terminal device and medium
CN115115920B (en) Graph data self-supervision training method and device
CN111143461A (en) Mapping relation processing system and method and electronic equipment
CN114926677B (en) Image classification method, device, equipment and storage medium
CN116975300A (en) Information mining method and system based on big data set
WO2019085075A1 (en) Information element set generation method and rule execution method based on rule engine
CN116089732A (en) User preference identification method and system based on advertisement click data
CN110929512A (en) A data enhancement method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination