KR102054450B1

KR102054450B1 - Differential privacy-based query processing system and privacy budget saving method using the same

Info

Publication number: KR102054450B1
Application number: KR1020170161286A
Authority: KR
Inventors: 박석; 강민석
Original assignee: 서강대학교 산학협력단
Priority date: 2017-11-29
Filing date: 2017-11-29
Publication date: 2019-12-10
Anticipated expiration: 2037-11-29
Also published as: KR20190062764A

Abstract

질의 처리 시스템이 프라이버시 예산을 절약하는 방법으로서, 수신한 복수의 질의 각각에 대한 최대 갱신 횟수를 계산하고, 계산한 최대 갱신 횟수와 미리 설정된 전체 프라이버시 예산을 기초로 복수의 질의 각각에 대한 예산을 할당한다. 예산이 할당된 복수의 질의 중 적어도 하나의 질의를 하나의 배치에 포함시키고, 하나의 배치에 포함된 적어도 하나의 질의를 재배치한 후 재배치된 적어도 하나의 질의를 처리한다.A method for saving a privacy budget by a query processing system, comprising: calculating a maximum update count for each of a plurality of received queries, and allocating a budget for each of the plurality of queries based on the calculated maximum update count and a preset total privacy budget. do. At least one query among the plurality of budget-allocated queries is included in one batch, the at least one query included in one batch is rearranged, and the rearranged at least one query is processed.

Description

Differential privacy-based query processing system and privacy budget saving method using the same}

본 발명은 차분 프라이버시 기반 질의 처리 시스템 및 이를 이용한 프라이버시 예산 절약 방법에 관한 것이다.The present invention relates to a differential privacy based query processing system and a privacy budget saving method using the same.

대화형 환경에서 데이터를 공개하거나 임의의 사용자가 질의한 내용에 대한질의 결과를 제공할 때, 개인 식별을 방지하기 위하여 개인 정보를 보호하는 기법은 필수적이다. 이를 위해 k-익명화 등의 구조적 보호 기법들이 연구되어 왔다. 그러나, 차분 공격을 비롯한 다양한 공격 기법들의 출현으로, 공격자의 사전 지식과 무관하게 일정 수준 이상으로 프라이버시를 보호할 수 있는 차분 프라이버시(differential privacy)가 연구되고 있다.When publishing data in an interactive environment or providing query results for a query made by any user, a technique for protecting personal information is essential to prevent personal identification. To this end, structural protection techniques such as k-anonymization have been studied. However, with the advent of various attack techniques including differential attack, differential privacy that can protect the privacy above a certain level irrespective of the attacker's prior knowledge is being studied.

질의 결과에 대해 차분 프라이버시를 만족시키기 위해서는, 일정 수준의 노이즈를 원 데이터에 삽입하여 질의 결과로 제공해야 한다. 이때 원 데이터에 삽입되는 노이즈 값을 도출하는 과정에서, 질의자는 자신에게 할당된 프라이버시 예산(privacy budget)의 일부를 소모한다. In order to satisfy the differential privacy for the query results, a certain level of noise must be inserted into the raw data and provided as a query result. At this time, in the process of deriving the noise value inserted into the raw data, the queryer spends a part of the privacy budget allocated to the query.

예산을 많이 사용할수록 삽입되는 노이즈의 양이 줄어들어, 질의 결과의 정확도는 커진다. 그러나, 질의자의 예산은 한정되어 있기 때문에, 대량의 질의 처리시 각 질의에 적용될 예산을 적절히 분할하여 적용해야 한다. 그리고 예산이 너무 적게 적용될 경우 대량의 노이즈가 삽입되어, 질의 결과의 정확도가 크게 저하되는 문제점이 있다. The more the budget is spent, the less noise is inserted, and the more accurate the query results. However, since the budget of the query is limited, the budget to be applied to each query should be appropriately divided and applied when processing a large number of queries. In addition, when the budget is too small, a large amount of noise is inserted, which greatly reduces the accuracy of query results.

따라서, 본 발명은 대화형 환경에서 차분 프라이버시를 적용하여, 선형 질의를 처리하는 질의 처리 시스템 및 이를 이용한 프라이버시 예산을 절약하는 방법을 제공한다.Accordingly, the present invention provides a query processing system for processing linear queries by applying differential privacy in an interactive environment, and a method of saving a privacy budget using the same.

상기 본 발명의 기술적 과제를 달성하기 위한 본 발명의 하나의 특징인 질의 처리 시스템이 차분 프라이버시 질의를 처리하는 방법으로서,As a method of processing a differential privacy query, a query processing system which is one feature of the present invention for achieving the technical problem of the present invention,

수신한 복수의 질의 각각에 대한 최대 갱신 횟수를 계산하는 단계, 상기 계산한 최대 갱신 횟수와 미리 설정된 전체 프라이버시 예산을 기초로 상기 복수의 질의 각각에 대한 예산을 할당하는 단계, 예산이 할당된 복수의 질의 중 적어도 하나의 질의를 하나의 배치에 포함시키는 단계, 그리고 상기 하나의 배치에 포함된 적어도 하나의 질의를 재배치하고, 상기 재배치된 적어도 하나의 질의를 처리하는 단계를 포함한다.Calculating a maximum number of updates for each of the plurality of received queries, allocating a budget for each of the plurality of queries based on the calculated maximum number of updates and a preset total privacy budget, a plurality of allocated budgets Including at least one of the queries in one batch, and relocating at least one query included in the one batch, and processing the relocated at least one query.

상기 최대 갱신 횟수를 계산하는 단계는, 상기 수신한 복수의 질의가 저장된 데이터베이스로부터 원본 히스토그램인 제1 히스토그램과 상이한 균일 분포 형태의 가상 히스토그램인 제2 히스토그램을 생성하는 단계, 상기 제2 히스토그램으로부터 도출된 질의 결과 값인 제1 결과 값과 상기 제1 히스토그램으로부터 도출된 질의 결과 값인 제2 결과 값을 계산하는 단계, 상기 제1 결과 값과 제2 결과 값의 차이가 미리 설정된 임계값을 벗어나는지 확인하는 단계, 그리고 상기 차이가 상기 임계값을 벗어나면, 상기 제2 결과 값에 노이즈를 삽입한 결과를 배포하고, 상기 가상 히스토그램을 갱신하는 단계를 포함할 수 있다.The calculating of the maximum update number may include generating a second histogram, which is a virtual histogram having a uniform distribution that is different from a first histogram, which is an original histogram, from a database in which the plurality of queries are stored. Calculating a first result value that is a query result value and a second result value that is a query result value derived from the first histogram, and checking whether a difference between the first result value and the second result value deviates from a preset threshold value And distributing a result of inserting noise into the second result value and updating the virtual histogram when the difference is out of the threshold value.

상기 최대 갱신 횟수를 계산하는 단계는, 상기 제1 히스토그램과 초기화된 제2 히스토그램 사이의 상대 엔트로피의 상한 값과, 히스토그램 매개 변수, 상기 복수의 질의의 개수, 그리고 상기 임계값을 벗어나는 라플라스 분포 영역을 산출한 값을 기초로 최대 갱신 횟수가 결정될 수 있다.The calculating of the maximum number of updates may include an upper limit value of a relative entropy between the first histogram and the initialized second histogram, a histogram parameter, the number of the plurality of queries, and a Laplace distribution region that is out of the threshold. The maximum update count may be determined based on the calculated value.

상기 하나의 배치에 포함시키는 단계는, 포와송 분포에 따라 질의가 진입할 확률이 질의가 임의로 진입할 확률인 균일 분포 값 보다 높아질 때까지 적어도 하나의 질의를 상기 배치에 포함시키는 단계를 포함할 수 있다.Including in one batch may include including at least one query in the batch until a probability that a query enters according to a Poisson distribution is higher than a uniform distribution value that is a probability that the query randomly enters. have.

상기 배치에 포함시키는 단계는, 최대 갱신 횟수와 질의 개수 그리고 상기 전체 프라이버시 예산을 토대로 배치 크기의 최대 임계 값을 설정하는 단계를 포함할 수 있다.The step of including in the batch may include setting a maximum threshold value of the batch size based on the maximum number of updates, the number of queries, and the total privacy budget.

상기 하나의 질의를 처리하는 단계는, 상기 제1 결과 값과 제2 결과 값의 차이가 상기 임계값보다 크면, 거리 값을 계산하는 단계, 그리고 계산한 거리 값을 기준으로 가장 큰 거리 값을 가지는 질의 순으로 재배치하는 단계를 포함하고, 상기 거리 값은 상기 배치에 포함된 적어도 하나의 질의 각각에 대해 상기 원본 히스토그램과 가상 히스토그램의 차이로 계산할 수 있다.The processing of the one query may include: calculating a distance value if the difference between the first result value and the second result value is greater than the threshold value, and having the largest distance value based on the calculated distance value. And relocating in order of query, wherein the distance value is calculated as a difference between the original histogram and the virtual histogram for each of at least one query included in the batch.

상기 본 발명의 기술적 과제를 달성하기 위한 본 발명의 또 다른 특징인 차분 프라이버시 질의를 처리하는 질의 처리 시스템으로서,As a query processing system for processing a differential privacy query that is another feature of the present invention for achieving the technical problem of the present invention,

수신한 복수의 질의 각각에 대한 최대 갱신 횟수를 계산하고, 계산한 최대 갱신 횟수와 미리 설정된 전체 프라이버시 예산을 기초로 상기 복수의 질의 각각에 대한 예산을 할당하는 예산 할당부, 예산이 할당된 복수의 질의 중 적어도 하나의 질의를 하나의 배치에 포함시키는 배치 크기 설정부, 그리고 상기 하나의 배치에 포함된 적어도 하나의 질의를 처리하는 질의 처리부를 포함한다.A budget allocator configured to calculate a maximum update count for each of the plurality of received queries, and allocate a budget for each of the plurality of queries based on the calculated maximum update count and a preset total privacy budget; A batch size setting unit including at least one query among the queries in one batch, and a query processing unit processing at least one query included in the one batch.

상기 예산 할당부는, 제1 히스토그램과 초기화된 제2 히스토그램 사이의 상대 엔트로피의 상한 값과, 히스토그램 매개 변수, 상기 복수의 질의의 개수, 그리고 상기 임게값을 벗어나는 라플라스 분포 영역을 산출한 값을 기초로 상기 최대 갱신 횟수를 계산할 수 있다.The budget allocator is based on an upper limit value of a relative entropy between a first histogram and an initialized second histogram, a histogram parameter, the number of the plurality of queries, and a value of a Laplace distribution region that is out of the threshold value. The maximum update count may be calculated.

상기 배치 크기 설정부는, 포와송 분포에 따라 질의가 진입할 확률이 질의가 임의로 진입할 확률인 균일 분포 값 보다 높아질 때까지 적어도 하나의 질의를 상기 배치에 포함시킬 수 있다.The batch size setting unit may include at least one query in the batch until a probability that the query enters according to the Poisson distribution becomes higher than a uniform distribution value that is a probability that the query randomly enters.

상기 배치 크기 설정부는, 상기 배치의 크기를 최대 갱신 횟수와 질의 개수 그리고 상기 전체 프라이버시 예산을 토대로 최대 임계 값으로 설정할 수 있다. The batch size setting unit may set the size of the batch to a maximum threshold value based on the maximum number of updates, the number of queries, and the total privacy budget.

본 발명에 따르면 한정된 프라이버시 예산을 고려하여 배치 처리를 수행하여, 동일한 수의 질의 처리시 보다 적은 예산으로 질의를 처리할 수 있다.According to the present invention, a batch process may be performed in consideration of a limited privacy budget, so that a query may be processed at a smaller budget when processing the same number of queries.

도 1은 본 발명의 실시예에 따른 질의 처리 시스템의 구조도이다.
도 2는 본 발명의 실시예에 따른 프라이버시 예산 절약 방법에 대한 흐름도이다.
도 3은 본 발명의 실시예에 따른 실제 갱신 횟수와 추론된 갱신 횟수를 비교한 예시도이다.
도 4는 본 발명의 실시예에 따른 방법과 일반적인 방법에 따른 갱신 횟수를 비교한 예시도이다.
도 5는 본 발명의 실시예에 따른 다양한 히스토그램과 라플라스 분포에서 임계값을 벗어난 영역에 대한 예시도이다.1 is a structural diagram of a query processing system according to an embodiment of the present invention.
2 is a flowchart illustrating a privacy budget saving method according to an embodiment of the present invention.
3 is an exemplary view comparing actual update times and inferred update times according to an embodiment of the present invention.
4 is an exemplary view comparing update times according to a method and a general method according to an exemplary embodiment of the present invention.
5 is an exemplary diagram of a region outside the threshold in various histogram and Laplace distributions according to an embodiment of the present invention.

아래에서는 첨부한 도면을 참고로 하여 본 발명의 실시예에 대하여 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자가 용이하게 실시할 수 있도록 상세히 설명한다. 그러나 본 발명은 여러 가지 상이한 형태로 구현될 수 있으며 여기에서 설명하는 실시예에 한정되지 않는다. 그리고 도면에서 본 발명을 명확하게 설명하기 위해서 설명과 관계없는 부분은 생략하였으며, 명세서 전체를 통하여 유사한 부분에 대해서는 유사한 도면 부호를 붙였다.DETAILED DESCRIPTION Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art may easily implement the present invention. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. In the drawings, parts irrelevant to the description are omitted in order to clearly describe the present invention, and like reference numerals designate like parts throughout the specification.

이하, 도면을 참조로 하여 본 발명의 실시예에 따른 차분 프라이버시 기반 질의 처리 시스템 및 이를 이용한 프라이버시 예산 절약 방법에 대해 상세히 설명한다.Hereinafter, a differential privacy based query processing system and a privacy budget saving method using the same according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

도 1은 본 발명의 실시예에 따른 질의 처리 시스템의 구조도이다.1 is a structural diagram of a query processing system according to an embodiment of the present invention.

도 1에 도시된 바와 같이, 질의 처리 시스템(100)은 적어도 하나의 프로세서에 의해 동작하고, 예산 할당부(110), 배치 크기 설정부(120), 그리고 질의 처리부(130)을 포함한다.As shown in FIG. 1, the query processing system 100 is operated by at least one processor and includes a budget allocator 110, a batch size setting unit 120, and a query processing unit 130.

예산 할당부(110)는 사용자 단말(도면 미도시) 또는 외부로부터 복수의 질의를 수신한다. 이때, 예산 할당부(110)가 수신한 복수의 질의는 선형 질의를 예로 하여 설명한다.The budget allocator 110 receives a plurality of queries from a user terminal (not shown) or the outside. In this case, the plurality of queries received by the budget allocator 110 will be described using a linear query as an example.

예산 할당부(110)는 수신한 질의의 개수와, 미리 허용된 전체 예산 정보를 토대로, 최대 갱신 횟수를 계산한다. 그리고 계산한 최대 갱신 횟수로 예산을 분할하여, 하나의 질의에 할당될 예산을 결정한다. The budget allocator 110 calculates a maximum update count based on the number of received queries and the total budget information previously allowed. The budget is divided by the calculated maximum update times to determine the budget to be allocated to one query.

즉, 종래에는 질의의 개수를 전체 예산으로 나눈 만큼의 예산이 하나의 질의에 할당하였다. 이 경우, 질의의 개수가 늘어날수록 하나의 질의에 할당되는 예산이 매우 작아지기 때문에, 노이즈가 과도하게 삽입되어 질의의 정확도가 감소되는 단점이 있다.That is, in the related art, as much as the budget divided by the total budget is allocated to one query. In this case, since the budget allocated to one query becomes very small as the number of queries increases, there is a disadvantage in that noise is excessively inserted to reduce the accuracy of the query.

따라서, 예산 할당부(110)는 단순히 질의의 개수로 전체 예산을 나누는 대신, 먼저 최대 갱신 횟수를 계산한다. 그리고, 계산한 최대 갱신 횟수를 토대로 예산을 분할한다. Accordingly, the budget allocator 110 first calculates the maximum update number instead of dividing the entire budget by the number of queries. The budget is divided based on the calculated maximum update count.

배치 크기 설정부(120)는 예산 할당부(110)에서 할당된 예산을 토대로, 배치 크기를 결정한다. 배치 크기는 클수록 질의에 대한 예산 소모를 최적화할 수 있으나, 배치 크기가 지나치게 커지면 각 질의의 반응 시간(response time)이 늘어나는 문제가 발생한다. The batch size setting unit 120 determines the batch size based on the budget allocated by the budget allocating unit 110. The larger the batch size, the more the budget consumption for the query can be optimized. However, if the batch size is too large, the response time of each query increases.

따라서, 배치 크기 설정부(120)는 질의의 모수와 포아송 분포에 따라 질의가 진입할 확률을 계산한다. 그리고, 계산한 확률을 토대로 배치 크기 설정부(120)는 배치 크기의 최대 임계 값을 설정한다. 배치 크기 설정부(120)는 설정한 배치 크기의 최대 임계 값을 이용하여, 질의에 대한 배치 크기를 결정한다. 이에 대해서는 이후 상세히 설명한다.Therefore, the batch size setting unit 120 calculates a probability that the query enters according to the parameter of the query and the Poisson distribution. The batch size setting unit 120 sets the maximum threshold value of the batch size based on the calculated probability. The batch size setting unit 120 determines the batch size for the query using the set maximum threshold value of the batch size. This will be described later in detail.

그리고 배치 크기 설정부(120)는 질의가 있을 때 가상 히스토그램으로부터의 질의 결과와 원본 히스토그램으로부터의 질의 결과에 노이즈를 삽입한 값을 비교한다. 그리고, 비교한 값이 미리 설정된 임계값보다 크면 상대 엔트로피가 크다고 판단하여, 질의를 거리 값(distance)이 큰 질의 기준으로 재정렬하여 가상 히스토그램을 갱신한다. 이를 토대로 적은 수의 갱신으로도 가상 히스토그램을 원본 히스토그램과 유사하게 만들어, 프라이버시 예산을 절약한다. 이에 대해서는 이후 상세히 설명한다.The batch size setting unit 120 compares the query result from the virtual histogram with the noise inserted into the query result from the original histogram when there is a query. If the comparison value is larger than the preset threshold value, the relative entropy is determined to be large, and the query is rearranged based on a query criterion having a large distance value to update the virtual histogram. Based on this, a small number of updates makes the virtual histogram similar to the original histogram, saving privacy budget. This will be described later in detail.

질의 처리부(130)는 예산 할당부(110)가 예산을 분할하고, 배치 크기 설정부(120)가 질의에 대한 배치 크기를 설정하고 질의를 재정렬하면, 분할된 예산과 설정된 배치 크기를 기초로 질의를 처리하여 질의 응답 정보를 생성한다. 질의 처리부(130)가 질의를 처리하는 방법은 다양한 방법으로 수행할 수 있으므로, 본 발명의 실시예에서는 어느 하나의 방법으로 한정하지 않는다.When the budget allocator 110 divides the budget, and the batch size setting unit 120 sets the batch size for the query and rearranges the query, the query processor 130 queries the query based on the divided budget and the set batch size. Process the to generate the query response information. Since the query processing unit 130 processes the query in various ways, the embodiment of the present invention is not limited to any one method.

또한, 본 발명의 실시예에서는 설명의 편의를 위하여 따로 도시하지 않았으나 질의 처리 시스템(100)을 구동하기 위해 필요한 프로그램들이 저장, 관리하는 메모리를 추가로 포함할 수 있다. 여기서 프로그램의 형태나 질의 응답 정보의 형태는 다양하므로, 본 발명의 실시예에서는 어느 하나의 형태로 한정하지 않는다.In addition, in the embodiment of the present invention, although not shown separately for convenience of description, a memory for storing and managing programs necessary for driving the query processing system 100 may be further included. Since the form of the program and the form of the question and answer information are various, the embodiment of the present invention is not limited to any one form.

이상에서 설명한 질의 처리 시스템(100)으로 프라이버시 예산을 절약하는 방법에 대해 도 2를 참조로 설명한다.A method of saving a privacy budget with the query processing system 100 described above will be described with reference to FIG. 2.

도 2는 본 발명의 실시예에 따른 프라이버시 예산 절약 방법에 대한 흐름도이다.2 is a flowchart illustrating a privacy budget saving method according to an embodiment of the present invention.

도 2에 도시된 바와 같이, 외부로부터 복수의 질의를 수신하면(S100), 질의 처리 시스템(100)은 질의 개수를 확인한다. 그리고, 질의 처리 시스템(100)은 수신한 복수의 질의에서 최대 갱신 횟수를 계산한다(S101).As illustrated in FIG. 2, when receiving a plurality of queries from the outside (S100), the query processing system 100 checks the number of queries. In addition, the query processing system 100 calculates a maximum update count in the plurality of received queries (S101).

여기서 질의 처리 시스템(100)이 질의 개수를 확인하는 방법은 다양한 방법으로 확인할 수 있으므로, 본 발명의 실시예에서는 어느 하나의 방법으로 한정하지 않는다. Here, since the query processing system 100 checks the number of queries in various ways, the embodiment of the present invention is not limited to any one method.

또한, 대화형 환경에서 질의 처리 시스템이 선형 질의를 처리할 때, 질의 처리 시스템(100)은 가상 히스토그램을 활용하여 예산의 소모를 줄인다. 이는 가상 히스토그램과 매개 변수들을 활용하여, 일정 조건 하에서 원본 히스토그램이 아닌 가상 히스토그램을 통해 질의 결과를 제공하여 예산의 소모를 줄인다.In addition, when the query processing system processes a linear query in an interactive environment, the query processing system 100 utilizes a virtual histogram to reduce the consumption of the budget. It utilizes virtual histograms and parameters to reduce budget consumption by providing query results through virtual histograms rather than original histograms under certain conditions.

이를 통해 질의 처리 시스템(100)은 질의가 저장된 실제 데이터베이스로부터 비롯된 원본 히스토그램과 별개로, 균일 분포 형태의 가상 히스토그램을 생성한다. 이후 가상 히스토그램으로부터 도출된 질의 결과와 원본 히스토그램으로부터 도출된 값에 노이즈를 삽입한 결과를 비교한다. 그리고 차이가 임계값(T) 이하면 가상 히스토그램으로부터 도출된 결과를 배포해 예산을 소모하지 않는다.Through this, the query processing system 100 generates a virtual histogram of a uniform distribution form separately from the original histogram derived from the actual database where the query is stored. After that, the noise result is compared to the query result derived from the virtual histogram and the value derived from the original histogram. If the difference is less than or equal to the threshold T, the results from the virtual histogram are distributed to avoid spending budget.

반면, 차이가 임계값 이상이면 원본 히스토그램으로부터 도출된 값에 노이즈를 삽입한 결과를 배포한다. 그리고 가상 히스토그램을 갱신하여 원본 히스토그램과 유사하게 만들어 나간다. 원본 히스토그램에 노이즈를 삽입한 결과를 배포할 때만 예산이 소모되며 대화형 환경에서 계속해서 유입되는 질의에 대해 위의 과정을 반복해서 수행한다. On the other hand, if the difference is greater than or equal to the threshold, the result of inserting noise into the value derived from the original histogram is distributed. The virtual histogram is then updated to make it similar to the original histogram. The budget is only spent distributing noise into the original histogram, and the above process is repeated for queries that continue to flow in the interactive environment.

이는 임계값(T)과 히스토그램 갱신 매커니즘을 설정하여, 본 발명의 실시예에 따른 질의 처리 방법이 (ε, δ)-차분 프라이버시를 만족시킴을 증명한다. 차분 프라이버시 만족 증명 방법은 다양한 방법으로 증명할 수 있으므로, 본 발명의 실시예에서는 상세한 설명을 생략한다. This establishes the threshold value T and histogram update mechanism, demonstrating that the query processing method according to the embodiment of the present invention satisfies (ε, δ) -differential privacy. Since the differential privacy satisfaction proof method can be proved by various methods, detailed description is omitted in the embodiment of the present invention.

그러나 동일한 질의 집합에 대해 이 방법을 수행하더라도, 진입하는 질의의 진입 순서에 따라 가상 히스토그램의 갱신 횟수가 판이하게 달라지며 이는 기법의 성능편차를 매우 크게 만든다. 이에 대해 도 5를 참조로 먼저 설명한다.However, even if this method is executed for the same set of queries, the number of update of the virtual histogram varies differently according to the order of entry of the entering queries, which greatly increases the performance deviation of the technique. This will be described first with reference to FIG. 5.

도 5는 본 발명의 실시예에 따른 다양한 히스토그램과 라플라스 분포에서 임계값을 벗어난 영역에 대한 예시도이다.5 is an exemplary diagram of a region outside the threshold in various histogram and Laplace distributions according to an embodiment of the present invention.

도 5의 (a)에 도시된 바와 같이 질의에 대한 원본 히스토그램이 존재한다고 가정한다. 그러면, 도 5의 (c-1)과 같이 초기값으로 정규화된 가상 히스토그램은 연속적으로 진입하는 질의에 의해 갱신이 일어난다. 만약 첫 번째 질의 Q1이 bin 2에 대한 질의이고, 두 번째 질의 Q2와 세 번째 질의 Q3이 각각 bin 7, bin 10에 대한 질의라고 가정한다.Assume that there is an original histogram for the query as shown in FIG. Then, as shown in (c-1) of FIG. 5, the virtual histogram normalized to the initial value is updated by a query that continuously enters. Suppose that the first query Q1 is a query for bin 2, and the second query Q2 and the third query Q3 are queries for bin 7, bin 10, respectively.

이 세 질의에 대하여 가상 히스토그램은 도 5의 (c-2), (c-3), (c-4)와 같이 세 번의 갱신이 일어나게 된다. 그러나 만약 일곱 번째 질의 Q7이 bin 7과 bin 10에 대한 질의라면, 도 5의 (b-2)와 같이 Q7를 통한 한 번의 갱신만으로도 Q1, Q2, Q3의 세 번의 갱신의 결과와 유사한 가상 히스토그램을 얻을 수 있다. The three histograms of the three queries are updated as shown in (c-2), (c-3), and (c-4) of FIG. 5. However, if the seventh query Q7 is a query for bin 7 and bin 10, a virtual histogram similar to the result of three updates of Q1, Q2, and Q3 is generated with only one update through Q7 as shown in (b-2) of FIG. You can get it.

배치를 사용하여 배치 내의 질의들을 효율적인 순서로 재배치하면 (ε, δ)-차분 프라이버시의 증명에 영향을 주지 않는다. 그리고, 일정 수준의 성능 수준을 유지할 수 있고, 비슷한 가상 히스토그램을 기존의 방법에 비해 더 적은 횟수의 갱신으로도 얻을 수 있기 때문에, 예산 소모도 줄일 수 있다. Using batch to rearrange queries in batch in an efficient order does not affect the proof of (ε, δ) -differential privacy. In addition, a certain level of performance can be maintained, and similar virtual histograms can be obtained with fewer updates than conventional methods, thereby reducing budget consumption.

상기 도 2를 이어 설명하면, 수신한 복수의 질의에서 최대 갱신 횟수를 계산하기 위하여, 질의 처리 시스템(100)은 초기 가상 히스토그램과 원본 히스토그램 사이의 상대 엔트로피의 상한 값과 질의에 대한 한 번의 히스토그램 갱신이 발생할 때 감소되는 상대 엔트로피 수를 이용한다.Referring to FIG. 2, in order to calculate the maximum number of updates in a plurality of received queries, the query processing system 100 updates the upper limit of relative entropy between the initial virtual histogram and the original histogram and one histogram update for the query. Use the relative entropy number that decreases when this occurs.

즉, 질의에 대해 한 번의 히스토그램이 갱신될 때 η²만큼의 상대 엔트로피 감소가 일어난다. 따라서, 초기 가상 히스토그램(y)와 원본 히스토그램(x) 사이의 상대 엔트로피의 상한은 logH가 된다. 이를 통해, 질의 처리 시스템(100)은 초기 가상 히스토그램이 원본 히스토그램과 유사해질 때 까지는 최대 logH/η²번 갱신이 시도됨을 계산할 수 있다.That is, relative entropy reduction by η ² occurs when one histogram is updated for a query. Thus, the upper limit of the relative entropy between the initial virtual histogram y and the original histogram x becomes logH. Through this, the query processing system 100 may calculate that the maximum logH / η ² update is attempted until the initial virtual histogram is similar to the original histogram.

그러나, 최대 갱신 시도 횟수는 라플라스 분포로부터 계산된 노이즈 삽입을 감안하지 않은 결과이다. 따라서 질의 처리 시스템(100)은 가상 히스토그램으로부터 비롯된 질의의 결과값과 원본 히스토그램으로부터 비롯된 값에 노이즈를 삽입한 결과를 비교하여 추가적인 갱신이 발생함을 인지한다. 여기서, 노이즈의 분포는 라플라스 분포를 따르므로, 임계값을 벗어나는 영역을 p라 할 때 다음 수학식 1과 같이 구할 수 있다.However, the maximum number of update attempts is the result of not considering the noise insertion calculated from the Laplace distribution. Accordingly, the query processing system 100 recognizes that an additional update occurs by comparing the result of inserting noise into the result of the query resulting from the virtual histogram and the value derived from the original histogram. Here, since the distribution of the noise follows the Laplace distribution, p can be obtained as shown in Equation 1 when p is a region deviating from the threshold.

따라서, N*p만큼의 추가적인 갱신이 발생할 경우, 최대

만큼의 갱신이 발생한다. Therefore, if additional updates of N * p occur,

As many updates occur.

질의 처리 시스템(100)은 수학식 1에서 구한 최대 갱신 발생 횟수를 토대로 하나의 질의에 할당할 예산을 결정한다(S102). 즉, 한 번의 갱신 과정에 할당될 수 있는 프라이버시 예산(UE)은 다음 수학식 2와 같다. 질의 처리 시스템(100)은 프라이버시 예산을 갱신할 때 마다 할당한다. The query processing system 100 determines a budget to allocate to one query based on the maximum number of update occurrences obtained in Equation 1 (S102). That is, the privacy budget UE that can be allocated to one update process is shown in Equation 2 below. The query processing system 100 allocates each time the privacy budget is updated.

S102 단계를 통해 예산이 결정되면, 질의 처리 시스템(100)은 배치 크기를 설정한다(S103). 배치 크기는 커질수록 질의에 따른 예산 소모를 최적화할 수 있으나, 배치의 크기가 지나치게 커지면 각 질의의 반응 시간이 늘어난다. 따라서, 배치 크기는 적절한 수준을 유지해야 한다.If the budget is determined through the step S102, the query processing system 100 sets the batch size (S103). As the batch size increases, the budget consumption of the query can be optimized. However, when the batch size becomes too large, the response time of each query increases. Therefore, batch size should be maintained at an appropriate level.

질의 처리 시스템(100)가 배치 크기를 설정하기 위해, 본 발명의 실시예에서는 질의는 모수가 λ인 포아송 분포에 따라 질의가 진입한다고 가정한다. 분포에 의해 이전 배치가 m번째 질의까지 처리했을 때, 그 다음 질의부터 n번째 질의까지 질의가 진입할 확률은 다음 수학식 3과 같다.In order for the query processing system 100 to set a batch size, in the embodiment of the present invention, it is assumed that a query enters according to a Poisson distribution whose parameter is λ. When the previous batch processes the mth query by the distribution, the probability that the query enters from the next query to the nth query is expressed by Equation 3 below.

여기서 f(x)는 모수가 λ일 때, x번째 사건이 발생할 확률을 의미한다.Where f (x) is the probability that the x-th event will occur when the parameter is λ.

질의 처리 시스템(100)은 수학식 3의 포아송 분포에 따라, 질의가 진입할 확률이 질의가 임의로 진입할 확률인 균일 분포 값보다 높아질 때가지 질의들을 하나의 배치에 포함시킨다. 이때, 포아송 분포의 경우 모수에 따라 전체 질의까지 모두 들어올 확률인 P(X ≤ N) = α가 1보다 작을 수 있다. 따라서 포아송 분포 값과 균일 분포 값을 비교하여 다음 수학식 4의 조건이 만족될 때까지 질의들을 배치에 포함시킨다.According to the Poisson distribution of Equation 3, the query processing system 100 includes the queries in one batch until the probability that the query enters becomes higher than the uniform distribution value that is the probability that the query enters arbitrarily. At this time, in the case of the Poisson distribution, P (X ≦ N) = α, which is a probability that all the queries are entered according to the parameter, may be smaller than 1. Therefore, the Poisson distribution value is compared with the uniform distribution value, and the queries are included in the batch until the condition of Equation 4 is satisfied.

단. 포와송 분포의 특성상 모수와 차이가 심할수록 확률 값이 급격이 감소하기 때문에, 배치의 크기가 매우 커질 수 있다. 그러므로 일정 수준의 질의 반응시간을 유지하기 위해 배치 크기의 최대 임계 값은 다음 수학식 5와 같이 설정한다.only. Due to the nature of the Poisson distribution, the larger the difference from the parameter, the smaller the probability value, so the size of the batch can be very large. Therefore, in order to maintain a certain level of query response time, the maximum threshold value of the batch size is set as in Equation 5 below.

ε이 커질 때 최대 갱신 횟수는 감소하고, 질의 수가 많아지더라도 최대 갱신 횟수는 증가한다. 따라서 최대 임계 값은 e나 N의 변화에도 일정 수준의 질의 반응시간을 유지할 수 있는 배치의 크기를 제공한다. S102 단계에서 결정한 예산과 S103 단계에서 설정한 배치 크기를 고려한 알고리즘은 다음 표 1과 같다.When ε becomes large, the maximum update count decreases, and even when the number of queries increases, the maximum update count increases. Therefore, the maximum threshold provides a batch size that can maintain a certain level of query response time even with e or N changes. The algorithm considering the budget determined in step S102 and the batch size set in step S103 is shown in Table 1 below.

Require: x, ε, Q, H, E, N
x: 원본 히스토그램, ε: 질의 당 요구되는 프라이버시 수준
y: 가상 히스토그램, T: 결과값 차이 임계값

: 질의 집합, N: 질의 개수
H: 원본 히스토그램 bin 수, B: 배치 크기
E: 전체 프라이버시 예산, n: 데이터베이스 레코드 수

Predefine: η, σ, T

: 히스토그램 갱신 변수

: 라플라스 분포 분산

: 결과값 차이 임계값Require: x, ε, Q, H, E, N
x: text histogram, ε: privacy level required per quality
y: virtual histogram, T: result difference threshold

: Query set, N: number of queries
H: number of original histogram bins, B: batch size
E: total privacy budget, n: number of database records

Predefine: η, σ, T

: Histogram update variables

: Laplace Distribution Variance

: Result Difference Threshold

표 1에 나타낸 바와 같이, 예산 할당과 배치를 고려한 알고리즘을 실행하기 위해, 원본 히스토그램(x), 질의 당 요구되는 프라이버시 수준(ε), 가상 히스토그램(y), 결과값 차이 임계값(T), 질의 집합(Q), 질의 개수(N), 원본 히스토그램 bin 수, 배치 크기(B), 전체 프라이버시 예산(E), 데이터베이스 레코드 수(n)를 입력 정보로 한다. 또한, 사전 정의(predefine) 값으로 히스토그램 갱신 매개 변수(η), 라플라스 분포 분산(σ), 결과값 차이 임계값(T)이 있다.As shown in Table 1, the original histogram (x), the required privacy level per query (ε), the virtual histogram (y), the resulting difference threshold (T), in order to implement the algorithm considering budget allocation and placement, The query set (Q), the number of queries (N), the original histogram bin number, the batch size (B), the total privacy budget (E), and the number of database records (n) are input information. Predefine values also include a histogram update parameter η, a Laplace distribution variance σ, and a resultant difference threshold T.

표 1에 도시된 알고리즘을 구동하면서, 질의 처리 시스템(100)은 하나의 배치 내에 포함된 적어도 하나의 질의들을 재정렬한다. 즉, 질의 처리 시스템(100)은 가상 히스토그램의 계속된 갱신을 통해 가상 히스토그램을 원본 히스토그램과 유사하게 만들어 나간다. 이는 가상 히스토그램과 원본 히스토그램의 차이를 상대 엔트로피(Relative entropy)의 개념으로 나타낼 때, 질의마다 변화시킬 수 있는 상대 엔트로피의 변화량이 다르기 때문이다. Running the algorithm shown in Table 1, query processing system 100 rearranges at least one query contained in one batch. That is, the query processing system 100 makes the virtual histogram similar to the original histogram through the continuous updating of the virtual histogram. This is because when the difference between the virtual histogram and the original histogram is expressed in terms of relative entropy, the amount of change in relative entropy that can be changed for each query is different.

따라서, 하나의 배치 내의 질의 중에서 상대 엔트로피를 가장 크게 변화시키는 질의를 기준으로 질의를 재배치하여, 가상 히스토그램 갱신을 수행한다. 여기서, 질의 재배치를 위해 질의 처리 시스템(100)은 질의를 거리 값(distance)을 계산하는데, 거리 값(d_i)은 i번째 질의에 대한 원본 히스토그램과 가상 히스토그램의 차이인 a_i-<q_i, y>로 계산하고, a_i = <q_i, x>+A_i이며, A_i는 노이즈가 삽입된 결과 값을 의미한다. 이를 토대로 적은 수의 가상 히스토그램 갱신으로도 가상 히스토그램을 원본 히스토그램과 유사하게 만들 수 있으며, 이를 통해 프라이버시 예산을 절약한다.Therefore, the virtual histogram update is performed by rearranging the query based on the query that changes the relative entropy most significantly among the queries in one batch. Here, for query relocation, the query processing system 100 calculates a query distance value, where the distance value d _i is a _i- <q _i which is the difference between the original histogram and the virtual histogram for the i th query. , y> and a _i = <q _i , x> + A _i , where A _i is the result of noise insertion. Based on this, a small number of virtual histogram updates can make a virtual histogram similar to the original histogram, which saves privacy budget.

한편 상기 도 2의 S103 단계에서 배치 크기가 설정되면, 질의 처리 시스템(100)은 S100 단계에서 수신한 질의를 처리하여 질의한 사용자의 단말(도면 미도시)에 제공한다(S104).On the other hand, if the batch size is set in step S103 of FIG. 2, the query processing system 100 processes the query received in step S100 and provides the query to the user's terminal (not shown) (S104).

다음은, 본 발명의 실시예에 따른 실제 갱신 횟수와 추론된 갱신 횟수를 비교한 예에 대해 도 3을 참조로 설명하고, 본 발명의 실시예에 따른 방법과 일반적인 방법에 따른 갱신 횟수의 비교에 대해 도 4를 참조로 설명한다.Next, an example in which the actual number of updates according to an embodiment of the present invention is compared with the inferred number of updates will be described with reference to FIG. 3, and the comparison between the number of updates according to the method and the general method according to the embodiment of the present invention is described below. This will be described with reference to FIG. 4.

도 3은 본 발명의 실시예에 따른 실제 갱신 횟수와 추론된 갱신 횟수를 비교한 예시도이다. 그리고, 도 4는 본 발명의 실시예에 따른 방법과 일반적인 방법에 따른 갱신 횟수를 비교한 예시도이다.3 is an exemplary view comparing actual update times and inferred update times according to an embodiment of the present invention. 4 is an exemplary view comparing update times according to a method and a general method according to an exemplary embodiment of the present invention.

본 발명의 실시예에서 사용된 데이터는 StatLib의 The National Long-Term Case Study(NLTCS)의 데이터로써, 16개의 이진 속성으로 이루어졌으며 21674개의 레코드로 구성되어 있다. 프라이버시 보호수준 ε 의 변화에 대한 추론된 갱신횟수의 정확도 측정과, 프라이버시 예산의 소모에 대한 실험이며, δ 는 0.001로 고정한 것을 예로 하여 설명한다. The data used in the embodiment of the present invention is data from The National Long-Term Case Study (NLTCS) of StatLib, which consists of 16 binary attributes and 21674 records. This is an example of measuring the accuracy of the inferred update frequency for the change in the privacy protection level ε, and experimenting on the consumption of the privacy budget, with δ fixed at 0.001.

또한, 본 발명의 실시예에 따른 갱신 횟수는 예산 소모를 의미한다. 포아송 분포의 모수는 질의의 수로 설정하였으며, 10,000개의 임의의 선형 질의가 사용된 것을 예로 하여 설명한다. In addition, the update frequency according to the embodiment of the present invention means budget consumption. The parameters of the Poisson distribution are set by the number of queries, and the description will be given by using 10,000 arbitrary linear queries as an example.

먼저 도 3에 도시된 바와 같이, ε가 변화하더라도 실제 갱신횟수와 거의 유사한 값을 추론하는 것을 확인 할 수 있다. 따라서 본 발명의 실시예에 따른 방법으로 계산한 갱신 횟수에 기반하여, 각 갱신 단계에 할당하는 예산이 적합한 것을 확인할 수 있다. 이를 통해 달성하고자 하는 프라이버시 수준(ε)에 대하여, 각 갱신단계에 할당할 수 있는 예산을 정해줄 수 있으며 이는 예산이 한정되어 있을 때 매우 유용하다. First, as shown in FIG. 3, even if ε changes, it can be confirmed that a value almost similar to the actual update frequency is inferred. Therefore, based on the update count calculated by the method according to the embodiment of the present invention, it can be confirmed that the budget allocated to each update step is appropriate. This allows you to set a budget that can be allocated to each renewal stage for the level of privacy ε you want to achieve, which is very useful when the budget is limited.

그리고 도 4에 도시된 바와 같이, 동일한 프라이버시 성능에 대하여 기존기법의 예산 소모 정도와 제안 기법의 예산 소모 정도를 비교한 예시도를 살펴보면, 모든 구간의 프라이버시 보호 수준에 대하여 배치를 통한 질의의 재배열이 실제로 갱신횟수를 줄이는 데 효과적이라는 것을 확인 할 수 있다. 그리고, 무조건적으로 배치의 크기를 늘린 것이 아니라, 적절한 임계값의 설정을 통해 질의의 반응시간을 과도하게 늘리지 않는다. 이를 통해 본 발명의 실시예에 따른 방법이 기존 방법에 비해 동일 성능의 프라이버시 성능에 대하여 소모되는 예산이 적은 것을 확인할 수 있다. And, as shown in Figure 4, looking at the example of comparing the budget consumption of the proposed method and the budget consumption of the existing technique for the same privacy performance, rearrangement of the query through the batch for the privacy protection level of all intervals You can see that this is actually effective in reducing the number of updates. Instead of unconditionally increasing the size of the batch, setting the appropriate thresholds does not excessively increase the response time of the query. Through this, it can be seen that the method according to the embodiment of the present invention consumes less budget for the same privacy performance than the existing method.

이상에서 본 발명의 실시예에 대하여 상세하게 설명하였지만 본 발명의 권리범위는 이에 한정되는 것은 아니고 다음의 청구범위에서 정의하고 있는 본 발명의 기본 개념을 이용한 당업자의 여러 변형 및 개량 형태 또한 본 발명의 권리범위에 속하는 것이다.Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto, and various modifications and improvements of those skilled in the art using the basic concepts of the present invention defined in the following claims are also provided. It belongs to the scope of rights.

Claims

As a query processing system saves privacy budgets,
Calculating a maximum update count for each of a plurality of received queries based on an original histogram and a virtual histogram generated from the original histogram;
Allocating a budget for each of the plurality of queries based on the calculated maximum update count and a predetermined overall privacy budget;
Including at least one query of the plurality of budgeted queries in one batch, and
Relocating at least one query included in the one batch and processing the relocated at least one query
Privacy budget saving method that includes.

The method of claim 1,
The calculating of the maximum update number may include:
Generating a second histogram, the virtual histogram, having a uniform distribution form different from the first histogram, the original histogram, from a database in which the plurality of queries are received;
Calculating a first result value that is a query result value derived from the second histogram and a second result value that is a query result value derived from the first histogram;
Checking whether a difference between the first result value and the second result value deviates from a preset threshold value, and
If the difference deviates from the threshold, distributing a result of inserting noise into the second result value and updating the virtual histogram
Privacy budget saving method that includes.

The method of claim 2,
The calculating of the maximum update number may include:
The maximum update count is based on an upper limit value of a relative entropy between the first histogram and the initialized second histogram, a histogram parameter, the number of the plurality of queries, and a Laplace distribution region that is out of the threshold. How privacy budgets are determined.

The method of claim 1,
Including in one batch,
Including at least one query in the batch until the probability that the query enters according to the Poisson distribution is higher than the uniform distribution value that is the probability that the query enters arbitrarily;
Privacy budget saving method that includes.

The method of claim 4, wherein
Including in the batch,
Setting a maximum threshold value of a batch size based on the maximum number of updates, the number of queries and the overall privacy budget
Privacy budget saving method that includes more.

The method of claim 2,
The step of processing the one query,
Calculating a distance value if the difference between the first result value and the second result value is greater than the threshold value, and
Relocating the query with the largest distance value based on the calculated distance value
Including,
And the distance value is calculated as a difference between the original histogram and the virtual histogram for each of at least one query included in the batch.

A query processing system for processing differential privacy queries,
For each of the plurality of received queries, a maximum update count is calculated based on the original histogram and the virtual histogram generated from the original histogram, and for each of the plurality of queries based on the calculated maximum update count and a predetermined overall privacy budget. A budget allocator that allocates a budget,
A batch size setting unit including at least one query among a plurality of budget-allocated queries in one batch, and
Query processing unit for processing at least one query included in the one batch
Query processing system comprising a.

The method of claim 7, wherein
The budget allocation unit,
The upper limit value of the relative entropy between the first histogram, the original histogram, and the second histogram, from which the virtual histogram is initialized, a histogram parameter, the number of the plurality of queries, and a Laplace distribution region that is out of a threshold value. Query processing system for calculating the maximum number of updates based on.

The method of claim 8,
The first histogram is an original histogram from a database in which the received plurality of queries are stored.
And the second histogram is a virtual histogram of a uniform distribution form different from the first histogram.

The method of claim 9,
The batch size setting unit,
And at least one query is included in the batch until a probability that the query enters according to the Poisson distribution is higher than a uniform distribution value that is a probability that the query enters arbitrarily.

The method of claim 10,
The batch size setting unit,
And the size of the batch is set to a maximum threshold value based on the maximum number of updates, the number of queries, and the total privacy budget.

The method of claim 11,
The batch size setting unit,
Calculating a distance value between an original histogram and a virtual histogram for each of at least one query included in the batch, and relocating the query in the order of the query having the largest distance value based on the distance value.