KR102151318B1

KR102151318B1 - Method and apparatus for malicious detection based on heterogeneous information network

Info

Publication number: KR102151318B1
Application number: KR1020180165522A
Authority: KR
Inventors: 김성열; 은상남; 진치국; 강호석
Original assignee: 건국대학교 산학협력단
Priority date: 2018-12-19
Filing date: 2018-12-19
Publication date: 2020-09-02
Anticipated expiration: 2038-12-19
Also published as: KR20200076426A

Abstract

A method and apparatus for detecting malicious code based on a heterogeneous information network are disclosed. A malicious software detection method according to an embodiment includes extracting features from Portable Executable (PE) files, generating a heterogeneous information network (HIN) representing the relationships between the PE files and the features, and detecting whether the PE files are malicious software by using a meta-path on the HIN.

Description

Malware detection method and device based on heterogeneous information network {METHOD AND APPARATUS FOR MALICIOUS DETECTION BASED ON HETEROGENEOUS INFORMATION NETWORK}

The following embodiments relate to a method and apparatus for detecting malicious code based on a heterogeneous information network.

Methods for analyzing malicious code in software include signature scanning, cyclic redundancy check (CRC) scanning, and heuristic scanning.

Signature scanning is one of the ways a security program diagnoses malicious code, much as a fingerprint is used to identify a person. The unique character strings (patterns) contained in malicious code are collected and stored in a database, and the security program analyzes malicious code by matching against those patterns.

CRC scanning is a kind of error detection method originally used to verify the reliability of data in serial transmission. It has the advantage of a low misdiagnosis rate, but the disadvantage that malicious code can no longer be diagnosed if the data is altered by even one byte.
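
The sensitivity of checksum matching to small changes can be illustrated with a minimal sketch. It uses the standard `zlib.crc32` function; the sample bytes, the `known_bad_crcs` set, and the `crc_match` helper are hypothetical and are not taken from the patent.

```python
import zlib

# Hypothetical "known malicious" sample and a variant that differs by one bit.
original = b"MZ\x90\x00 example payload bytes"
variant = bytearray(original)
variant[-1] ^= 0x01  # flip a single bit in the last byte
variant = bytes(variant)

# Hypothetical checksum database built from the original sample only.
known_bad_crcs = {zlib.crc32(original)}


def crc_match(data: bytes) -> bool:
    """Return True if the CRC32 of data appears in the checksum database."""
    return zlib.crc32(data) in known_bad_crcs


print(crc_match(original))  # the unmodified sample is recognized
print(crc_match(variant))   # the one-bit variant is not
```

Because CRC32 detects every single-bit change, the variant always produces a different checksum and is missed by an exact-match lookup.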

Recently, heuristic techniques, which improve on signature scanning, have become the main approach to malicious code analysis. They are a form of learning-based analysis that analyzes the behavior or methods of malicious code and learns from them. Malicious software often uses distinctive combinations of API calls; a heuristic technique learns such combinations and determines from the API calls whether code is malicious.

With the rapid development of the Internet, various types of security threats have increased rapidly, and among them malicious software for the traditional PC platform is the most widespread. Because of the development of complex packing, obfuscation, anti-sandboxing, virtual penetration, and other techniques, existing malicious software detection methods are unsatisfactory.

In the age of information networks, a growing volume of malicious software poses a serious threat to security. Detecting malicious software attacks in a timely and effective manner is particularly important. As malicious software becomes increasingly sophisticated, new defense technologies are required that can detect and respond to new attacks and threats.

Embodiments may provide a technique that analyzes PE files to extract features, builds a HIN representing the relationships among the features, and then uses a meta-path-based method to detect whether a PE file is malicious software.

A malicious software detection method according to an embodiment includes extracting features from Portable Executable (PE) files, generating a heterogeneous information network (HIN) representing the relationships between the PE files and the features, and detecting whether the PE files are malicious software by using a meta-path on the HIN.

The features may include PE header information, API calls, DLLs, and Opcode sequences.

The detecting may include detecting whether the PE files are malicious software by computing the similarity between the PE files using the meta-path on the HIN.

The relationships may include a first relationship between a PE file and an API call, a second relationship concerning the property values of the PE header information of a PE file, a third relationship concerning the package to which an API call belongs, a fourth relationship concerning the API sequence of a PE file, and a fifth relationship concerning the Opcode sequence of a PE file.

The meta-path may include a first meta-path constructed from a first matrix representing the first relationship, a second meta-path constructed from a second matrix representing the second relationship, a third meta-path constructed from a third matrix representing the third relationship, a fourth meta-path constructed from a fourth matrix representing the fourth relationship, and a fifth meta-path constructed from a fifth matrix representing the fifth relationship.

The meta-path may further include a meta-path obtained by optimizing and linearly combining the first meta-path, the second meta-path, the third meta-path, the fourth meta-path, and the fifth meta-path through multi-kernel learning.

The meta-path may be a meta-path obtained by using multi-kernel learning to optimize and linearly combine a first meta-path constructed from a first matrix representing the first relationship, a second meta-path constructed from a second matrix representing the second relationship, a third meta-path constructed from a third matrix representing the third relationship, a fourth meta-path constructed from a fourth matrix representing the fourth relationship, and a fifth meta-path constructed from a fifth matrix representing the fifth relationship.

A malicious software detection apparatus according to an embodiment includes a receiver that receives Portable Executable (PE) files, and a controller that extracts features from the PE files, generates a heterogeneous information network (HIN) representing the relationships between the PE files and the features, and detects whether the PE files are malicious software by using a meta-path on the HIN.

The controller may detect whether the PE files are malicious software by computing the similarity between the PE files using the meta-path on the HIN.

FIG. 1 shows an apparatus that performs a malicious software detection method according to an embodiment.
FIG. 2 is a schematic block diagram of the malicious software detection apparatus shown in FIG. 1.
FIG. 3 is a schematic block diagram of the controller shown in FIG. 2.
FIG. 4 is a schematic block diagram of the PE file analyzer shown in FIG. 3.
FIGS. 5A and 5B are diagrams illustrating an example of the feature extraction operation of the PE file analyzer shown in FIG. 3.
FIG. 6 is a diagram illustrating the operation of the HIN generator shown in FIG. 3.
FIG. 7 is a diagram illustrating the multi-kernel learner and classification model shown in FIG. 3.
FIG. 8 is a graph illustrating the results of an experiment performed with the malicious software detection method according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of rights of the patent application is not restricted or limited by these embodiments. All changes, equivalents, and substitutes of the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are for description only and should not be construed as limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Terms such as first or second may be used to describe various components, but the components should not be limited by these terms. The terms serve only to distinguish one component from another; for example, without departing from the scope of rights according to the concept of the embodiments, a first component may be named a second component, and similarly a second component may be named a first component.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known technologies are omitted where they would unnecessarily obscure the subject matter of the embodiments.

FIG. 1 shows an apparatus that performs a malicious software detection method according to an embodiment.

The malicious software detection apparatus 100 does not rely on API calls alone. It also analyzes the relationships between them and generates higher-level semantics so that attackers cannot easily evade detection, and thereby performs a new malicious software detection method. In this way, the malicious software detection apparatus 100 may detect malicious code on various operating systems, such as Android malware and Windows malware.

The malicious software detection apparatus 100 builds a heterogeneous information network (HIN) from the rich relationships between software and the related APIs, and then performs a meta-path-based method to analyze the semantic relevance of the software and the APIs.

The malicious software detection apparatus 100 computes the similarity between pieces of software (for example, PE files) using each meta-path, and trains a detection model by aggregating the different similarities using multi-kernel learning (MKL).

The malicious software detection apparatus 100 may achieve a relatively high detection rate and a low false detection rate.

FIG. 2 is a schematic block diagram of the malicious software detection apparatus shown in FIG. 1.

The malicious software detection apparatus 100 includes a receiver 200, a controller 300, and a memory 400.

The memory 400 may store instructions (or programs) executable by the controller 300. For example, the instructions may include instructions for executing the operation of each component of the controller 300 (310 to 370 shown in FIG. 3).

The receiver 200 may receive PE files.

The controller 300 may control the overall operation of the malicious software detection apparatus 100. The controller 300 may detect whether the PE files received from the receiver 200 are malicious software.

The controller 300 may analyze the PE files and examine the relationships between them in more detail, including whether they use the same package name or contain the same property value. The controller 300 may perform higher-level semantic analysis through the relationships between APIs and PE files and the various types of relationships among the PE files themselves.

To express the rich semantics of these relationships, the controller 300 may generate a structured heterogeneous information network (HIN) representation of the PE files and APIs. The controller 300 may then use meta-paths to incorporate higher-level semantics and establish the semantic relevance of the PE files.

In this way, the controller 300 may compute the similarity between PE files not only from whether they use the same APIs but also from whether their usage patterns, such as use of the same package, are similar. Because different paths describe the similarity between the same two PE files differently, the controller 300 may use multi-kernel learning algorithms to learn the weights of the different similarities automatically from data.

The controller 300 may detect whether the PE files are malicious software by using the meta-path learned through the multi-kernel learning algorithm.

FIG. 3 is a schematic block diagram of the controller shown in FIG. 2.

The controller 300 includes a PE file analyzer (PE File Analyzer) 310, a HIN generator (HIN Constructor) 330, a multi-kernel learner (Multikernel Learner) 350, and a classification model (Classification Model) 370.

The PE file analyzer 310 may parse the PE tables of all executable files (PE files) and extract, as features, all PE header information, the DLL names, the Opcode sequence, and the API functions within each DLL.
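
As a rough illustration of the header-parsing part of this step, the sketch below reads the COFF file header of a PE image with the standard `struct` module, following the publicly documented PE layout. The function names `parse_pe_header` and `make_minimal_pe` are hypothetical, the synthetic bytes are not a runnable executable, and the extraction of DLL names, API functions, and Opcode sequences (which would need import-table parsing and a disassembler) is not shown.

```python
import struct

COFF_FIELDS = (
    "Machine", "NumberOfSections", "TimeDateStamp",
    "PointerToSymbolTable", "NumberOfSymbols",
    "SizeOfOptionalHeader", "Characteristics",
)


def parse_pe_header(data: bytes) -> dict:
    """Return the COFF file header fields of a PE image as a dict."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte PE signature.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def make_minimal_pe() -> bytes:
    """Build a tiny synthetic header for demonstration purposes only."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after stub
    coff = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 0xE0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff


header = parse_pe_header(make_minimal_pe())
print(header["Machine"], header["NumberOfSections"])
```

Each returned name and value can then be treated as one property and property value of the PE file.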

The HIN generator 330 may construct the HIN based on the extracted features. The HIN generator 330 may first establish the links between the PE files and the extracted API calls, and then define the relationship types among the API calls and the links between the PE files and the PE file information.

Adjacency matrices between the different object types are then generated, and the commuting matrices of the different meta-paths may be enumerated and constructed.
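
A minimal sketch of how commuting matrices might be formed, assuming the common HIN convention that the commuting matrix of a meta-path is the product of the adjacency matrices along that path. The matrices `A` and `B`, their contents, and the helper names are hypothetical examples rather than data from the patent.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical PE-to-API adjacency: A[i][j] = 1 if PE file i calls API j.
A = [
    [1, 1, 0],  # pe0 calls api0 and api1
    [0, 1, 1],  # pe1 calls api1 and api2
    [0, 0, 1],  # pe2 calls api2
]

# Hypothetical API-to-API adjacency: B[j][k] = 1 if APIs j and k share a package.
B = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]

# Meta-path PE -> API -> PE: number of API calls two PE files have in common.
M_api = matmul(A, transpose(A))

# Meta-path PE -> API -> (same package) -> API -> PE.
M_pkg = matmul(matmul(A, B), transpose(A))

print(M_api)
print(M_pkg)
```

Both results are symmetric PE-by-PE matrices, so each can serve as a similarity table between PE files under one meta-path.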

Given the commuting matrices of the HIN, the multi-kernel learner 350 may build kernels for support vector machines (SVMs). The multi-kernel learner 350 may optimize the weights of the different meta-paths using standard multi-kernel learning. Given the meta-path weights, all commuting matrices can be combined to formulate a more powerful malware detection kernel.
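
The combination step can be sketched as a weighted sum of per-meta-path kernel matrices. In the sketch below the weights are fixed by hand, whereas in multi-kernel learning they would be optimized from training data; the function `combine_kernels`, the example matrices, and the weight values are all assumptions made for illustration, and the SVM training itself is not shown.

```python
def combine_kernels(kernels, weights):
    """Return the weighted sum of square kernel matrices.

    Weights must be non-negative and are normalized to sum to 1.
    """
    if not kernels or len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    betas = [w / total for w in weights]
    n = len(kernels[0])
    return [[sum(b * k[i][j] for b, k in zip(betas, kernels))
             for j in range(n)]
            for i in range(n)]


# Hypothetical kernels from two meta-paths over the same two PE files.
K1 = [[2, 1], [1, 2]]
K2 = [[4, 0], [0, 4]]

# Hypothetical weights; a multi-kernel learner would estimate these.
K = combine_kernels([K1, K2], [1, 3])
print(K)
```

The combined matrix `K` could then be passed to any kernel classifier that accepts a precomputed kernel.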

The classification model 370 may also be called a malicious software detector. For each newly collected unknown piece of software, the PE file analyzer 310 first parses its PE file, and then the PE header information, DLL names, Opcode sequence, and API calls are extracted and analyzed further. Using the classification model 370 built on these extracted features, the PE file (for example, software) may be classified as either benign or malicious.

FIG. 4 is a schematic block diagram of the PE file analyzer shown in FIG. 3, and FIGS. 5A and 5B are diagrams illustrating an example of the feature extraction operation of the PE file analyzer shown in FIG. 3.

FIGS. 4 and 5 describe in detail how the PE files are represented using the extracted features and how the classification problem is solved based on those features.

The PE file analyzer 310 may include a decompiler 313, a feature extractor 315, and an analyzer 317. The decompiler 313 may pre-process the PE files and decompile the pre-processed PE files. The feature extractor 315 may automatically extract features, such as API calls and other related information, from the PE files.

At this point, the API calls may be converted into a group of global integer IDs representing the static execution sequence of those API calls. Likewise, the PE file information, Opcode sequence, and DLLs may be converted into corresponding global integer IDs.
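
One simple way to obtain such global integer IDs is a shared vocabulary that assigns the next unused integer to each name the first time it is seen. The `Vocabulary` class below is a hypothetical sketch of that idea, not the encoding specified by the patent.

```python
class Vocabulary:
    """Assigns a stable integer ID to each distinct token, in first-seen order."""

    def __init__(self):
        self._ids = {}

    def encode(self, tokens):
        """Map a sequence of names to IDs, adding unseen names as needed."""
        return [self._ids.setdefault(t, len(self._ids)) for t in tokens]

    def __len__(self):
        return len(self._ids)


vocab = Vocabulary()
first = vocab.encode(["RegCloseKey", "NtClose", "GetProcessHeap"])
second = vocab.encode(["NtClose", "CreateFileA"])
print(first, second)
```

Because the vocabulary is shared across files, the same API name receives the same ID in every PE file, which keeps the encoded sequences comparable.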

To extract the API sequence, the feature extractor 315 must first generate a control flow graph. A control flow is the sequence of program statements along an execution path made up of individual assembler instructions. Each function is treated as a basic block, that is, a collection of consecutive assembly instructions. The entry of a control flow is the start instruction of a basic block, and control jumps out of the basic block after its last instruction. The assembly code contains several sub-functions, and these sub-functions add jumps between basic blocks, which together form the control flow graph of the whole program, as shown in FIG. 5A.

Depending on their role in the control flow graph, assembly instructions can be divided into general, jmp, jcc, call, return, and start instructions. To extract the API calls, only the call instruction nodes, the main function entry, and the returns are kept. In this way, the feature extractor 315 obtains a simplified control flow graph, as shown in FIG. 5B. Finally, the feature extractor 315 traverses the simplified control flow graph in instruction order to extract the API sequence (API call sequence).
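
The filtering and traversal just described can be sketched as follows. This is a deliberately simplified, linear version that assumes instructions are already classified into the six kinds and that address order stands in for traversal order; the `Instr` record, the addresses, and the helper names are hypothetical, and a real control flow graph would also carry edges between basic blocks.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    address: int
    kind: str                     # general, jmp, jcc, call, return, or start
    target: Optional[str] = None  # callee name, for call instructions only


KEPT_KINDS = {"start", "call", "return"}


def simplify(instrs):
    """Keep only start, call, and return nodes, ordered by address."""
    kept = (i for i in instrs if i.kind in KEPT_KINDS)
    return sorted(kept, key=lambda i: i.address)


def api_sequence(instrs):
    """Return callee names in instruction order from the simplified listing."""
    return [i.target for i in simplify(instrs)
            if i.kind == "call" and i.target]


# Hypothetical disassembly, given out of order on purpose.
listing = [
    Instr(0x1010, "call", "RegCloseKey"),
    Instr(0x1000, "start"),
    Instr(0x1004, "general"),
    Instr(0x1008, "jcc"),
    Instr(0x1020, "call", "NtClose"),
    Instr(0x1030, "return"),
]

print(api_sequence(listing))
```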

Extraction of the opcode sequence is similar to extraction of the API sequence. A control flow graph is built for the opcodes, and the flow graph is then traversed to extract all possible execution sequences. The feature extractor 315 traverses loops only according to the self-loops contained in each sub-function, which allows the opcode sequence of the PE files to be extracted.

The analyzer 317 may generate matrices from the extracted features. These matrices concern the PE files and may express the relationships between the PE files and the extracted features. For example, the relationships may include a first relationship between a PE file and an API call, a second relationship concerning the property values of the PE header information of a PE file, a third relationship concerning the package to which an API call belongs, a fourth relationship concerning the API sequence of a PE file, and a fifth relationship concerning the Opcode sequence of a PE file.

■ PE header information

PE header information includes important information such as name, size, offset, and type. These items are called properties, and their values are called property values. Two PE files that have the same property value share a similarity. To represent this kind of relationship, the analyzer 317 generates a property-value matrix, each element of which indicates whether a pair of properties has the same value.
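
One possible layout for such a matrix is sketched below, under the assumption that rows are PE files and columns are distinct (property, value) pairs, so that two PE files share a property value when both rows have a 1 in the same column. The patent does not fix this layout in the text above; the function name and the header dictionaries are hypothetical.

```python
def property_value_matrix(headers):
    """Build a PE-by-(property, value) incidence matrix.

    headers: one dict per PE file, mapping property name to value.
    Returns the sorted column labels and the 0/1 matrix.
    """
    columns = sorted({(k, v) for h in headers for k, v in h.items()})
    index = {pv: j for j, pv in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in headers]
    for i, h in enumerate(headers):
        for pv in h.items():
            matrix[i][index[pv]] = 1
    return columns, matrix


# Hypothetical header properties of two PE files.
headers = [
    {"Machine": 0x14C, "NumberOfSections": 3},
    {"Machine": 0x14C, "NumberOfSections": 5},
]

columns, P = property_value_matrix(headers)
print(columns)
print(P)
```

Here both files have a 1 in the `("Machine", 332)` column, which records the shared property value.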

■ API package

API calls can be used to represent the behavior of a PE file, and the relationships between API calls can carry important information for malicious software detection. Every API call in a PE file belongs to a DLL (dynamic link library), which may be the same as or different from that of other calls. We found that API calls belonging to the same DLL always indicate the same intent. APIs in the same DLL are said to belong to the same package.

According to the functions they provide, Windows APIs can be classified into seven categories: basic services (kernel32.dll, advapi32.dll, etc.), graphics device interface, graphical user interface (user32.dll), common dialog link library, universal space link library, Windows shell, and web services.

For example, the API calls in the DLL "advapi32.DLL" are related to registry calls. API calls that appear together in the same package indicate a strong relationship between them.

To represent this kind of relationship, the analyzer 317 generates a co-package matrix, each element of which indicates whether a pair of API calls is in the same package. For example, two different APIs in the same package "advapi32.DLL", "RegDeleteKeyA" and "RegQueryValueExA", are called in two PE files. The value of the matrix element representing the relationship between these two APIs may be 1.
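
A minimal sketch of a co-package matrix, assuming the package of each API is known as a mapping from API name to DLL name. The `co_package_matrix` function and the `api_to_dll` mapping are hypothetical, and DLL names are compared case-insensitively as an added assumption.

```python
def co_package_matrix(api_to_dll):
    """Return sorted API names and a 0/1 matrix of same-DLL membership."""
    apis = sorted(api_to_dll)
    matrix = [
        [1 if api_to_dll[a].lower() == api_to_dll[b].lower() else 0
         for b in apis]
        for a in apis
    ]
    return apis, matrix


# Hypothetical API-to-DLL mapping.
api_to_dll = {
    "RegDeleteKeyA": "advapi32.DLL",
    "RegQueryValueExA": "advapi32.dll",
    "CreateFileA": "kernel32.dll",
}

apis, B = co_package_matrix(api_to_dll)
print(apis)
print(B)
```

The entry for the pair "RegDeleteKeyA" and "RegQueryValueExA" is 1 because both belong to advapi32, matching the example in the text.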

■ API sequence

While a PE file executes, the call sequence of its API calls represents an important relationship. API call sequence (behavior) data is collected, and 2-sequences of the API feature are used to represent the relationship between PE files. For example, if the API call sequence of a PE file is "RegCloseKey → NtClose → GetProcessHeap", it is defined by two features, "RegCloseKey → NtClose" and "NtClose → GetProcessHeap", and this feature type is recorded as an API sequence. Thus, if two PE files have the same API sequence, we consider that they may have some similarity. To represent this kind of relationship, the analyzer 317 generates an API-sequence matrix, each element of which indicates whether a pair of properties has the same API sequence.
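
The 2-sequence features amount to the adjacent pairs of a call sequence. The sketch below extracts them and checks whether two PE files share at least one; the helper names and the second and third call lists are hypothetical.

```python
def two_sequences(calls):
    """Return the set of adjacent pairs (2-sequences) in a call sequence."""
    return set(zip(calls, calls[1:]))


def share_two_sequence(calls_a, calls_b):
    """Return True if the two call sequences have a 2-sequence in common."""
    return bool(two_sequences(calls_a) & two_sequences(calls_b))


pe_a = ["RegCloseKey", "NtClose", "GetProcessHeap"]  # example from the text
pe_b = ["CreateFileA", "RegCloseKey", "NtClose"]     # hypothetical
pe_c = ["NtClose", "RegCloseKey"]                    # hypothetical, reversed order

print(sorted(two_sequences(pe_a)))
print(share_two_sequence(pe_a, pe_b), share_two_sequence(pe_a, pe_c))
```

Order matters: `pe_c` contains the same two API names as one feature of `pe_a`, but in the opposite order, so no 2-sequence is shared.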

■ Opcode sequence

In malware analysis, the opcode feature captures the behavioral characteristics of the executing process and can describe malware in more detail. 2-sequences of the opcode feature are used to represent the relationship between PE files. For example, a PE file with the opcode sequence "push → push → call" is defined by two features, "push → push" and "push → call". This feature type is recorded as an Op-sequence. Thus, two PE files that have the same Op-sequence can be considered to share some similarity. To represent this kind of relationship, the analyzer 317 generates an Op-sequence matrix, in which each element indicates whether a pair of properties has the same Op-sequence.
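The text does not state how the Op-sequence matrix is indexed. The sketch below assumes, purely for illustration, a binary matrix over PE files whose entry is 1 when two files share at least one opcode 2-sequence. The file names, traces, and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def op_2_sequences(opcodes: List[str]) -> Set[Tuple[str, str]]:
    """Set of adjacent opcode pairs in one PE file's opcode trace."""
    return set(zip(opcodes, opcodes[1:]))


def shared_sequence_matrix(files: Dict[str, List[str]]) -> List[List[int]]:
    """Assumed layout: entry [i][j] is 1 if files i and j share a 2-sequence."""
    names = sorted(files)
    feats = [op_2_sequences(files[n]) for n in names]
    return [[1 if a & b else 0 for b in feats] for a in feats]


# Hypothetical opcode traces for three PE files.
files = {
    "a.exe": ["push", "push", "call"],
    "b.exe": ["mov", "push", "call"],
    "c.exe": ["mov", "add", "ret"],
}
matrix = shared_sequence_matrix(files)
```

Here `a.exe` and `b.exe` share "push → call", while `c.exe` shares no 2-sequence with either.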

A summary of the different relationships and of each element in the relationship matrices may be as shown in Table 1.

FIG. 6 is a diagram for explaining the operation of the HIN generator shown in FIG. 3.

FIG. 6 describes how PE files are represented with a HIN, using the features extracted above, in order to better analyze the rich relationship types of the APIs.

The HIN generator 330 may construct a HIN based on the matrices. A HIN is a graph $G = (V, E)$ with a link type mapping function $\psi: E \rightarrow \mathcal{R}$ and an object type mapping function $\phi: V \rightarrow \mathcal{A}$. Here, each object $v \in V$ belongs to a particular object type $\phi(v) \in \mathcal{A}$, and each link $e \in E$ belongs to a particular relation $\psi(e) \in \mathcal{R}$. When the number of object types $|\mathcal{A}| > 1$ or the number of link types $|\mathcal{R}| > 1$, the network is called a heterogeneous information network (HIN).
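The definition above can be sketched as a small typed graph. This is an illustrative data structure under assumed names (`TypedGraph`, the relation label `"contains"`), not the HIN generator 330 itself.

```python
from typing import Dict, Tuple


class TypedGraph:
    """Graph with an object type mapping and a link type mapping."""

    def __init__(self) -> None:
        self.object_type: Dict[str, str] = {}            # object -> object type
        self.link_type: Dict[Tuple[str, str], str] = {}  # link -> relation

    def add_object(self, name: str, type_: str) -> None:
        self.object_type[name] = type_

    def add_link(self, src: str, dst: str, relation: str) -> None:
        self.link_type[(src, dst)] = relation

    def is_heterogeneous(self) -> bool:
        """True when there is more than one object type or more than one link type."""
        return (len(set(self.object_type.values())) > 1
                or len(set(self.link_type.values())) > 1)


# Hypothetical objects and relation label.
g = TypedGraph()
g.add_object("sample.exe", "PE file")
g.add_object("RegCloseKey", "API call")
g.add_link("sample.exe", "RegCloseKey", "contains")
```

A graph whose objects all share one type and whose links all share one relation would return `False` from `is_heterogeneous`.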

The system for detecting malicious software has five object types: PE file, PE header information, Opcode sequence, API call, and API sequence. There are four relationship types: a PE file containing an API call and PE header information, API calls within the same package, PE files with the same API sequence, and PE files with the same Opcode sequence. Because the different object types and relationship types form a rich network of similarity relationships, the HIN generator 330 can use the meta-path approach of the HIN to formulate higher-level semantic relationships between objects.

A meta path $\mathcal{P}$ is a path on the graph of the network schema $T_G = (\mathcal{A}, \mathcal{R})$. It is written in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_L} A_{L+1}$, which defines a composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_L$ between type $A_1$ and type $A_{L+1}$. Here, "$\circ$" denotes the composition operator on relations. For PE files, a typical meta path is $\text{PE} \rightarrow \text{API} \rightarrow \text{PE}$, which means that two different PE files can be connected through the same API in the HIN. The HIN generator 330 may use the PathSim method to calculate the similarity of objects through a meta path.
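To make the idea of two PE files being connected through the same API concrete, the following sketch enumerates the path instances of a PE file, API call, PE file meta path from a small link table. The file names and the `calls` table are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical PE file -> API call links.
calls: Dict[str, Set[str]] = {
    "a.exe": {"RegCloseKey", "NtClose"},
    "b.exe": {"NtClose", "GetProcessHeap"},
    "c.exe": {"GetProcessHeap"},
}


def path_instances(x: str, y: str) -> List[Tuple[str, str, str]]:
    """All PE -> API -> PE path instances that connect x to y."""
    return [(x, api, y) for api in sorted(calls[x] & calls[y])]


print(path_instances("a.exe", "b.exe"))
```

Files with no API call in common, such as `a.exe` and `c.exe` here, have no path instance under this meta path.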

The PathSim method is a meta-path-based similarity measure. Given a symmetric meta path $\mathcal{P}$, the PathSim between two objects $x$ and $y$ of the same type is defined as follows.

$$s(x, y) = \frac{2 \times |\{p_{x \rightsquigarrow y} : p_{x \rightsquigarrow y} \in \mathcal{P}\}|}{|\{p_{x \rightsquigarrow x} : p_{x \rightsquigarrow x} \in \mathcal{P}\}| + |\{p_{y \rightsquigarrow y} : p_{y \rightsquigarrow y} \in \mathcal{P}\}|}$$

Here, $p_{x \rightsquigarrow y}$ is a path instance between $x$ and $y$, $p_{x \rightsquigarrow x}$ is a path instance between $x$ and $x$, and $p_{y \rightsquigarrow y}$ is a path instance between $y$ and $y$.

This shows that, given a meta path $\mathcal{P}$, the similarity $s(x, y)$ is defined in terms of two parts: (1) the connectivity of the two objects, defined by the number of paths between them that follow the meta path, and (2) the balance of their visibility, where the visibility of an object is defined as the number of path instances between the object and itself. Multiple occurrences of a path instance are counted through the weight of the path instance, which is the product of the weights of all links in the path instance.
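A minimal sketch of this measure, assuming the path-instance counts are already available as a square matrix `m` in which `m[x][y]` is the number of path instances between objects `x` and `y` under a symmetric meta path. The function name and the example counts are hypothetical.

```python
from typing import List


def pathsim(m: List[List[float]], x: int, y: int) -> float:
    """PathSim from a matrix of path-instance counts under a symmetric meta path."""
    denom = m[x][x] + m[y][y]  # visibility of x plus visibility of y
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical path-instance counts for three objects.
counts = [
    [2, 1, 0],
    [1, 2, 1],
    [0, 1, 1],
]
print(pathsim(counts, 0, 1))
```

Dividing by the two self-counts keeps highly connected objects from dominating the ranking, which is the balancing role of visibility described above.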

Given a network $G = (V, E)$ and its network schema $T_G$, the commuting matrix $M_{\mathcal{P}}$ for a meta path $\mathcal{P} = (A_1 A_2 \cdots A_{L+1})$ is defined as $M_{\mathcal{P}} = W_{A_1 A_2} W_{A_2 A_3} \cdots W_{A_L A_{L+1}}$, where $W_{A_i A_j}$ is the adjacency matrix between type $A_i$ and type $A_j$. $M_{\mathcal{P}}(i, j)$ represents the number of path instances between object $x_i \in A_1$ and object $y_j \in A_{L+1}$ under the meta path $\mathcal{P}$.

For example, let the adjacency matrix between PE files and API calls be $S$. Then the commuting matrix of PE files computed using the meta path $\text{PE} \rightarrow \text{API} \rightarrow \text{PE}$ is $M = S S^{T}$. If $S_i$ denotes the $i$-th row of $S$, the similarity of PE files $i$ and $j$ is given by $S_i S_j^{T}$, which is simply the dot product of two feature vectors. More complex similarities can be defined by commuting matrices based on other meta paths; that is, the similarity between two apps (e.g., PE files) is computed without considering only the API calls inside them.
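The matrix product described above can be sketched in plain Python. The adjacency matrix below is a hypothetical example with PE files as rows and API calls as columns; the name `S` follows the meta path notation SS^T used in the experiments, and the helper names are assumptions.

```python
from typing import List

Matrix = List[List[int]]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


# Hypothetical PE-file-by-API-call adjacency matrix (rows: PE files).
S = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
M = matmul(S, transpose(S))
print(M)
```

Each entry `M[i][j]` equals the dot product of rows `i` and `j` of `S`, that is, the number of API calls shared by the two PE files.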

FIG. 7 is a diagram for explaining the multi-kernel learner and the classification model shown in FIG. 3.

Given a network schema with several types of entities and relationships, many meta paths can be enumerated. An intuitive approach is therefore to combine different meta paths.

When classifying PE files, the multi-kernel learner 350 may use a multi-kernel learning algorithm to automatically integrate the different similarities and determine the weight of each meta path. Suppose there are $K$ meta paths $\mathcal{P}_k$, $k = 1, \ldots, K$. The multi-kernel learner 350 may compute the commuting matrices $M_{\mathcal{P}_k}$ corresponding to the $K$ meta paths, where each $M_{\mathcal{P}_k}$ is regarded as a kernel. If a commuting matrix is not positive semi-definite (PSD), its negative eigenvalues are removed. The multi-kernel learner 350 may then form a new kernel as a linear combination of the kernels, as follows.

$$M = \sum_{k=1}^{K} \beta_k M_{\mathcal{P}_k}$$

Here, $\beta_k \geq 0$ and $\sum_{k=1}^{K} \beta_k = 1$ is satisfied.
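The linear combination of kernels can be sketched as follows. This is an illustration only: the weights are hand-picked rather than learned, the eigenvalue-removal step for non-PSD matrices is omitted, and the name `combine_kernels` is an assumption.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combine_kernels(kernels: Sequence[Matrix], betas: Sequence[float]) -> Matrix:
    """Weighted sum of kernels with nonnegative weights that sum to one."""
    if len(kernels) != len(betas):
        raise ValueError("one weight per kernel is required")
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    n = len(kernels[0])
    return [[sum(b * k[i][j] for k, b in zip(kernels, betas)) for j in range(n)]
            for i in range(n)]


# Two hypothetical 2x2 kernels and hand-picked (not learned) weights.
k1 = [[2.0, 1.0], [1.0, 2.0]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
combined = combine_kernels([k1, k2], [0.75, 0.25])
print(combined)
```

Because a nonnegative combination of PSD matrices is itself PSD, the combined matrix remains a valid kernel whenever each input is one.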

To learn the weight of each meta path, assume a set of labeled data $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ is a PE file ($i$ can be regarded as its ID) and $y_i$ is its label. The multi-kernel learner 350 may then learn the parameters using a $p$-norm multi-kernel learning framework with the following objective function.
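The objective function itself is not reproduced in this text. For orientation only, a commonly used soft-margin form of the $p$-norm multi-kernel learning objective is shown below. It is an assumption and may differ from the embodiment: $w_k$ is the parameter vector of kernel $k$, $\beta_k$ its weight, $\xi_i$ the slack of sample $i$, $\Phi_k$ the feature mapping of kernel $k$, and the constant $C$, the bias $b$, and the label convention $y_i \in \{+1, -1\}$ are introduced here for illustration.

```latex
\min_{\beta, w, b, \xi} \; \frac{1}{2} \sum_{k=1}^{K} \frac{\|w_k\|_2^2}{\beta_k} + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad
y_i \Big( \sum_{k=1}^{K} \langle w_k, \Phi_k(x_i) \rangle + b \Big) \ge 1 - \xi_i, \;
\xi_i \ge 0, \; \beta_k \ge 0, \; \|\beta\|_p^2 \le 1
```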

Here, a parameter vector $w_k$ is learned for each kernel. For each data point $x_i$, a slack parameter $\xi_i$ is introduced to allow for misclassification. $\Phi_k(x)$ is a nonlinear mapping of features into the Hilbert space that defines the kernel, where $\langle \Phi_k(x_i), \Phi_k(x_j) \rangle = M_{\mathcal{P}_k}(i, j)$. Applying the representer theorem then yields an expression for $w_k$ in terms of the mapped training samples and coefficients $\alpha_i$. The coefficients $\alpha$ can be solved using the dual formulation, and the nonzero $\alpha_i$ correspond to support vectors.

In the multi-kernel learning framework, the set of parameters other than $w_k$ is $\beta_k$. Here, the $p$-norm $\|\beta\|_p = \left(\sum_{k} \beta_k^{p}\right)^{1/p}$ is used to regularize the optimization of the $\beta_k$. Empirically, the 2-norm can be applied to the problem. After optimization, the weights $\beta_k$ indicate the importance of the meta paths used as kernels. When a new PE file $x$ arrives, the learned decision function is used to evaluate whether the PE file is malicious.

FIG. 8 is a graph for explaining experimental results obtained with the malicious software detection method according to an embodiment.

The experiment was conducted on 1000 malicious samples and 1000 benign samples. The malicious samples included worms, backdoors, Trojans, rootkits, and the like, and the benign samples included viewers, games, browsers, and the like. To evaluate the performance of the proposed method, accuracy, true positive ratio (TPR), true negative ratio (TNR), false positive ratio (FPR), and false negative ratio (FNR) were measured; these measures are shown in Table 2.
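The five measures can be computed from confusion-matrix counts as sketched below. The sketch assumes that malicious is the positive class; the function name `rates` and the example counts are hypothetical and are not taken from the experiment.

```python
from typing import Dict


def rates(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy and the four ratios from confusion-matrix counts."""
    pos, neg = tp + fn, tn + fp  # actual malicious / actual benign (assumed)
    return {
        "accuracy": (tp + tn) / (pos + neg),
        "TPR": tp / pos,
        "FNR": fn / pos,
        "TNR": tn / neg,
        "FPR": fp / neg,
    }


# Hypothetical counts, not taken from the experiment.
r = rates(tp=90, fn=10, tn=80, fp=20)
print(r)
```

TPR and FNR sum to one, as do TNR and FPR, so reporting one ratio of each pair determines the other.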

In this set of experiments, the malicious and benign samples were randomly divided into five subsets. Four subsets were used to build the classification model (or detection model), comprising 800 benign samples and 800 malicious samples. The remaining subset was used to test the model, comprising 200 benign samples and 200 malicious samples.
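A split of this shape can be sketched as follows. The sketch assumes that each class is shuffled and divided separately, which reproduces the 800/200 counts per class, and that the class size is divisible by five. The sample identifiers, seed, and function name are hypothetical.

```python
import random
from typing import List, Tuple


def five_way_split(samples: List[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle, cut into five equal subsets, return (four for training, one for test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // 5
    subsets = [shuffled[i * size:(i + 1) * size] for i in range(5)]
    train = [s for subset in subsets[:4] for s in subset]
    return train, subsets[4]


# Hypothetical sample identifiers; each class is split separately.
benign = [f"benign_{i}" for i in range(1000)]
malicious = [f"malicious_{i}" for i in range(1000)]
b_train, b_test = five_way_split(benign)
m_train, m_test = five_way_split(malicious)
print(len(b_train), len(b_test), len(m_train), len(m_test))
```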

Features were extracted from the experimental set and the relationships between them were generated; five meta paths were then constructed and their detection performance was compared using a support vector machine (SVM). In addition, all of the meta paths (SS^T, SVS^T, SPS^T, SAS^T, SOS^T) were used in an experiment applying multi-kernel learning. The results of these experiments are shown in Table 3 and FIG. 8.

Among the experiments that applied a model built from a single meta path, the results based on the API sequence meta path were the best, with a true positive ratio (TPR) of 95.5% and a false positive ratio (FPR) of 3.5%.

In contrast, the results based on the meta path of PE header information attribute values were poor, with a TPR of 84.5% and an FPR of 16.5%. This suggests that the attributes of the PE header information contain many unreliable attribute values, which can degrade the experimental results.

Combining the detection models of all meta paths using multi-kernel learning (MKL) gave the best results: a TPR of 98.5%, an FPR of 2%, and an accuracy of 98.25%. These results show that the proposed method is effective.
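As a consistency check on the reported multi-kernel figures, assume that the ratios were measured on the 200 malicious and 200 benign test samples and that the false detection rate is the false positive ratio (both are assumptions). The reported accuracy then follows from the other two figures:

```latex
\text{TP} = 0.985 \times 200 = 197, \qquad \text{TN} = (1 - 0.02) \times 200 = 196

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{400} = \frac{197 + 196}{400} = 0.9825
```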

Based on the above experimental results, the malicious software detection method according to the embodiment may filter the attributes of the PE header information to remove unreliable attributes, and may extract and use more important features that improve the performance of the classification model. The method may also extend the length of the meta paths and further increase the connections between meta paths (e.g., SOAO^TS^T, SVPV^TS^T).

A new malicious software detection method based on a heterogeneous information network (HIN) has been described with reference to FIGS. 1 to 8. The embodiments analyze PE files to extract PE header information, API calls, DLLs, and opcodes as features, build a HIN over the relationships between the attributes, and then use a meta-path-based method to describe the semantic relatedness of the PE files.

To build a detection system (for example, a classification model), the embodiments may apply each meta path to calculate the similarity between PE files, and may use multi-kernel learning to aggregate the different similarities.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A malicious software detection method of a malicious software detection device, the method comprising:
extracting, by the malicious software detection device, characteristics from Portable Executable (PE) files;
generating, by the malicious software detection device, a heterogeneous information network (HIN) for a relationship between the PE files and the characteristics; and
detecting, by the malicious software detection device, whether the PE files are malicious software using a meta path on the HIN,
wherein the relationship comprises:
a first relationship for a PE file and an API call;
a second relationship for an attribute value of PE header information of the PE file;
a third relationship for a package to which the API call belongs;
a fourth relationship for an API sequence of the PE file; and
a fifth relationship for an opcode sequence of the PE file.

The method of claim 1,
wherein the characteristics include PE header information, an API call, a DLL, and an Opcode sequence.

The method of claim 1,
wherein the detecting comprises detecting whether the PE files are malicious software by calculating a similarity between the PE files using the meta path on the HIN.

(Deleted)

The method of claim 1,
wherein the meta path comprises:
a first meta path constructed through a first matrix expressing the first relationship;
a second meta path constructed through a second matrix expressing the second relationship;
a third meta path constructed through a third matrix expressing the third relationship;
a fourth meta path constructed through a fourth matrix expressing the fourth relationship; and
a fifth meta path constructed through a fifth matrix expressing the fifth relationship.

The method of claim 5,
wherein the meta path further comprises a meta path obtained by linearly combining the first meta path, the second meta path, the third meta path, the fourth meta path, and the fifth meta path, optimized through multi-kernel learning.

The method of claim 1,
wherein the meta path is a meta path obtained by linearly combining, optimized through multi-kernel learning:
a first meta path constructed through a first matrix expressing the first relationship;
a second meta path constructed through a second matrix expressing the second relationship;
a third meta path constructed through a third matrix expressing the third relationship;
a fourth meta path constructed through a fourth matrix expressing the fourth relationship; and
a fifth meta path constructed through a fifth matrix expressing the fifth relationship.

A malicious software detection device comprising:
a receiver configured to receive Portable Executable (PE) files; and
a controller configured to extract characteristics from the PE files, generate a heterogeneous information network (HIN) for a relationship between the PE files and the characteristics, and detect whether the PE files are malicious software using a meta path on the HIN,
wherein the relationship comprises:
a first relationship for a PE file and an API call;
a second relationship for an attribute value of PE header information of the PE file;
a third relationship for a package to which the API call belongs;
a fourth relationship for an API sequence of the PE file; and
a fifth relationship for an opcode sequence of the PE file.

The device of claim 8,
wherein the characteristics include PE header information, an API call, a DLL, and an Opcode sequence.

The device of claim 8,
wherein the controller detects whether the PE files are malicious software by calculating a similarity between the PE files using the meta path on the HIN.

(Deleted)

The device of claim 8,
wherein the meta path comprises:
a first meta path constructed through a first matrix expressing the first relationship;
a second meta path constructed through a second matrix expressing the second relationship;
a third meta path constructed through a third matrix expressing the third relationship;
a fourth meta path constructed through a fourth matrix expressing the fourth relationship; and
a fifth meta path constructed through a fifth matrix expressing the fifth relationship.

The device of claim 12,
wherein the meta path further comprises a meta path obtained by linearly combining the first meta path, the second meta path, the third meta path, the fourth meta path, and the fifth meta path, optimized through multi-kernel learning.

The device of claim 8,
wherein the meta path is a meta path obtained by linearly combining, optimized through multi-kernel learning:
a first meta path constructed through a first matrix expressing the first relationship;
a second meta path constructed through a second matrix expressing the second relationship;
a third meta path constructed through a third matrix expressing the third relationship;
a fourth meta path constructed through a fourth matrix expressing the fourth relationship; and
a fifth meta path constructed through a fifth matrix expressing the fifth relationship.