
CN119740232A - A cross-architecture malicious code detection method and device - Google Patents


Info

Publication number
CN119740232A
Authority
CN
China
Prior art keywords
architecture
operation code
instruction
malicious
arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411808808.2A
Other languages
Chinese (zh)
Inventor
刘义伟
杨宇杰
尹鹏
范星宇
涂腾飞
王森淼
闫伟浩
张琳
葛一正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MILITARY SECRECY QUALIFICATION CERTIFICATION CENTER
Beijing University of Posts and Telecommunications
Original Assignee
MILITARY SECRECY QUALIFICATION CERTIFICATION CENTER
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MILITARY SECRECY QUALIFICATION CERTIFICATION CENTER, Beijing University of Posts and Telecommunications
Priority to CN202411808808.2A
Publication of CN119740232A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

A cross-architecture malicious code detection method and device. The method extracts assembly operation codes under different architectures and establishes a mapping relation between them, so that the meaning of instructions can be understood accurately in cross-architecture analysis and the difficulty of understanding caused by architecture differences is reduced. The mapped instructions are input into a 3-gram model to generate an embedded vector of fixed length, and the Euclidean distance between the embedded vector of a target file and the embedded vector of a malicious sample is calculated to obtain the similarity between the target file and the malicious sample. Building on expert knowledge, the remaining operation codes are analyzed dynamically to obtain a complete instruction mapping relation table, so that the mapped operation codes can be used effectively for malicious code detection. This helps to improve the accuracy of malicious code detection under different architectures, significantly improves the efficiency and accuracy of cross-architecture assembly instruction analysis, improves detection efficiency, and reduces time cost.

Description

Cross-architecture malicious code detection method and device
Technical Field
The invention belongs to the technical field of software security, and particularly relates to a cross-architecture malicious code detection method and device.
Background
Currently, each architecture, whether the x86 architecture or the ARM architecture, has its own unique instruction set and execution scheme. Accurately analyzing and understanding the assembly instructions of different architectures is of great significance for optimizing system performance, improving security and improving software development efficiency.
At present, cross-architecture assembly instruction analysis relies mainly on static and dynamic methods. Static analysis is fast but has limited accuracy. Dynamic analysis has some impact on system performance and cannot cover all cases. Therefore, how to balance accuracy and efficiency in cross-architecture assembly instruction analysis is of practical significance.
Disclosure of Invention
In view of the above, the present invention is directed to a method and a device for detecting cross-architecture malicious codes, so as to solve or partially solve the technical problems mentioned in the background art.
Based on the above object, in a first aspect, the present invention provides a method for detecting cross-architecture malicious code, including:
Acquiring assembly codes in an x86 architecture and assembly codes in an ARM architecture, analyzing the assembly codes in the x86 architecture to extract operation code instructions in the x86 architecture, and analyzing the assembly codes in the ARM architecture to extract the operation code instructions in the ARM architecture;
Performing one-to-one correspondence on the operation code instructions with definite semantics between the x86 architecture and the ARM architecture through a K-means clustering algorithm, and establishing an operation code instruction mapping relation between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture to obtain an operation code instruction mapping relation table;
For the operation code instructions which cannot be directly mapped between the x86 architecture and the ARM architecture, tracking and analyzing the operation code instruction execution results by adopting a dynamic analysis tool, and perfecting the operation code instruction mapping relation table according to the dynamic analysis results of the operation code instructions;
Inputting the operation code instructions for which the mapping relation has been established into a 3-gram model, taking a plurality of consecutive operation code instructions as one sequence, and counting the occurrence frequency of each sequence to generate an embedded vector with a fixed length;
Calculating Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample to obtain similarity between the target file and the malicious sample; and marking the target file as a malicious sample or a normal file according to whether a preset similarity threshold is reached.
As a preferred scheme of the cross-architecture malicious code detection method, an assembly code in the x86 architecture and an assembly code in the ARM architecture are obtained by utilizing a disassembly tool IDA Pro, all operation code instructions in the x86 architecture are extracted from the assembly code of the x86 architecture, and all operation code instructions in the ARM architecture are extracted from the assembly code in the ARM architecture.
As a preferred scheme of the cross-architecture malicious code detection method, in the process of establishing the mapping relation between the operation code instruction in the x86 architecture and the operation code instruction in the ARM architecture, the instruction set difference of the x86 architecture and the ARM architecture is combined, and the instruction set difference comprises instruction format and function difference.
As a preferred scheme of the cross-architecture malicious code detection method, in the process of tracking and analyzing the execution result of the operation code instruction by adopting a dynamic analysis tool, a target program is run in a simulator or an actual system, and the execution process of the target program is monitored;
The equivalence of the target program between the x86 architecture and the ARM architecture is determined by observing the execution results of the instructions in the x86 architecture and the ARM architecture and the values in the registers.
As a preferred scheme of the cross-architecture malicious code detection method, if the similarity between the target file and the malicious sample reaches a preset similarity threshold, marking the target file as the malicious sample;
if the similarity between the target file and the malicious sample does not reach the preset similarity threshold, marking the target file as a normal file.
In a second aspect, the present invention provides a cross-architecture malicious code detection apparatus, comprising:
The operation code instruction extraction module is used for acquiring assembly codes in an x86 architecture and assembly codes in an ARM architecture, analyzing the assembly codes in the x86 architecture to extract operation code instructions in the x86 architecture, analyzing the assembly codes in the ARM architecture to extract operation code instructions in the ARM architecture;
the operation code instruction mapping module is used for carrying out one-to-one correspondence on the operation code instructions with definite semantics between the x86 architecture and the ARM architecture through a K-means clustering algorithm, and establishing the operation code instruction mapping relation between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture to obtain an operation code instruction mapping relation table;
The mapping relation table perfecting module is used for tracking and analyzing an operation code instruction execution result by adopting a dynamic analysis tool for the operation code instruction which cannot be directly mapped between the x86 architecture and the ARM architecture, and perfecting the operation code instruction mapping relation table according to the dynamic analysis result of the operation code instruction;
The embedded vector generation module is used for inputting the operation code instruction establishing the mapping relation into the 3-gram model, taking a plurality of continuous operation code instruction sequences as a sequence, counting the occurrence frequency of each sequence and generating an embedded vector with fixed length;
the target file analysis module is used for calculating Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample to obtain similarity between the target file and the malicious sample, and marking the target file as the malicious sample or the normal file according to whether a preset similarity threshold is reached or not.
As a preferred solution of the cross-architecture malicious code detection device, the operation code instruction extraction module is configured for:
Acquiring assembly codes in the x86 architecture and assembly codes in the ARM architecture by using a disassembly tool IDA Pro, and extracting all operation code instructions in the x86 architecture from the assembly codes of the x86 architecture; and extracting all operation code instructions in the ARM architecture from assembly codes in the ARM architecture.
As a preferred scheme of the cross-architecture malicious code detection device, in the operation code instruction mapping module, the instruction set variability of the x86 architecture and the ARM architecture is combined, and the instruction set variability comprises instruction format and function variability.
As a preferred solution of the cross-architecture malicious code detection device, the mapping relation table perfecting module is configured for:
running a target program in a simulator or an actual system, and monitoring the execution process of the target program;
determining the equivalence of the target program between the x86 architecture and the ARM architecture by observing the execution results of the instructions in the x86 architecture and the ARM architecture and the values in the registers.
As a preferred solution of the cross-architecture malicious code detection device, the target file analysis module is configured for:
If the similarity between the target file and the malicious sample reaches a preset similarity threshold, marking the target file as the malicious sample;
if the similarity between the target file and the malicious sample does not reach the preset similarity threshold, marking the target file as a normal file.
According to the technical scheme, assembly operation codes under different architectures are extracted and a mapping relation is established between them, so that the meaning of instructions can be understood accurately in cross-architecture analysis and the difficulty of understanding caused by architecture differences is reduced. The mapped instructions are input into a 3-gram model to generate an embedded vector with a fixed length, and the Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample is calculated to obtain the similarity between the target file and the malicious sample. On the basis of expert knowledge, the remaining operation codes are analyzed dynamically to obtain a complete instruction mapping relation table, so that the mapped operation codes can be used effectively to detect malicious code. This helps to improve the accuracy of malicious code detection under different architectures, significantly improves the efficiency and accuracy of cross-architecture assembly instruction analysis, improves detection efficiency and reduces time cost.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a schematic flow chart of a method for detecting cross-architecture malicious codes according to an embodiment of the present invention;
FIG. 2 is a schematic technical route diagram of a cross-architecture malicious code detection method according to an embodiment of the present invention;
FIG. 3 is a diagram of a cross-architecture malicious code detection device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present invention have the ordinary meaning understood by one of ordinary skill in the art to which the present invention belongs. The terms "comprising", "including" and the like in the embodiments of the present invention mean that the element or article preceding the term covers the elements or articles listed after the term and their equivalents, without excluding other elements or articles.
In the present digital age, computer systems and various types of smart devices have become an integral part of modern society. Malicious code detection is an important technology in the field of computer security, with the goal of identifying and preventing the threats that malware poses to systems and users. Malicious code exists in a variety of forms, including viruses, worms, trojans and spyware, which may steal user information, corrupt system files, tamper with data or launch network attacks. The widespread use of these systems and devices is inseparable from the underlying software support, and assembly language, as the low-level language closest to the computer hardware, directly affects the performance and functionality of the system. In assembly language, assembly instructions are the most basic units for building program logic and execution flow, and understanding and analyzing them is critical to the optimization and security of the system.
With the continued development of computer technology, the processor architectures involved in malicious code have become diverse and complex, which makes it increasingly important to analyze and understand the assembly instructions of different architectures. Whether the x86 architecture or the ARM architecture, each architecture has its own unique instruction set and execution scheme. Therefore, accurately analyzing and understanding the assembly instructions of different architectures is of great significance for optimizing system performance, improving security and improving software development efficiency.
Currently, cross-architecture assembly instruction analysis relies mainly on static and dynamic methods. Static analysis obtains instruction information by analyzing the assembly code; it is fast but has limited accuracy. Dynamic analysis monitors the running of the program on different architectures to obtain more accurate information about instruction execution, but it has a certain impact on system performance and cannot cover all cases. Thus, how to balance accuracy and efficiency in cross-architecture assembly instruction analysis remains an important challenge.
In view of this, embodiments of the present invention provide a method and a device for detecting cross-architecture malicious code, which can effectively use the mapped instruction operation codes to detect malicious code. This helps to improve the accuracy of malicious code detection under different architectures, significantly improves the efficiency and accuracy of cross-architecture assembly instruction analysis, improves detection efficiency and reduces time cost. The embodiments of the present invention are described in detail below.
In the following implementation, the terms referred to have the following meanings:
x86: an instruction set executed by microprocessors. The name is the abbreviated standard numbering of a series of Intel general-purpose computers, and it also identifies a general set of computer instructions.
ARM architecture: once called the Advanced RISC Machine, a reduced instruction set processor architecture that was originally 32-bit.
Semantics: the actual behavior and effect of statements and expressions in a programming language. For example, the semantics of a programming language dictate how variables are declared, assigned and used, as well as the execution rules of control flow statements (e.g., conditional statements and loop statements).
GDB: a powerful debugging tool that helps programmers debug the programs they write. It can be used to analyze and modify running programs, helping to find errors and problems in the program.
Flag bit: a binary bit used for representing a specific state or condition in a computer architecture. Flag bits are typically stored in a status register or flag register of the processor and are set or cleared by the processor according to particular conditions when an instruction is executed.
IDA Pro: a powerful reverse engineering tool for analyzing and disassembling various types of binary files, including executable files, library files, drivers and the like. It is widely used in security research, vulnerability analysis, malicious code analysis and similar fields.
lea instruction: an instruction in x86 assembly language that loads an effective address.
adrp instruction: an instruction in the ARM architecture that loads the high-order part of an address into a target register.
add instruction: an instruction in assembly language that performs an addition.
adrp + add instruction combination: used to calculate a complete address. In the ARM architecture, this combination is often used to load the address of the data and add an offset to obtain the full memory address.
3-gram: a technique commonly used in text analysis; a statistics-based language model that analyzes combinations of three consecutive units (e.g., letters, words or other symbols) in text.
Referring to fig. 1 and 2, an embodiment of the present invention provides a method for detecting cross-architecture malicious code, including the following steps:
S1, acquiring assembly codes in an x86 architecture and assembly codes in an ARM architecture, analyzing the assembly codes in the x86 architecture to extract operation code instructions in the x86 architecture, and analyzing the assembly codes in the ARM architecture to extract the operation code instructions in the ARM architecture;
S2, carrying out one-to-one correspondence on the operation code instructions with definite semantics between the x86 architecture and the ARM architecture through a K-means clustering algorithm, and establishing an operation code instruction mapping relation between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture to obtain an operation code instruction mapping relation table;
S3, for the operation code instruction which cannot be directly mapped between the x86 architecture and the ARM architecture, tracking and analyzing an operation code instruction execution result by adopting a dynamic analysis tool, and perfecting the operation code instruction mapping relation table according to the dynamic analysis result of the operation code instruction;
S4, inputting the operation code instruction establishing the mapping relation into a 3-gram model, taking a plurality of continuous operation code instruction sequences as a sequence, counting the occurrence frequency of each sequence, and generating an embedded vector with a fixed length;
S5, calculating Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample to obtain similarity between the target file and the malicious sample, and marking the target file as the malicious sample or the normal file according to whether a preset similarity threshold is reached or not.
In this embodiment, in step S1, the assembly code in the x86 architecture and the assembly code in the ARM architecture are obtained by using the disassembly tool IDA Pro, all the operation code instructions in the x86 architecture are extracted from the assembly code of the x86 architecture, and all the operation code instructions in the ARM architecture are extracted from the assembly code of the ARM architecture.
IDA Pro is a powerful disassembly tool for analyzing and disassembling various types of binary files, including executable files, library files, drivers and the like. It is widely used in security research, vulnerability analysis, malicious code analysis and similar fields.
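As an illustration of the extraction in step S1, the following is a minimal sketch that does not call IDA Pro itself. It assumes the disassembly has already been exported as a text listing with one instruction per line in the hypothetical form "address: mnemonic operands". The listing format, the sample instructions and the function name extract_opcodes are assumptions made for the example and are not taken from the embodiment.

```python
import re

# Assumed listing format: "<hex address>: <mnemonic> <operands>".
# This format is hypothetical; a real export may differ.
LINE_RE = re.compile(r"^\s*[0-9A-Fa-f]+:\s+([A-Za-z][\w.]*)")


def extract_opcodes(listing: str) -> list[str]:
    """Return the lower-cased mnemonic of every instruction line, in order."""
    opcodes = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if match:
            opcodes.append(match.group(1).lower())
    return opcodes


# Illustrative listings only; not real samples.
X86_LISTING = """\
401000: xor eax, eax
401002: test ecx, ecx
401004: jle 0x401010
"""

ARM_LISTING = """\
10000: eor w0, w0, w0
10004: tst w1, w1
10008: b.le 0x10020
"""

x86_ops = extract_opcodes(X86_LISTING)
arm_ops = extract_opcodes(ARM_LISTING)
print(x86_ops)
print(arm_ops)
```

Only the mnemonic is kept, and operands are discarded, because the later steps operate on sequences of operation codes.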
In this embodiment, in step S2, in the process of establishing the mapping relationship between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture, the instruction set differences between the x86 architecture and the ARM architecture are taken into account, where the instruction set differences include differences in instruction format and function.
Specifically, for the extracted operation code instruction in the x86 architecture and the extracted operation code instruction in the ARM architecture, a clear operation code mapping relation is established by using a K-means clustering method. This process needs to take into account differences in instruction sets between the x86 architecture and the ARM architecture, including differences in instruction format, functionality, etc., to ensure the accuracy and integrity of the mapping relationship.
In the operation code mapping relation, the K-means clustering method divides operation codes (or similar features) into different categories. The specific steps are as follows:
Preprocessing: preprocess the operation code data, including normalization, outlier handling and the like;
Initialization: randomly select K operation codes as the initial cluster centers;
Assignment: calculate the distance between each operation code and each cluster center, and assign each operation code to the category represented by the nearest cluster center;
Update: recompute the center of each category, typically as the average of all operation codes in the category;
Iteration: repeat the assignment step and the update step until the cluster centers no longer change significantly or a preset number of iterations is reached;
Result analysis: analyze the mapping relations and characteristics of the different operation codes according to the final clustering result.
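The clustering steps above can be sketched as follows. The embodiment does not state how an operation code is turned into a numeric vector, so the two-dimensional features used here (a "logic" flag and a "branch" flag) are purely hypothetical, as are the names kmeans and FEATURES. The sketch shows only the initialization, assignment, update and iteration steps.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: returns (labels, centers) for equal-length vectors."""
    rng = random.Random(seed)
    # Initialization: K randomly chosen points become the first centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment: each point goes to the nearest center (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centers.append(centers[c])  # empty cluster: keep center
        # Iteration stops once neither labels nor centers change.
        if new_labels == labels and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers


# Hypothetical features: (is_logic_operation, is_conditional_branch).
FEATURES = {
    "eor": (1, 0), "xor": (1, 0), "orr": (1, 0), "or": (1, 0),
    "b.le": (0, 1), "jle": (0, 1), "b.gt": (0, 1), "jg": (0, 1),
}

names = list(FEATURES)
labels, centers = kmeans([FEATURES[n] for n in names], k=2)
cluster_of = dict(zip(names, labels))
print(cluster_of)
```

With these features, ARM and x86 operation codes of the same kind end up in the same cluster, which is the grouping from which a mapping entry could then be read off.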
Through the steps, the K-means clustering method can help identify and classify different operation code modes, so that the performance and efficiency of the system are optimized. The operation code instruction mapping relation table style is shown in table 1:
ARM instruction   x86 instruction   Mapping instruction   Instruction meaning
eor               xor               xor                   Exclusive OR
orr               or                or                    Logical OR
mvn               not               not                   Bitwise inversion of the operand
asl               sal               sal                   Arithmetic left shift
rrx               rcr               rcr                   Rotate right through carry
b.le              jle               jle                   Signed comparison; jump if less than or equal
b.lt              jl                jl                    Signed comparison; jump if less than
b.gt              jg                jg                    Signed comparison; jump if greater than
tst               test              cmp                   Logical comparison that changes the flag bits
sdiv              idiv              div                   Signed division
Table 1. Style of the operation code instruction mapping relation table
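One possible way to apply the mapping relation table is sketched below. The rows are copied from Table 1. The function name normalize, the "arm"/"x86" architecture labels and the choice to pass unknown operation codes through unchanged are assumptions made for the example.

```python
# (ARM operation code, x86 operation code, mapped operation code),
# copied from Table 1.
TABLE_1 = [
    ("eor", "xor", "xor"),
    ("orr", "or", "or"),
    ("mvn", "not", "not"),
    ("asl", "sal", "sal"),
    ("rrx", "rcr", "rcr"),
    ("b.le", "jle", "jle"),
    ("b.lt", "jl", "jl"),
    ("b.gt", "jg", "jg"),
    ("tst", "test", "cmp"),
    ("sdiv", "idiv", "div"),
]

ARM_TO_MAPPED = {arm: mapped for arm, _, mapped in TABLE_1}
X86_TO_MAPPED = {x86: mapped for _, x86, mapped in TABLE_1}


def normalize(opcodes, arch):
    """Rewrite an operation code sequence into the shared mapped vocabulary.

    Assumption: operation codes missing from the table are left unchanged.
    """
    table = ARM_TO_MAPPED if arch == "arm" else X86_TO_MAPPED
    return [table.get(op, op) for op in opcodes]


arm_mapped = normalize(["eor", "tst", "b.le"], "arm")
x86_mapped = normalize(["xor", "test", "jle"], "x86")
print(arm_mapped, x86_mapped)
```

After this step, equivalent ARM and x86 fragments produce the same operation code sequence, so they can be compared directly in the later steps.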
In this embodiment, in step S3, in the process of tracking and analyzing the execution result of the operation code instruction by adopting a dynamic analysis tool, a target program is run in a simulator or an actual system, and the execution process of the target program is monitored;
The equivalence of the target program between the x86 architecture and the ARM architecture is determined by observing the execution results of the instructions in the x86 architecture and the ARM architecture and the values in the registers.
Specifically, operation code instructions that cannot be mapped directly, such as the lea instruction in the x86 architecture and the adrp instruction in the ARM architecture, are tracked and analyzed with the dynamic analysis tool GDB. In this step the target program may be run in a simulator or in a real system while its execution is monitored. The equivalence of an instruction between the architectures is then determined by observing the results of executing it on each architecture and the values in the registers. For the adrp instruction, an ARM instruction is 32 bits long, so a single instruction cannot carry a complete address on a 64-bit computer and additional bits are needed; a common solution is to follow the adrp instruction with an add instruction to calculate the 64-bit address. The adrp + add instruction combination can therefore be regarded as an offset address calculation operation.
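The register-level equivalence described above can be illustrated with a small arithmetic model rather than a real GDB session. The sketch assumes the 64-bit ARM form of adrp, which yields the 4 KB page of the program counter plus a signed page count, followed by an add of the low 12 bits, and compares the result with an x86 lea that uses RIP-relative addressing. All addresses and the function names are hypothetical values chosen for the example.

```python
PAGE_MASK = 0xFFF  # low 12 bits: the offset inside a 4 KB page


def adrp(pc, page_delta):
    """Model of adrp: page base of pc plus a signed number of 4 KB pages."""
    return (pc & ~PAGE_MASK) + (page_delta << 12)


def add_imm(reg, imm12):
    """Model of add with an immediate: supplies the low 12 address bits."""
    return reg + imm12


def lea_rip_relative(next_rip, disp32):
    """Model of x86-64 lea reg, [rip + disp32]."""
    return next_rip + disp32


target = 0x412348        # hypothetical data address
arm_pc = 0x400010        # hypothetical address of the adrp instruction
x86_next_rip = 0x401007  # hypothetical address following the lea

# Values an assembler or linker would encode into the instructions.
page_delta = ((target & ~PAGE_MASK) - (arm_pc & ~PAGE_MASK)) >> 12
low12 = target & PAGE_MASK
disp32 = target - x86_next_rip

arm_addr = add_imm(adrp(arm_pc, page_delta), low12)
x86_addr = lea_rip_relative(x86_next_rip, disp32)
print(hex(arm_addr), hex(x86_addr))
```

Both paths leave the same address in the destination register, which is the kind of observation that supports treating the adrp + add combination as a single address calculation operation.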
In this embodiment, in step S4, the mapped operation code instructions are used as input, and the 3-gram model converts sequences of three consecutive operation codes into embedded vectors. These embedded vectors represent the instruction sequence of each file and serve as feature vectors for the similarity calculation in the subsequent step. In step S5, the similarity between the target file and the malicious sample is obtained by calculating the Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample. If the similarity between the target file and the malicious sample reaches a preset similarity threshold, the target file is marked as a malicious sample; if the similarity does not reach the preset similarity threshold, the target file is marked as a normal file.
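Steps S4 and S5 can be sketched as follows, under several assumptions that the embodiment leaves open. The fixed vector length is obtained here by hashing each 3-gram into one of DIM buckets, counts are normalized to relative frequencies, and the Euclidean distance d is converted to a similarity as 1 / (1 + d), so that a smaller distance gives a larger similarity. DIM, the hashing scheme, the conversion formula and the threshold value 0.8 are all hypothetical choices.

```python
import math
import zlib
from collections import Counter

DIM = 64  # hypothetical fixed embedding length


def trigram_embedding(opcodes, dim=DIM):
    """Count 3-grams of consecutive operation codes into a fixed-length vector."""
    grams = Counter(zip(opcodes, opcodes[1:], opcodes[2:]))
    vec = [0.0] * dim
    for gram, count in grams.items():
        # Assumption: a stable hash assigns each 3-gram to one bucket.
        bucket = zlib.crc32(" ".join(gram).encode()) % dim
        vec[bucket] += count
    total = sum(vec)
    # Relative frequencies; sequences shorter than 3 give the zero vector.
    return [v / total for v in vec] if total else vec


def similarity(a, b):
    """Assumed conversion from Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def classify(target_ops, malicious_ops, threshold=0.8):
    """Mark the target as malicious when similarity reaches the threshold."""
    sim = similarity(
        trigram_embedding(target_ops), trigram_embedding(malicious_ops)
    )
    return "malicious" if sim >= threshold else "normal"


sample = ["xor", "cmp", "jle", "xor", "cmp", "jle"]  # illustrative only
print(classify(sample, sample))
```

A file whose mapped operation code sequence is identical to the sample has distance 0 and similarity 1, so it is marked as malicious for any threshold not greater than 1. In practice the threshold would have to be chosen from labeled data.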
In summary, the method obtains assembly codes in the x86 architecture and in the ARM architecture, and analyzes each of them to extract the operation code instructions of that architecture. Operation code instructions with definite semantics are placed in one-to-one correspondence between the x86 architecture and the ARM architecture through a K-means clustering algorithm, and the mapping relation between them is established to obtain an operation code instruction mapping relation table. For operation code instructions that cannot be directly mapped between the two architectures, a dynamic analysis tool is used to track and analyze their execution results, and the mapping relation table is perfected according to the dynamic analysis results. The operation code instructions for which the mapping relation has been established are input into a 3-gram model, a plurality of consecutive operation code instructions are taken as one sequence, and the occurrence frequency of each sequence is counted to generate an embedded vector with a fixed length. The Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample is calculated to obtain the similarity between the target file and the malicious sample, and the target file is marked as a malicious sample or a normal file according to whether a preset similarity threshold is reached.
According to the invention, assembly operation codes under different architectures are extracted and a clear mapping relation is established by using the K-means clustering method, so that the meaning of instructions can be understood accurately in cross-architecture analysis, which helps to eliminate the difficulty of understanding caused by architecture differences. Assembly instructions that remain unmapped are tracked and analyzed with a dynamic analysis tool to determine their equivalence between the architectures, which improves the comprehensiveness and accuracy of the analysis. Conventional static analysis may fail to obtain information that is only available during execution, while dynamic analysis may be affected by system performance and may not cover all possible execution paths. The mapped operation codes are used as input, and the 3-gram model converts sequences of three consecutive operation codes into embedded vectors. The similarity between the target file and the malicious sample is obtained by calculating the Euclidean distance between their embedded vectors, and the target file is marked as a malicious sample or a normal file according to the set similarity threshold. On the basis of the complete instruction mapping relation table, the method generates embedded vectors with the 3-gram model and, through calculation and judgment on these vectors, effectively improves the detection efficiency for cross-architecture malicious code.
It should be noted that the method of the embodiment of the present invention may be performed by a single device, for example a computer or a server. The method of the embodiment can also be applied to a distributed scenario, in which a plurality of devices cooperate to complete it. In such a distributed scenario, one of the devices may perform only one or more steps of the method of the embodiment of the present invention, and the devices interact with each other to complete the method.
It should be noted that the foregoing describes some embodiments of the present invention. In some cases, the recited acts or steps may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Referring to fig. 3, based on the same inventive concept, corresponding to the method of any embodiment, an embodiment of the present invention further provides a cross-architecture malicious code detection device, including:
The operation code instruction extracting module 100 is configured to obtain an assembly code in an x86 architecture and an assembly code in an ARM architecture, parse the assembly code in the x86 architecture to extract an operation code instruction in the x86 architecture, and parse the assembly code in the ARM architecture to extract an operation code instruction in the ARM architecture;
The operation code instruction mapping module 200 is configured to perform one-to-one correspondence on operation code instructions with definite semantics between the x86 architecture and the ARM architecture by using a K-means clustering algorithm, and establish an operation code instruction mapping relationship between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture, so as to obtain an operation code instruction mapping relationship table;
The mapping relation table perfecting module 300 is configured to track and analyze an execution result of an operation code instruction by using a dynamic analysis tool for the operation code instruction which cannot be directly mapped between the x86 architecture and the ARM architecture, and perfects the mapping relation table of the operation code instruction according to the dynamic analysis result of the operation code instruction;
The embedded vector generation module 400 is configured to input an operation code instruction for creating a mapping relationship into the 3-gram model, use a plurality of continuous operation code instruction sequences as a sequence, count occurrence frequency of each sequence, and generate an embedded vector with a fixed length;
The target file analysis module 500 is configured to calculate the Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample, obtain the similarity between the target file and the malicious sample, and mark the target file as a malicious sample or a normal file according to whether the similarity reaches the preset similarity threshold.
In this embodiment, in the operation code instruction extracting module 100:
Acquiring assembly codes in the x86 architecture and assembly codes in the ARM architecture by using a disassembly tool IDA Pro, and extracting all operation code instructions in the x86 architecture from the assembly codes of the x86 architecture; and extracting all operation code instructions in the ARM architecture from assembly codes in the ARM architecture.
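As a rough illustration of the extraction step, the following Python sketch pulls the operation code (mnemonic) from each line of a disassembly listing. It does not call IDA Pro; it assumes the listing has already been exported as plain text in a simplified form (optional `label:` prefix, mnemonic, operands, `;` comments). That format and the function name `extract_opcodes` are assumptions made for illustration.

```python
def extract_opcodes(listing):
    """Extract lower-cased mnemonics from a simplified assembly listing."""
    opcodes = []
    for raw in listing.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop a trailing comment
        if not line:
            continue
        tokens = line.split()
        # skip a leading label such as "loc_401000:"
        if tokens[0].endswith(":"):
            tokens = tokens[1:]
        if not tokens or tokens[0].startswith("."):
            continue  # label-only line or assembler directive
        opcodes.append(tokens[0].lower())
    return opcodes
```

The same function handles both architectures because only the first token of each instruction line is kept; operands, where x86 and ARM syntax differ most, are discarded.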
In this embodiment, in the opcode instruction mapping module 200, the instruction set differences of the x86 architecture and the ARM architecture are combined, and the instruction set differences include instruction format and function differences.
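The source does not say which features the K-means clustering operates on. The sketch below therefore assumes each operation code has been given a small hand-assigned feature vector (data movement, arithmetic, control flow), clusters the vectors with a plain Lloyd's-algorithm K-means, and pairs opcodes that end up alone together in a cluster. The feature values, `FEATURES`, `kmeans` and `build_mapping` are all hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid by Euclidean distance
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical (data_move, arithmetic, control_flow) scores per opcode.
FEATURES = {
    ("x86", "mov"): (1.0, 0.0, 0.0),
    ("x86", "add"): (0.0, 1.0, 0.0),
    ("x86", "jmp"): (0.0, 0.0, 1.0),
    ("arm", "mov"): (0.9, 0.1, 0.0),
    ("arm", "add"): (0.1, 0.9, 0.0),
    ("arm", "b"): (0.0, 0.1, 0.9),
}


def build_mapping(features, k):
    """Map x86 opcodes to ARM opcodes that share a cluster one-to-one."""
    keys = list(features)
    labels = kmeans([features[key] for key in keys], k)
    clusters = {}
    for (arch, opcode), label in zip(keys, labels):
        clusters.setdefault(label, {}).setdefault(arch, []).append(opcode)
    # keep only clusters holding exactly one opcode from each architecture
    return {
        members["x86"][0]: members["arm"][0]
        for members in clusters.values()
        if len(members.get("x86", [])) == 1 and len(members.get("arm", [])) == 1
    }
```

With these toy features, `build_mapping(FEATURES, 3)` pairs `mov` with `mov`, `add` with `add`, and `jmp` with `b`. Clusters that contain several opcodes from one architecture are left out of the table, which corresponds to the operation codes that the text hands over to dynamic analysis.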
In this embodiment, the mapping relation table perfecting module 300:
running a target program in a simulator or an actual system, and monitoring the execution process of the target program;
The execution results of instructions in the x86 architecture and the ARM architecture, together with the values in registers, are observed to determine the equivalence of the target program between the x86 architecture and the ARM architecture.
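One way to picture the register-based equivalence check is the Python sketch below. It does not drive a real simulator or debugger; the traces are hand-written stand-ins for what such a tool might record, each entry being a pair of (input register values, result register value). The trace format, the `min_shared` evidence threshold and the function names are assumptions.

```python
def observed_behaviour(trace):
    """Collapse a trace of (inputs, result) records into a lookup table."""
    return {tuple(inputs): result for inputs, result in trace}


def equivalent(trace_a, trace_b, min_shared=2):
    """Treat two opcodes as equivalent if they agree on all shared inputs."""
    a = observed_behaviour(trace_a)
    b = observed_behaviour(trace_b)
    shared = a.keys() & b.keys()
    if len(shared) < min_shared:
        return False  # not enough common observations to decide
    return all(a[inputs] == b[inputs] for inputs in shared)


# Hypothetical traces: same inputs fed to an x86 and two ARM instructions.
X86_IMUL = [((3, 4), 12), ((5, 6), 30)]
ARM_MUL = [((3, 4), 12), ((5, 6), 30)]
ARM_SUB = [((3, 4), -1), ((5, 6), -1)]
```

Here `equivalent(X86_IMUL, ARM_MUL)` holds while `equivalent(X86_IMUL, ARM_SUB)` does not, so only the first pair would be added to the mapping relation table. Agreement on a finite set of inputs is evidence of equivalence rather than proof, which is why a minimum number of shared observations is required.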
In this embodiment, in the target file analysis module 500:
If the similarity between the target file and the malicious sample reaches a preset similarity threshold, marking the target file as the malicious sample;
if the similarity between the target file and the malicious sample does not reach the preset similarity threshold, marking the target file as a normal file.
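The threshold decision can be sketched in Python as follows. The source states that the similarity is obtained from the Euclidean distance but does not give the conversion; the sketch assumes similarity = 1 / (1 + distance), so that identical vectors score 1.0. The default threshold of 0.8 and the names `similarity` and `classify` are likewise illustrative.

```python
import math


def similarity(vec_a, vec_b):
    """Map Euclidean distance to (0, 1]; 1.0 means identical vectors.

    Assumption: the 1 / (1 + d) conversion is not given in the source.
    """
    return 1.0 / (1.0 + math.dist(vec_a, vec_b))


def classify(target_vec, malicious_vecs, threshold=0.8):
    """Label the target by its best similarity to any malicious sample."""
    best = max(similarity(target_vec, m) for m in malicious_vecs)
    return "malicious" if best >= threshold else "normal"
```

Comparing against the closest malicious sample is one possible design; averaging over several samples of the same family would be another, and the text leaves that choice open.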
The device of the foregoing embodiment is configured to implement a cross-architecture malicious code detection method according to any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the invention also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method for detecting the cross-architecture malicious code according to any embodiment when executing the program.
Fig. 4 shows a more specific hardware architecture of an electronic device provided by the present embodiment, which may include a processor 610, a memory 620, an input/output interface 630, a communication interface 640, and a bus 650. Wherein processor 610, memory 620, input/output interface 630, and communication interface 640 enable communication connections among each other within the device via bus 650.
The processor 610 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 620 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage, dynamic storage, and the like. Memory 620 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present specification are implemented in software or firmware, relevant program codes are stored in memory 620 and invoked for execution by processor 610.
The input/output interface 630 is used for connecting with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 640 is used to connect a communication module (not shown in the figure) to enable communication interaction between the present device and other devices. The communication module may implement communication in a wired mode (such as USB or network cable) or in a wireless mode (such as mobile network, WiFi, or Bluetooth).
Bus 650 includes a path to transfer information between components of the device (e.g., processor 610, memory 620, input/output interface 630, and communication interface 640).
It should be noted that although the above device only shows the processor 610, the memory 620, the input/output interface 630, the communication interface 640, and the bus 650, in the implementation, the device may further include other components necessary for achieving normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement a cross-architecture malicious code detection method according to any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
Based on the same inventive concept, the present invention also provides a non-transitory computer readable storage medium corresponding to the method of any embodiment, wherein the non-transitory computer readable storage medium stores computer instructions for causing the computer to execute a cross-architecture malicious code detection method according to any embodiment.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to execute a cross-architecture malicious code detection method according to any one of the foregoing embodiments, and has the beneficial effects of corresponding method embodiments, which are not described herein.
It will be appreciated by persons skilled in the art that the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the invention is limited to these examples, that combinations of technical features in the above embodiments or in different embodiments may also be implemented in any order, and that many other variations of the different aspects of the embodiments of the invention as described above exist within the spirit of the invention, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the drawings provided to simplify the illustration and discussion, and so as not to obscure embodiments of the present invention. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present invention are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that embodiments of the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like, which are within the spirit and principles of the embodiments of the invention, are intended to be included within the scope of the invention.

Claims (10)

1. A method of cross-architecture malicious code detection, comprising:
Acquiring assembly codes in an x86 architecture and assembly codes in an ARM architecture, analyzing the assembly codes in the x86 architecture to extract operation code instructions in the x86 architecture, and analyzing the assembly codes in the ARM architecture to extract the operation code instructions in the ARM architecture;
Performing one-to-one correspondence on the operation code instructions with definite semantics between the x86 architecture and the ARM architecture through a K-means clustering algorithm, and establishing an operation code instruction mapping relation between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture to obtain an operation code instruction mapping relation table;
for the operation code instruction which cannot be directly mapped between the x86 architecture and the ARM architecture, tracking and analyzing an operation code instruction execution result by adopting a dynamic analysis tool, and perfecting the operation code instruction mapping relation table according to the dynamic analysis result of the operation code instruction;
Inputting an operation code instruction establishing a mapping relation into a 3-gram model, taking a plurality of continuous operation code instruction sequences as a sequence, and counting the occurrence frequency of each sequence to generate an embedded vector with a fixed length;
Calculating Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample to obtain similarity between the target file and the malicious sample; and marking the target file as a malicious sample or a normal file according to whether a preset similarity threshold is reached.
2. The method for detecting the cross-architecture malicious code according to claim 1, wherein a disassembly tool IDA Pro is utilized to obtain the assembly codes in the x86 architecture and the ARM architecture, to extract all the operation code instructions in the x86 architecture from the assembly codes in the x86 architecture, and to extract all the operation code instructions in the ARM architecture from the assembly codes in the ARM architecture.
3. The method for detecting the cross-architecture malicious code according to claim 1, wherein in the process of establishing the mapping relation between the operation code instruction in the x86 architecture and the operation code instruction in the ARM architecture, the instruction set difference of the x86 architecture and the ARM architecture is combined, and the instruction set difference comprises instruction format and function difference.
4. The method for detecting the cross-architecture malicious codes according to claim 1, wherein in the process of tracking and analyzing the execution result of the operation code instruction by adopting a dynamic analysis tool, a target program is operated in a simulator or an actual system, and the execution process of the target program is monitored;
The execution results of instructions in the x86 architecture and the ARM architecture, together with the values in registers, are observed to determine the equivalence of the target program between the x86 architecture and the ARM architecture.
5. The method for detecting cross-architecture malicious codes according to claim 1, wherein if the similarity between the target file and the malicious sample reaches a preset similarity threshold, marking the target file as the malicious sample;
if the similarity between the target file and the malicious sample does not reach the preset similarity threshold, marking the target file as a normal file.
6. A cross-architecture malicious code detection apparatus, comprising:
The operation code instruction extraction module is used for acquiring assembly codes in an x86 architecture and assembly codes in an ARM architecture, analyzing the assembly codes in the x86 architecture to extract operation code instructions in the x86 architecture, analyzing the assembly codes in the ARM architecture to extract operation code instructions in the ARM architecture;
the operation code instruction mapping module is used for carrying out one-to-one correspondence on the operation code instructions with definite semantics between the x86 architecture and the ARM architecture through a K-means clustering algorithm, and establishing the operation code instruction mapping relation between the operation code instructions in the x86 architecture and the operation code instructions in the ARM architecture to obtain an operation code instruction mapping relation table;
The mapping relation table perfecting module is used for tracking and analyzing an operation code instruction execution result by adopting a dynamic analysis tool for the operation code instruction which cannot be directly mapped between the x86 architecture and the ARM architecture, and perfecting the operation code instruction mapping relation table according to the dynamic analysis result of the operation code instruction;
The embedded vector generation module is used for inputting the operation code instruction establishing the mapping relation into the 3-gram model, taking a plurality of continuous operation code instruction sequences as a sequence, counting the occurrence frequency of each sequence and generating an embedded vector with fixed length;
the target file analysis module is used for calculating Euclidean distance between the embedded vector of the target file and the embedded vector of the malicious sample to obtain similarity between the target file and the malicious sample, and marking the target file as the malicious sample or the normal file according to whether a preset similarity threshold is reached or not.
7. The cross-architecture malicious code detection apparatus of claim 6, wherein, in the operation code instruction extraction module:
Acquiring assembly codes in the x86 architecture and assembly codes in the ARM architecture by using a disassembly tool IDA Pro, and extracting all operation code instructions in the x86 architecture from the assembly codes of the x86 architecture; and extracting all operation code instructions in the ARM architecture from assembly codes in the ARM architecture.
8. The cross-architecture malicious code detection device of claim 6, wherein the opcode instruction mapping module combines instruction set differences of the x86 architecture and the ARM architecture, the instruction set differences comprising instruction format, functional differences.
9. The cross-architecture malicious code detection device of claim 6, wherein the mapping relation table perfecting module:
running a target program in a simulator or an actual system, and monitoring the execution process of the target program;
The execution results of instructions in the x86 architecture and the ARM architecture, together with the values in registers, are observed to determine the equivalence of the target program between the x86 architecture and the ARM architecture.
10. The cross-architecture malicious code detection apparatus of claim 6, wherein, in the target file analysis module:
If the similarity between the target file and the malicious sample reaches a preset similarity threshold, marking the target file as the malicious sample;
if the similarity between the target file and the malicious sample does not reach the preset similarity threshold, marking the target file as a normal file.
CN202411808808.2A 2024-12-10 2024-12-10 A cross-architecture malicious code detection method and device Pending CN119740232A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411808808.2A CN119740232A (en) 2024-12-10 2024-12-10 A cross-architecture malicious code detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411808808.2A CN119740232A (en) 2024-12-10 2024-12-10 A cross-architecture malicious code detection method and device

Publications (1)

Publication Number Publication Date
CN119740232A true CN119740232A (en) 2025-04-01

Family

ID=95131248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411808808.2A Pending CN119740232A (en) 2024-12-10 2024-12-10 A cross-architecture malicious code detection method and device

Country Status (1)

Country Link
CN (1) CN119740232A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination