CN119883435A - Instruction conversion memory conflict optimization method based on instruction stream feature recognition - Google Patents
Instruction conversion memory conflict optimization method based on instruction stream feature recognition
- Publication number
- CN119883435A (application CN202510377539.7A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- shared memory
- arm
- memory
- instruction sequence
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44552—Conflict resolution, i.e. enabling coexistence of conflicting executables
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/3017—Runtime instruction translation, e.g. macros
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses an instruction conversion memory conflict optimization method based on instruction stream feature recognition. While an executable file is pre-executed on an ARM many-core system by dynamic instruction conversion, instruction stream analysis is used to identify the memory shared among processes and form a first shared memory list. The executable file is then loaded and executed again: each memory access instruction whose address lies in the first shared memory list is converted into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence, and conversion execution of the executable file is completed based on a cache coherence protocol. This significantly reduces the memory access conflicts of x86 programs on the ARM many-core system and provides strong support for high-performance dynamic instruction conversion.
Description
Technical Field
The invention belongs to the technical field of computer instruction conversion and multi-core systems, and particularly relates to an instruction conversion memory conflict optimization method based on instruction stream feature recognition.
Background
When an ARM many-core system executes an x86 program by dynamic instruction conversion, multiple ARM computing cores may access memory at the same time, which easily causes memory access conflicts and degrades the performance of the system. Existing memory conflict solutions struggle with complex memory access patterns, so memory access latency increases markedly and the overall performance of the ARM many-core system falls short of expectations.
Disclosure of Invention
In view of this, the invention provides an instruction conversion memory conflict optimization method based on instruction stream feature recognition, which uses instruction stream analysis to recognize shared memory features and, on that basis, completes the dynamic conversion from x86 instructions to ARM instructions.
The invention provides an instruction conversion memory conflict optimization method based on instruction stream feature recognition, which specifically comprises the following steps:
Step 1: pre-execute an executable file on the ARM many-core system by dynamic instruction conversion; use instruction stream analysis to determine a first x86 instruction sequence related to an x86 system call, analyze the first ARM instruction sequence obtained by converting the first x86 instruction sequence, and determine the memory shared among processes to form a first shared memory list;
Step 2: load and execute the executable file by dynamic instruction conversion; if the current instruction to be converted is a memory access instruction and its memory address is in the first shared memory list, convert it into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence, then go to step 3 if it is a read operation or to step 4 if it is a write operation; otherwise, go to step 5;
Step 3: if the cache line corresponding to the instruction in the current computing core is in the invalid state, send a read request to the bus to read the data, and set the cache line to the shared or exclusive state after the read completes;
Step 4: if the corresponding cache line in the current computing core is in the exclusive state, modify the data in the cache line and set the line to the modified state; if it is in the shared state, send an invalidate message to the other computing cores, set the line to the modified state after they acknowledge, and then perform the write operation;
Step 5: if execution of the executable file is not complete, return to step 2; otherwise, end the flow.
Further, in step 1, the first x86 instruction sequence related to the x86 system call is determined by instruction stream analysis as follows: after the system call number of a shared-memory-related x86 system call is identified, the instructions from the one containing the system call number up to the soft interrupt trigger instruction form the first x86 instruction sequence.
Further, in step 1, the first ARM instruction sequence obtained by converting the first x86 instruction sequence is analyzed to determine the memory shared among processes and form the first shared memory list, as follows:
If the first ARM instruction sequence contains a system call that creates or obtains a shared memory segment, an ARM instruction A that saves the shared memory key value is appended after it; if it contains a system call that attaches the shared memory segment to the address space of the calling process, an ARM instruction B that saves the shared memory address is appended after it. The process ID of the current process, the shared memory key value and the shared memory address are then obtained; the shared memory key value, the shared memory address and the ID of the process that created the shared memory form a shared memory list, which is used as the first shared memory list.
Further, each process ID in the shared memory list is traversed as a target process: the page table entries of all shared memory pages of the target process are obtained and recorded as original page table entries, the mappings from the shared memory pages to physical memory pages are released, and the information of the original page table entries is then written back into the page table of the target process. When a page fault is triggered at an address belonging to the shared memory list, the corresponding shared memory address is stored in the first shared memory list.
Further, the locking instruction sequence in step 2 consists of an instruction that sets a lock variable, the atomic load-exclusive-byte operation LDREXB and the atomic store-exclusive-byte operation STREXB.
Further, the unlocking instruction sequence in step 2 consists of an instruction that modifies the lock variable.
Further, the second ARM instruction sequence in step 2 consists of ARM instructions with the same function as the current instruction to be converted, together with other memory-access-related instructions.
Further, in step 3, the read request sent to the bus reads the data from the cache line of another computing core or from memory.
Further, the x86 system calls in step 1 include shmget, shmat, shmdt and shmctl.
Advantageous effects
While the executable file is pre-executed on the ARM many-core system by dynamic instruction conversion, instruction stream analysis determines the memory shared among processes and forms the first shared memory list. The executable file is then loaded and executed again; each memory access instruction to be converted whose memory address is in the first shared memory list is converted into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence, and conversion execution is completed based on a cache coherence protocol. As a result, memory access conflicts of x86 programs on the ARM many-core system are significantly reduced, providing strong support for high-performance dynamic instruction conversion.
Drawings
Fig. 1 is a flow chart of an instruction conversion memory conflict optimization method based on instruction stream feature recognition.
Detailed Description
The present invention will be described in detail below with reference to examples of embodiments shown in the drawings.
While an executable file is pre-executed on an ARM many-core system by dynamic instruction conversion, instruction stream analysis is used to determine the memory shared among processes and form a first shared memory list. The executable file is then loaded again, each memory access instruction to be converted whose memory address is in the first shared memory list is converted into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence, and conversion execution of the executable file is completed based on a cache coherence protocol.
The invention provides an instruction conversion memory conflict optimization method based on instruction stream feature recognition which, as shown in Fig. 1, specifically comprises the following steps:
Step 1: pre-execute the executable file on the ARM many-core system by dynamic instruction conversion. Through instruction stream analysis, identify the system call number of a shared-memory-related x86 system call, and form the first x86 instruction sequence from the instruction containing that system call number up to the soft interrupt trigger instruction. Convert the first x86 instruction sequence into the first ARM instruction sequence. If the first ARM instruction sequence contains a system call that creates or obtains a shared memory segment, append an ARM instruction A that saves the shared memory key value; if it contains a system call that attaches the shared memory segment to the address space of the calling process, append an ARM instruction B that saves the shared memory address. Obtain the process ID of the current process, the shared memory key value and the shared memory address, and save the key value, the address and the ID of the process that created the shared memory in a shared memory list.
The x86 architecture provides several system calls for shared memory operations, including shmget, shmat, shmdt and shmctl: shmget creates or obtains a shared memory segment, shmat attaches a segment to the address space of the calling process, shmdt detaches a segment from the address space of the calling process, and shmctl performs control operations on a segment. The shared-memory system calls of the ARM and x86 architectures generally correspond one to one; for example, the ARM counterpart of the x86 shmget call is also shmget.
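A minimal C sketch of the classification described above, under stated assumptions: the translator has already seen the guest system call number and only needs to decide whether to append instruction A (after shmget) or instruction B (after shmat). The numeric values are hypothetical placeholders, not real x86 or ARM system call numbers, and the names RecordAction, is_shm_syscall and record_action_for are illustrative only.

```c
#include <assert.h>

/* Hypothetical placeholder numbers, for illustration only; a real translator
   takes the values from the guest (x86) system call table. */
enum { X86_SHMGET = 1001, X86_SHMAT = 1002, X86_SHMDT = 1003, X86_SHMCTL = 1004 };

/* Bookkeeping instruction to append after the translated ARM call sequence. */
typedef enum {
    RECORD_NONE,    /* nothing to record                                 */
    RECORD_KEY,     /* instruction A: save the value returned by shmget  */
    RECORD_ADDRESS  /* instruction B: save the address returned by shmat */
} RecordAction;

/* Nonzero if the guest system call number belongs to the shared-memory family. */
int is_shm_syscall(long nr) {
    return nr == X86_SHMGET || nr == X86_SHMAT ||
           nr == X86_SHMDT  || nr == X86_SHMCTL;
}

/* Choose the bookkeeping instruction for a guest system call number. */
RecordAction record_action_for(long nr) {
    switch (nr) {
    case X86_SHMGET: return RECORD_KEY;      /* create or obtain a segment */
    case X86_SHMAT:  return RECORD_ADDRESS;  /* attach it to the caller    */
    default:         return RECORD_NONE;     /* shmdt, shmctl, all others  */
    }
}
```

shmdt and shmctl are still recognized as shared-memory calls by is_shm_syscall, but in this sketch they add nothing to the list.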
Step 2: traverse each process ID in the shared memory list as a target process; obtain the page table entries of all shared memory pages of the target process and record them as original page table entries; release the mappings from all these shared memory pages to physical memory pages, so that no process can use the shared memory pages; then store the information of the original page table entries back into the page table of the target process.
Each process has its own independent virtual address space, and the page table of the process is used to translate the virtual address used by the process into the actual physical memory address.
Step 3: when a page fault is triggered and the faulting address belongs to the shared memory list, store the shared memory address corresponding to the faulting memory page in the first shared memory list, and then restore the mapping between the faulting memory page and its physical memory page. This completes the pre-execution of the executable file.
By pre-executing the executable file on the ARM many-core system by dynamic instruction conversion, the invention obtains in advance the instruction sequence related to the system call, namely the first x86 instruction sequence, and converts it into the first ARM instruction sequence using an existing dynamic instruction conversion method. Depending on which system call the first ARM instruction sequence contains, different ARM instructions are appended to save information about the shared memory, which gives a preliminary identification of the shared memory involved. Then, by modifying the page table entries of the shared memory and the page table of the process that created it so as to set up page fault trigger conditions, the memory actually shared among multiple processes is determined and forms the first shared memory list.
Step 4: load and execute the executable file on the ARM many-core system by dynamic instruction conversion and obtain the current instruction to be converted. If it is a memory access instruction and the memory address involved lies in the first shared memory list, go to step 5 for a read operation or to step 6 for a write operation; otherwise, go to step 7.
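The dispatch in step 4 can be sketched in C as follows. GuestInsn, Emit, dispatch_step and plan_translation are hypothetical names; the sketch assumes the translator has already decoded the instruction and looked up its address in the first shared memory list, and it only shows which step handles the instruction and which host pieces are emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a decoded guest instruction. */
typedef struct {
    bool is_mem_access;  /* load or store?                                */
    bool is_write;       /* store? (meaningful only for memory accesses)  */
    bool addr_in_list;   /* address inside the first shared memory list?  */
} GuestInsn;

/* Pieces of host code the translator emits, in order. */
typedef enum { EMIT_LOCK, EMIT_ACCESS, EMIT_UNLOCK, EMIT_PLAIN } Emit;

/* Step of the detailed description that handles the instruction:
   5 = shared read, 6 = shared write, 7 = everything else. */
int dispatch_step(GuestInsn g) {
    if (!g.is_mem_access || !g.addr_in_list)
        return 7;
    return g.is_write ? 6 : 5;
}

/* Fill out[] with the emission plan and return its length.  Shared accesses
   are wrapped as lock / second ARM instruction sequence / unlock. */
size_t plan_translation(GuestInsn g, Emit out[3]) {
    if (dispatch_step(g) == 7) {
        out[0] = EMIT_PLAIN;   /* ordinary translation */
        return 1;
    }
    out[0] = EMIT_LOCK;
    out[1] = EMIT_ACCESS;
    out[2] = EMIT_UNLOCK;
    return 3;
}
```

Reads and writes to shared memory receive the same lock/access/unlock wrapping; they differ only in the cache coherence path taken in steps 5 and 6.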
Step 5: convert the current instruction into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence. Meanwhile, if the cache line corresponding to the instruction in the current computing core is in the invalid state, the current computing core sends a read request to the bus to read the data and, once the read completes, sets the cache line holding the data to the shared or exclusive state; if the cache line is in the shared or exclusive state, the current computing core reads the data directly from the cache line. Then go to step 7.
The locking instruction sequence consists of an instruction that sets a lock variable, the atomic operation LDREXB (load exclusive byte) and the atomic operation STREXB (store exclusive byte); the unlocking instruction sequence consists of an instruction that modifies the lock variable. When the current computing core sends a read request to the bus, the data is read from the cache line of another computing core or from memory. The second ARM instruction sequence consists of ARM instructions with the same function as the current instruction to be converted, together with other memory-access-related instructions.
For example, computing core A executes a read instruction that accesses data whose cache line is in the Invalid state. If the data is in the cache of computing core B in the Modified state, core B writes the data back to main memory and then sends it to core A, and the cache lines of both core B and core A change to the Shared state.
Likewise, if the cache line of computing core A is in the Shared state, a read instruction that accesses its data reads the cache line directly, and the line stays in the Shared state.
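The read path of step 5 can be modelled in C as a small MESI simulation. This is a textbook MESI sketch, not the hardware or the patent's implementation; the Line type, NCORES and the write-back counter are assumptions made for illustration, and data values are not modelled.

```c
#include <assert.h>

#define NCORES 4  /* illustrative core count */

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } Mesi;

/* One cache line as seen by every core, plus a count of write-backs. */
typedef struct {
    Mesi state[NCORES];
    int  writebacks;
} Line;

/* Core `core` reads the line.  A hit (S, E or M) needs no bus traffic.
   On a miss, a core holding the line Modified writes it back first, every
   other valid copy becomes Shared, and the reader ends up Shared if any
   other core had a copy, otherwise Exclusive. */
void mesi_read(Line *l, int core) {
    assert(core >= 0 && core < NCORES);
    if (l->state[core] != INVALID)
        return;                          /* cache hit: state unchanged */
    int others = 0;
    for (int c = 0; c < NCORES; c++) {
        if (c == core || l->state[c] == INVALID)
            continue;
        if (l->state[c] == MODIFIED)
            l->writebacks++;             /* flush dirty data to main memory */
        l->state[c] = SHARED;
        others++;
    }
    l->state[core] = others ? SHARED : EXCLUSIVE;
}
```

For the example above, starting from {Invalid, Modified, Invalid, Invalid} and letting core 0 read leaves cores 0 and 1 in the Shared state after one write-back.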
Step 6: convert the current instruction into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence. The write then depends on the state of the corresponding cache line in the current computing core: if it is exclusive, the core modifies the data in the cache line directly and sets the line to the modified state; if it is shared, the core sends an invalidate message to the other computing cores and, after they acknowledge, sets the line to the modified state and performs the write; if it is invalid, the core sends a read-exclusive request to the bus and, after obtaining the data, sets the line to the modified state and performs the write; if it is modified, the core performs the write directly.
Specifically, when the cache line is in the Exclusive state, the data exists only in the cache of the current computing core and is consistent with main memory, so the core can modify the data in the cache line directly and change the line to the Modified state without communicating with other computing cores. For example, if the cache line of computing core A is in the Exclusive state, executing a write instruction modifies the cache line data and changes the line to the Modified state.
When a computing core writes to a cache line in the Shared state, it must first send an Invalidate message to the other computing cores, telling them to invalidate their copies of the data. On receiving the message, each of the other cores sets the corresponding cache line in its own cache to Invalid. Once all of them have acknowledged, the core initiating the write changes its own cache line to the Modified state and performs the write. For example, the caches of computing cores A, B and C all hold copies of some data in the Shared state. When core A executes a write instruction, it sends an Invalidate message to cores B and C; they set their corresponding cache lines to Invalid and reply with acknowledgements; after receiving the acknowledgements, core A changes its own cache line to Modified and performs the write.
When the cache line is in the Invalid state, the computing core must first obtain a valid copy of the data, so it issues a Read-Exclusive request to the bus, indicating that it wants to read the data and obtain exclusive write permission for it. If another computing core holds a valid copy in its cache, that core sends the data to the requesting core and sets its own cache line to Invalid; if no other core holds a valid copy, the data is read from main memory. After obtaining the data, the requesting core sets its cache line to Modified and performs the write. For example, the cache line of computing core A is Invalid, and core A issues a Read-Exclusive request when executing a write instruction. If core B holds a valid copy of the data, core B sends the data to core A and sets its own cache line to Invalid; after receiving the data, core A changes its cache line to Modified and performs the write.
When the cache line is in the Modified state, the computing core holds the only valid copy of the data, which has already been modified, so it can modify the data in the cache line directly and the line remains Modified. For example, if the cache line of computing core A is Modified, executing a write instruction operates on the cache line directly and the line stays Modified.
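The four write cases of step 6 can likewise be sketched in C. This is again a simplified textbook MESI model under assumed names (Line, mesi_write and the two message counters); it tracks cache line states and bus messages but not the data itself.

```c
#include <assert.h>

#define NCORES 4  /* illustrative core count */

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } Mesi;

/* One cache line as seen by every core, plus counters of bus messages. */
typedef struct {
    Mesi state[NCORES];
    int  invalidates;     /* Invalidate broadcasts (Shared -> Modified)    */
    int  read_exclusive;  /* Read-Exclusive requests (Invalid -> Modified) */
} Line;

/* Core `core` writes the line; every path ends in Modified. */
void mesi_write(Line *l, int core) {
    assert(core >= 0 && core < NCORES);
    switch (l->state[core]) {
    case MODIFIED:   /* only valid copy, already dirty: write in place      */
    case EXCLUSIVE:  /* only valid copy, clean: upgrade without bus traffic */
        break;
    case SHARED:     /* other copies exist: broadcast Invalidate first      */
        l->invalidates++;
        break;
    case INVALID:    /* no copy: fetch data and ownership (Read-Exclusive)  */
        l->read_exclusive++;
        break;
    }
    for (int c = 0; c < NCORES; c++)   /* other copies (if any) become Invalid */
        if (c != core)
            l->state[c] = INVALID;
    l->state[core] = MODIFIED;
}
```

In the Exclusive and Modified cases no other core holds a valid copy, so the invalidation loop changes nothing, which matches the "no communication required" cases described above.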
Step 7: if execution of the executable file is not complete, return to step 4; otherwise, end the flow.
Examples
The instruction conversion memory conflict optimization method based on instruction stream feature recognition provided by this embodiment achieves low-memory-conflict conversion execution of x86 executable files on an ARM many-core system by modifying the memory access code generation module of the dynamic instruction conversion engine, and comprises the following steps:
S1. Detect the shared memory regions of the target program based on instruction stream features. The specific steps are as follows:
S1.1. Identify memory-allocation-related system calls during dynamic instruction conversion. On the x86 architecture, the assembly-level invocation of the underlying system call that creates shared memory is as follows:
mov eax, <shmget syscall number>   ; put the system call number into eax
mov ebx, <key>                     ; shared memory key value, identifies the segment
mov ecx, <size>                    ; size of the shared memory segment
mov edx, <flags>                   ; flag bits, e.g. access permissions
int 0x80                           ; soft interrupt: enter kernel mode and execute the system call
After conversion to the ARM architecture, the corresponding ARM shmget system call is:
mov r7, #<shmget syscall number for ARM>   @ ARM shmget system call number into r7
mov r0, #<key>                             @ shared memory key into r0
mov r1, #<size>                            @ shared memory size into r1
mov r2, #<flags>                           @ flag bits into r2
svc #0                                     @ trigger the system call (ARM uses svc instead of the x86 int instruction)
The following code is inserted after the translated ARM instructions to save the shared memory identifier, where r0 holds the return value of svc #0:
ldr r3, =shared_memory_id   @ address of the variable that receives the ID (r3 assumed free as scratch)
str r0, [r3]                @ save the shmget return value; later code can read it from shared_memory_id for analysis
This establishes a mapping between the shared memory ID and the shared memory size: the ID is stored at the address shared_memory_id, and the size is held in register r1.
S1.2. Track memory-mapping-related operations. On the x86 architecture, once the shared memory segment identifier has been obtained, the segment must be mapped into the address space of the process, usually by calling shmat. The assembly-level invocation is as follows:
mov eax, <shmat syscall number>   ; system call number into eax
mov ebx, <shmid>                  ; previously obtained shared memory segment identifier into ebx
mov ecx, <addr>                   ; desired mapping address into ecx
mov edx, <flags>                  ; mapping flag bits into edx
int 0x80                          ; execute the system call to perform the memory mapping
The converted ARM instructions are as follows:
mov r7, #<shmat syscall number for ARM>   @ ARM shmat system call number into r7
mov r0, #<shmid>                          @ shared memory segment identifier into r0
mov r1, #<addr>                           @ desired mapping address into r1
mov r2, #<flags>                          @ mapping flag bits into r2
svc #0                                    @ trigger the system call to complete the mapping
The following code is inserted after the translated ARM instructions to save the shared memory address, where r0 holds the return value of svc #0:
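ldr r3, =shared_memory_address   @ address of the variable that receives the mapping address (r3 assumed free as scratch)
str r0, [r3]                     @ save the shmat return value; later code reads it from shared_memory_address for analysis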
Combined with the mapping between shared memory ID and size established in the previous step, this establishes a mapping between the shared memory address and the shared memory size.
S1.3. Build a shared memory list, denoted sharedMemoryList, from the established mappings; each element of the list contains the start address and length of a shared memory region and the ID of the process that created it.
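A minimal C sketch of sharedMemoryList and the address range check used later in S3.2, assuming a fixed-capacity array (MAX_SEGMENTS is an arbitrary illustrative limit). Each entry would be filled by joining the shmget result (ID and size, from r1) with the shmat result (address) for the same ID; ShmEntry, ShmList and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SEGMENTS 64  /* arbitrary illustrative capacity */

/* One sharedMemoryList element. */
typedef struct {
    uintptr_t start;        /* start address returned by shmat  */
    size_t    length;       /* size passed to shmget (r1)       */
    int       creator_pid;  /* process that created the segment */
} ShmEntry;

typedef struct {
    ShmEntry entries[MAX_SEGMENTS];
    size_t   count;
} ShmList;

/* Append a segment; returns 0 on success, -1 if the list is full. */
int shm_list_add(ShmList *list, uintptr_t start, size_t length, int pid) {
    assert(length > 0);
    if (list->count == MAX_SEGMENTS)
        return -1;
    list->entries[list->count++] = (ShmEntry){ start, length, pid };
    return 0;
}

/* Return the entry whose [start, start + length) contains addr, or NULL.
   Subtracting first avoids overflow when start + length would wrap. */
const ShmEntry *shm_list_find(const ShmList *list, uintptr_t addr) {
    for (size_t i = 0; i < list->count; i++) {
        const ShmEntry *e = &list->entries[i];
        if (addr >= e->start && addr - e->start < e->length)
            return e;
    }
    return NULL;
}
```

A linear scan is enough for a handful of segments; a translator tracking many regions might keep the entries sorted and use binary search instead.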
Existing shared memory detection methods mainly rely on recognizing specific system call function names to detect shared memory creation. In a dynamic instruction conversion scenario, however, function names may be transformed, obfuscated or optimized away, so those methods fail. The invention instead provides a recognition method based on instruction stream features: by analyzing specific patterns and register operations in the instruction stream, it recognizes shared memory creation at the level of the lowest-level system call and can therefore determine more accurately whether such an operation exists.
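The pattern match described above can be sketched in C over a hypothetical, already-decoded x86 instruction array. The Insn and Opcode types and the placeholder system call numbers are assumptions; a real dynamic translator would run this over its own intermediate representation. The scan starts at a mov into eax of a shared-memory system call number and ends at the next int 0x80, abandoning the candidate if eax is overwritten first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified decoded-instruction record. */
typedef enum { OP_MOV_EAX_IMM, OP_MOV_OTHER, OP_INT, OP_OTHER } Opcode;

typedef struct {
    Opcode op;
    long   imm;  /* syscall number for OP_MOV_EAX_IMM, vector for OP_INT */
} Insn;

/* Placeholder guest syscall numbers (1001..1004), for illustration only. */
static int is_shm_nr(long nr) { return nr >= 1001 && nr <= 1004; }

/* Search insns[from..n) for "mov eax, <shm syscall number>" ... "int 0x80".
   On success store the inclusive bounds in *start / *end and return 1. */
int find_shm_syscall_seq(const Insn *insns, size_t n, size_t from,
                         size_t *start, size_t *end) {
    for (size_t i = from; i < n; i++) {
        if (insns[i].op != OP_MOV_EAX_IMM || !is_shm_nr(insns[i].imm))
            continue;
        for (size_t j = i + 1; j < n; j++) {
            if (insns[j].op == OP_INT && insns[j].imm == 0x80) {
                *start = i;
                *end = j;
                return 1;
            }
            if (insns[j].op == OP_MOV_EAX_IMM)
                break;   /* eax overwritten before the trap: not this call */
        }
    }
    return 0;
}
```

Calling it again with from set to one past the previous end finds the next shared-memory system call in the same block.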
S2. Track the shared memory in sharedMemoryList to determine which memory is actually accessed by other processes. S1 can only determine that a region is used as shared memory, not whether other processes access it; since memory access conflicts can occur only in memory that other processes access, this must be determined as well.
The method comprises the following specific steps:
S2.1. Traverse all processes in sharedMemoryList; for each process, record its ID as processID, obtain the page table entries of its shared memory pages, and copy them as originalPageTable.
S2.2. For each shared memory page recorded in sharedMemoryList, release the mapping from the shared memory page to its physical memory page, which removes the mapping for all processes; subsequent accesses then trigger page faults, which are used to track and record the accessing processes. Releasing the page mapping requires holding mmap_lock, specifically:
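pte_t *pte = get_pte(target_vma->vm_mm, target_addr);  /* look up the PTE; get_pte() stands for a PTE-lookup helper such as get_locked_pte(), not a mainline kernel function */
pte_clear(target_vma->vm_mm, target_addr, pte);         /* clear the PTE, removing the page mapping */
flush_tlb_page(target_vma, target_addr);                /* drop the stale TLB entry for this address */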
S2.3. Traverse all processes in sharedMemoryList; for each process with ID processID, copy originalPageTable back into the page table of that process, i.e. restore the page table released in the previous step. As a result, page faults are triggered only when other processes access the shared memory in sharedMemoryList.
S2.4. When a process accesses a page and triggers a page fault, if the faulting address is in sharedMemoryList, the memory page is added to usedSharedMemoryList, and the mapping is restored to handle the current fault so that the process can continue executing normally. For example:
pte_t entry = orig_pte;                                       /* original PTE saved in originalPageTable */
if (vmf->flags & FAULT_FLAG_WRITE)                            /* write fault: mark the page dirty */
    entry = pte_mkdirty(entry);
entry = pte_mkyoung(entry);                                   /* mark the page accessed */
set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, entry);  /* reinstall the original PTE */
struct page *page = pte_page(entry);                          /* physical page behind the PTE */
get_page(page);                                               /* raise the reference count so the page is not freed */
flush_tlb_page(vmf->vma, vmf->address);                       /* flush the TLB (set_pte_at may already handle this on some architectures) */
return VM_FAULT_NOPAGE;                                       /* tell the kernel the mapping has been restored */
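The effect of S2.2 to S2.4 can be simulated in user-space C, which makes the detection logic easy to check without kernel code. This is an illustrative model only: SimPage, arm_tracking and access_page are hypothetical names, a boolean per process stands in for a present PTE, and a missing mapping stands in for a page fault.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 8  /* illustrative process limit */

/* Simulated state of one page recorded in sharedMemoryList. */
typedef struct {
    int  creator;            /* index of the process that created it   */
    bool mapped[MAX_PROCS];  /* stands in for "this process has a PTE" */
    bool used_by_others;     /* page belongs in usedSharedMemoryList   */
} SimPage;

/* S2.2 then S2.3: clear the mapping for every process, then restore it for
   the creating process only, so that only other processes will fault. */
void arm_tracking(SimPage *p) {
    assert(p->creator >= 0 && p->creator < MAX_PROCS);
    for (int pid = 0; pid < MAX_PROCS; pid++)
        p->mapped[pid] = false;
    p->mapped[p->creator] = true;
}

/* An access by process pid.  A missing mapping models a page fault; the
   handler (S2.4) records the page as genuinely shared and restores the
   mapping so that the process can continue. */
void access_page(SimPage *p, int pid) {
    assert(pid >= 0 && pid < MAX_PROCS);
    if (!p->mapped[pid]) {
        p->used_by_others = true;  /* add to usedSharedMemoryList */
        p->mapped[pid] = true;     /* reinstall the original PTE  */
    }
}
```

Accesses by the creating process never fault, so only an access from another process marks the page as used_by_others, which is the distinction S2 needs.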
S3. For the shared memory determined to be accessed by multiple processes, i.e. the shared memory recorded in usedSharedMemoryList, a lock-based conflict avoidance method is used to guarantee exclusive access. The specific steps are as follows:
S3.1. Implement and initialize the lock. In the generated ARM memory access code, a dedicated memory location, manipulated with atomic operations, serves as the lock variable. First, one byte of memory is allocated as the lock flag and initialized to 0, meaning unlocked. Assuming the address of the lock is held in r0, the ARM assembly code that initializes the lock is:
LDR  r1, =0
STRB r1, [r0]        @ lock byte at [r0] = 0 (unlocked)
S3.2. Check whether the access address falls within the range of any shared memory region in the established shared memory list; if so, perform the following steps:
Locking operation. When an ARM computing core needs to access the shared memory region, it tries to acquire the lock using the atomic operations LDREXB (load exclusive byte) and STREXB (store exclusive byte). If LDREXB loads the value 0 and STREXB successfully stores 1, the core has acquired the lock; otherwise it retries. The ARM assembly code for the locking operation is:
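lock_loop:
    LDREXB r2, [r0]        // exclusive load of the lock byte
    CMP    r2, #0          // 0 means unlocked
    BNE    lock_loop       // lock held by another core: retry
    MOV    r3, #1
    STREXB r4, r3, [r0]    // try to store 1; r4 = 0 if the exclusive store succeeded
    CMP    r4, #0
    BNE    lock_loop       // exclusive store failed: retry
    DMB                    // barrier so shared-data accesses are not reordered before the lock is taken
    LDR    r5, [r6]        // critical section; assume the shared data address is held in r6
    ADD    r5, r5, #1
    STR    r5, [r6]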
Unlocking operation. When the operation is complete, the lock is released by storing 0 to the lock variable, and a flag is then set to notify the other cores. Assuming the address of the lock is held in r0, the ARM assembly code for the unlock operation is:
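DMB                    // barrier so critical-section writes are visible before the lock is released
LDR  r1, =0
STRB r1, [r0]          // store 0: lock released
LDR  r8, =1
STRB r8, [r7]          // assume a memory flag used as a semaphore is at the address held in r7; setting it notifies waiting cores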
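The lock protocol of S3 can also be expressed in C with C11 atomics. The protocol uses one lock byte initialized to 0, acquires it with an exclusive load/store retry loop, releases it with a store of 0, and then sets a notification flag. On 32-bit ARMv7 targets, compilers commonly lower a byte compare-and-swap like the one below to an LDREXB/STREXB retry loop. The acquire and release orderings supply the barriers. This is a sketch rather than the patent's implementation. The names byte_lock, byte_lock_acquire, g_notify_flag and the threaded counter demo are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* One-byte lock flag: 0 = unlocked, 1 = locked (as initialized in S3.1). */
typedef struct {
    atomic_uchar flag;
} byte_lock;

static void byte_lock_init(byte_lock *l)
{
    atomic_init(&l->flag, 0);
}

/* Spin until the flag changes from 0 to 1 (the LDREXB / CMP / STREXB loop). */
static void byte_lock_acquire(byte_lock *l)
{
    for (;;) {
        unsigned char expected = 0;
        /* A weak CAS may fail spuriously, like STREXB; the loop simply retries. */
        if (atomic_compare_exchange_weak_explicit(&l->flag, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
        /* Lock held elsewhere: wait with plain loads before retrying the CAS. */
        while (atomic_load_explicit(&l->flag, memory_order_relaxed) != 0)
            ;
    }
}

/* Release: store 0 with release ordering so the critical section is visible first. */
static void byte_lock_release(byte_lock *l)
{
    atomic_store_explicit(&l->flag, 0, memory_order_release);
}

/* Demo state (hypothetical): a counter guarded by the lock and a notification flag. */
static byte_lock  g_lock;
static long       g_counter;         /* plays the role of the data at [r6] */
static atomic_int g_notify_flag;     /* plays the role of the flag at [r7] */

static void *worker(void *arg)
{
    int iterations = *(const int *)arg;
    for (int i = 0; i < iterations; i++) {
        byte_lock_acquire(&g_lock);
        g_counter++;                                       /* LDR / ADD / STR */
        byte_lock_release(&g_lock);
        atomic_store_explicit(&g_notify_flag, 1, memory_order_release);
    }
    return NULL;
}

/* Run n_threads workers and return the final counter value. */
static long run_counter_demo(int n_threads, int iterations)
{
    pthread_t tids[16];
    assert(n_threads > 0 && n_threads <= 16);
    g_counter = 0;
    byte_lock_init(&g_lock);
    atomic_store(&g_notify_flag, 0);
    for (int i = 0; i < n_threads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n_threads; i++)
        pthread_join(tids[i], NULL);
    return g_counter;
}
```

In the acquire loop, a waiting core only reads the lock byte while the lock is held. It does not keep attempting exclusive stores. This is similar in spirit to branching back to LDREXB in the assembly version.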
Existing lock mechanisms are often built on library functions of a high-level language. Implementing the lock with ARM assembly atomic operations avoids the extra overhead of the high-level language, allows the lock granularity to be controlled more precisely, and provides concurrency control of memory access at a lower level. The mechanism is therefore suited to fine-grained access control of shared memory in an ARM many-core environment and improves the synchronization and safety of memory access. In addition, combining the unlock operation with a subsequent notification mechanism lets the cores be coordinated more flexibly after unlocking, avoids blind waiting by other cores, and improves the parallelism of the overall system.
S4, manage cache state with a cache coherence protocol to avoid inconsistent memory accesses. Caches hold copies of shared memory. Each computing core maintains its own private cache, so several private caches may hold copies of the same shared memory. The specific steps are as follows:
S4.1, use ARM's cache coherence protocol, such as MESI, to reduce access conflicts. ARM many-core systems typically follow the MESI protocol to maintain cache coherence, and the instructions produced by the conversion engine take full advantage of it. For example, during a memory access the operation is chosen according to the MESI state of the line (Modified, Exclusive, Shared or Invalid).
S4.2, when a core reads data and the cache line state is Invalid, it issues a read request, obtains the data from memory or from another core's cache, and updates the cache state to Shared or Exclusive. Assuming the data address is held in r0, an example of ARM assembly code for the read operation is:
    LDR    r1, [r0]
    MRC    p15, 0, r2, c0, c0, 0   // read the cache state (hardware-specific placeholder encoding)
    CMP    r2, #0                  // 0 stands for Invalid: initiate a read request
    BEQ    read_request
    B      use_data                // Shared or Exclusive: use the data directly
read_request:                      // fetch the data from memory or from another core's cache
    LDR    r1, [r0]
    MCR    p15, 0, r3, c0, c0, 0   // update the cache state to Shared or Exclusive; the coprocessor instruction is hardware-specific
    LDR    r5, [r4]                // assume the snoop address is held in r4; modified data may need to be re-read
    CMP    r5, #1                  // assume 1 means the data is being modified
    BEQ    reread
    B      use_data
reread:
    LDR    r1, [r0]
    MCR    p15, 0, r6, c0, c0, 0   // update the cache state again
use_data:
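The read path of S4.2 and the write handling in step 4 of claim 1 can be modelled as a small software simulation. The model assumes four cores sharing one cache line. A read miss ends in Exclusive when no other core holds a copy and in Shared otherwise, which is the usual MESI rule. In hardware, the coherence logic performs these transitions itself. This model only makes the state changes concrete, and the names mesi_read and mesi_write are hypothetical. A write from Invalid is handled here by fetching the line first. This read-for-ownership step is an assumption, because the claims only describe the Exclusive and Shared cases.

```c
#include <assert.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

#define NCORES 4                      /* assumption: four cores share one line */

/* State of one cache line across the private caches of all cores. */
struct line {
    enum mesi state[NCORES];
    int       memory_value;           /* value held in main memory */
    int       cached_value[NCORES];   /* value held in each private cache */
};

/* Read by `core` (S4.2, claim 1 step 3). Returns the value read. */
static int mesi_read(struct line *ln, int core)
{
    assert(core >= 0 && core < NCORES);
    if (ln->state[core] != INVALID)
        return ln->cached_value[core];         /* hit in S, E or M: use the data directly */

    /* Miss: read request on the bus. A Modified copy elsewhere is written back first. */
    int others = 0;
    for (int c = 0; c < NCORES; c++) {
        if (c == core || ln->state[c] == INVALID)
            continue;
        if (ln->state[c] == MODIFIED)
            ln->memory_value = ln->cached_value[c];
        ln->state[c] = SHARED;                 /* E or M holders drop to Shared */
        others++;
    }
    ln->cached_value[core] = ln->memory_value;
    ln->state[core] = others ? SHARED : EXCLUSIVE;
    return ln->cached_value[core];
}

/* Write by `core` (claim 1 step 4). */
static void mesi_write(struct line *ln, int core, int value)
{
    assert(core >= 0 && core < NCORES);
    if (ln->state[core] == INVALID)
        (void)mesi_read(ln, core);             /* assumption: fetch the line first */
    if (ln->state[core] == SHARED)
        for (int c = 0; c < NCORES; c++)       /* invalidate every other copy */
            if (c != core)
                ln->state[c] = INVALID;
    ln->cached_value[core] = value;            /* modify the local copy */
    ln->state[core] = MODIFIED;
}
```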
Existing program translation may not fully account for the role of cache coherence protocols in memory access conflicts. The invention uses ARM's MESI protocol at the assembly level to actively manage cache state in the multi-core environment. This prevents multiple cores from accessing the same data inconsistently, reduces memory access conflicts caused by cache incoherence, and improves the consistency and performance of data access. It also adds a cache snooping step. When data is read, the current cache state is used to check whether another core is modifying the data, which further strengthens data consistency and avoids conflicts caused by delayed cache updates. This is an innovative extension of the original approach.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. An instruction conversion memory conflict optimization method based on instruction stream feature recognition, characterized by comprising the following steps:
Step 1, pre-execute the executable file on the ARM many-core system by dynamic instruction conversion; use instruction stream analysis to determine a first x86 instruction sequence related to x86 system calls; analyze the first ARM instruction sequence obtained by converting the first x86 instruction sequence; and determine the shared memory between processes to form a first shared memory list;
Step 2, load and execute the executable file by dynamic instruction conversion. If the current instruction to be converted is a memory access instruction and its memory address is in the first shared memory list, convert it into an instruction sequence consisting of a locking instruction sequence, a second ARM instruction sequence and an unlocking instruction sequence, then execute step 3 if the instruction is a read operation or step 4 if it is a write operation; otherwise, execute step 5;
Step 3, if the cache line corresponding to the current computing core is in the Invalid state, send a read request to the bus to read the data, and set the cache line to the Shared or Exclusive state after the read completes;
Step 4, if the cache line corresponding to the current computing core is in the Exclusive state, modify the data in the cache line and set the line to the Modified state; if it is in the Shared state, send invalidation messages to the other computing cores, set the line to the Modified state after their acknowledgements are received, and then perform the write operation;
Step 5, if execution of the executable file is not complete, return to step 2; otherwise, end the flow.
2. The instruction conversion memory conflict optimization method according to claim 1, wherein, in step 1, the first x86 instruction sequence related to x86 system calls is determined by instruction stream analysis as follows: after the system call number of a shared-memory-related x86 system call is identified, the instructions from the one containing the system call number up to the software interrupt trigger instruction form the first x86 instruction sequence.
3. The instruction conversion memory conflict optimization method according to claim 1, wherein, in step 1, the shared memory between processes is determined, and the first shared memory list is formed, by analyzing the first ARM instruction sequence obtained by converting the first x86 instruction sequence, as follows:
If the first ARM instruction sequence contains a system call that creates or obtains a shared memory segment, an ARM instruction A that saves the shared memory key is appended after the first ARM instruction sequence. If it contains a system call that attaches a shared memory segment to the address space of the calling process, an ARM instruction B that saves the shared memory address is appended after it. The process ID of the current process, the shared memory key and the shared memory address are obtained. The shared memory key, the shared memory address and the ID of the process that created the shared memory form a shared memory list, which is used as the first shared memory list.
4. The instruction conversion memory conflict optimization method according to claim 3, wherein each process ID in the shared memory list is traversed as a target process; the page table entries of all shared memory pages of the target process are obtained and recorded as original page table entries; the mapping from the shared memory pages to the physical memory pages is removed, and the original page table entry information is saved in the page table of the target process; and each shared memory address that triggers a page fault, where the faulting address belongs to the shared memory list, is stored in the first shared memory list.
5. The instruction conversion memory conflict optimization method according to claim 1, wherein the locking instruction sequence in step 2 consists of an instruction that sets the lock variable, the atomic load-exclusive-byte operation LDREXB and the atomic store-exclusive-byte operation STREXB.
6. The instruction conversion memory conflict optimization method according to claim 1, wherein the unlocking instruction sequence in step 2 consists of an instruction that modifies the lock variable.
7. The instruction conversion memory conflict optimization method according to claim 1, wherein the second ARM instruction sequence in step 2 consists of ARM instructions functionally equivalent to the current instruction to be converted, together with other memory-access-related instructions.
8. The instruction conversion memory conflict optimization method according to claim 1, wherein sending the read request to the bus in step 3 reads the data from a cache line of another computing core or from memory.
9. The instruction conversion memory conflict optimization method according to claim 1, wherein the x86 system calls in step 1 include shmget, shmat, shmdt and shmctl.
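The translator decisions in claims 1, 2, 5 to 7 and 9 can be condensed into a sketch. The sketch assumes x86-64 Linux system call numbers for the System V shared memory calls: shmget = 29, shmat = 30, shmctl = 31, shmdt = 67. A 32-bit guest would use different numbers. The emitted "instructions" are plain strings rather than machine code. The structures and functions (shm_list, record_shm_attach, translate_access) and the symbol lock_var are hypothetical. The segment size field is an added assumption, since the claims list only the key, address and process ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* x86-64 Linux syscall numbers of the shared memory calls named in claim 9. */
enum { X64_SHMGET = 29, X64_SHMAT = 30, X64_SHMCTL = 31, X64_SHMDT = 67 };

/* Step 1 / claim 2: does this syscall number mark a shared-memory sequence? */
static bool is_shm_syscall(long nr)
{
    return nr == X64_SHMGET || nr == X64_SHMAT || nr == X64_SHMCTL || nr == X64_SHMDT;
}

/* One entry of the first shared memory list (claim 3), plus an assumed size field. */
struct shm_entry { int key; uintptr_t addr; size_t size; int pid; };
struct shm_list  { struct shm_entry e[32]; size_t n; };

/* Called where instructions A and B would save the key and the shmat result. */
static void record_shm_attach(struct shm_list *l, int key, uintptr_t addr, size_t size, int pid)
{
    assert(l->n < ARRAY_LEN(l->e));
    l->e[l->n++] = (struct shm_entry){ key, addr, size, pid };
}

static bool in_shm_list(const struct shm_list *l, uintptr_t a)
{
    for (size_t i = 0; i < l->n; i++)
        if (a >= l->e[i].addr && a - l->e[i].addr < l->e[i].size)
            return true;
    return false;
}

/* Step 2: wrap a shared-memory access as lock / access / unlock, else emit it as-is.
 * Stores up to `cap` instruction strings in out[] and returns how many were stored. */
static size_t translate_access(const struct shm_list *l, uintptr_t addr,
                               const char *arm_access, const char **out, size_t cap)
{
    static const char *const lock_seq[] = {
        "LDR r0, =lock_var",            /* hypothetical symbol: set up the lock variable */
        "lock_loop: LDREXB r2, [r0]", "CMP r2, #0", "BNE lock_loop", "MOV r3, #1",
        "STREXB r4, r3, [r0]", "CMP r4, #0", "BNE lock_loop", "DMB"
    };
    static const char *const unlock_seq[] = { "DMB", "LDR r1, =0", "STRB r1, [r0]" };
    size_t n = 0;

    if (!in_shm_list(l, addr)) {        /* ordinary access: translate one-to-one */
        assert(cap >= 1);
        out[n++] = arm_access;
        return n;
    }
    assert(cap >= ARRAY_LEN(lock_seq) + 1 + ARRAY_LEN(unlock_seq));
    for (size_t i = 0; i < ARRAY_LEN(lock_seq); i++)
        out[n++] = lock_seq[i];
    out[n++] = arm_access;              /* the "second ARM instruction sequence" */
    for (size_t i = 0; i < ARRAY_LEN(unlock_seq); i++)
        out[n++] = unlock_seq[i];
    return n;
}
```

For example, an access at an address inside a recorded segment expands to the nine locking entries, the access itself and the three unlocking entries. An access elsewhere is passed through unchanged.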
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510377539.7A CN119883435B (en) | 2025-03-28 | 2025-03-28 | A memory access conflict optimization method for instruction conversion based on instruction stream feature recognition |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119883435A | 2025-04-25 |
| CN119883435B | 2025-06-20 |
Family
ID=95439893
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510377539.7A Active CN119883435B (en) | 2025-03-28 | 2025-03-28 | A memory access conflict optimization method for instruction conversion based on instruction stream feature recognition |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119883435B (en) |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5758183A (en) * | 1996-07-17 | 1998-05-26 | Digital Equipment Corporation | Method of reducing the number of overhead instructions by modifying the program to locate instructions that access shared data stored at target addresses before program execution |
| US6865645B1 (en) * | 2000-10-02 | 2005-03-08 | International Business Machines Corporation | Program store compare handling between instruction and operand caches |
| CN105094840A (en) * | 2015-08-14 | 2015-11-25 | 浪潮(北京)电子信息产业有限公司 | Atomic operation implementation method and device based on cache consistency principle |
| CN118820134A (en) * | 2024-09-20 | 2024-10-22 | 北京卡普拉科技有限公司 | Cache consistency optimization method in automatic thread-level parallelization |
| CN118860910A (en) * | 2024-06-28 | 2024-10-29 | 超聚变数字技术有限公司 | A method for accessing a shared memory space and a computing device |
| CN119440626A (en) * | 2025-01-09 | 2025-02-14 | 北京麟卓信息科技有限公司 | A prefetch instruction conversion optimization method based on memory access mode virtualization |
| CN119576412A (en) * | 2025-02-09 | 2025-03-07 | 北京麟卓信息科技有限公司 | A data synchronization optimization method for x86 instruction conversion in ARM multi-core |
| CN119690517A (en) * | 2025-02-25 | 2025-03-25 | 北京麟卓信息科技有限公司 | A reinforcement learning-based memory access optimization method for ARM multi-core instruction conversion |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119883435B (en) | 2025-06-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| KR101470713B1 (en) | Mechanisms to accelerate transactions using buffered stores | |
| US11379324B2 (en) | Persistent memory transactions with undo logging | |
| US8239633B2 (en) | Non-broadcast signature-based transactional memory | |
| KR100233207B1 (en) | Cache flush device and calculator system with the device | |
| Bobba et al. | Tokentm: Efficient execution of large transactions with hardware transactional memory | |
| TWI498733B (en) | Cache metadata for implementing bounded transactional memory | |
| JP5462883B2 (en) | Read and write monitoring attributes in transactional memory (TM) systems | |
| KR101388865B1 (en) | Performing mode switching in an unbounded transactional memory (utm) system | |
| JPS60221851A (en) | Data processor and memory access controller used therefor | |
| JPH0784851A (en) | Shared data management method | |
| CN111201518B (en) | Apparatus and method for managing capability metadata | |
| CN113220490A (en) | Transaction persistence method and system for asynchronous write-back persistent memory | |
| US7523260B2 (en) | Propagating data using mirrored lock caches | |
| US20090063783A1 (en) | Method and appartaus to trigger synchronization and validation actions upon memory access | |
| JP2829115B2 (en) | File sharing method | |
| CN119883435B (en) | A memory access conflict optimization method for instruction conversion based on instruction stream feature recognition | |
| US6892257B2 (en) | Exclusive access control to a processing resource | |
| US11176042B2 (en) | Method and apparatus for architectural cache transaction logging | |
| JP3340047B2 (en) | Multiprocessor system and duplicate tag control method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |