HK40033747A - Logging cache influxes by request to a higher-level cache
- Publication number
- HK40033747A (application number HK62020022529.5A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- cache
- logged
- line
- logging
- cache line
Description
Background
When writing code during development of a software application, developers typically spend a significant amount of time "debugging" the code to find runtime errors and other source code errors. In doing so, a developer may employ several approaches to reproduce and localize source code bugs, e.g., observing the behavior of a program under different inputs, inserting debugging code (e.g., printing variable values, tracing branches of execution, etc.), temporarily removing portions of code, and so on. Tracking down runtime errors to pinpoint code bugs can occupy a significant portion of application development time.
To assist developers in the code debugging process, many types of debugging applications ("debuggers") have been developed. These tools give developers the ability to trace the execution of computer code, visualize it, and alter it. For example, a debugger may visualize the execution of code instructions, may present code variable values at various times during code execution, may enable a developer to alter code execution paths, and/or may enable a developer to set "breakpoints" and/or "watchpoints" on code elements of interest (which, when reached during execution, cause execution of the code to be suspended), among other things.
Emerging forms of debugging applications enable "time travel", "reverse", or "historic" debugging. With "time travel" debugging, execution of a program (e.g., an executable entity such as a thread) is recorded/traced by a trace application into one or more trace files. These trace files can then be used to replay execution of the program later, for both forward and backward analysis. For example, a "time travel" debugger may enable a developer to set forward breakpoints/watchpoints (as with conventional debuggers) as well as reverse breakpoints/watchpoints.
Several considerations may be taken into account when recording trace files. Most notably, there is an inherent tradeoff between the robustness of the trace data recorded and the overhead incurred by tracing a program. These tradeoffs manifest primarily in trace file size and in performance impact on execution of the traced program. Moreover, since tracing may be accomplished with hardware assistance (or entirely in software), there may also be hardware design and other hardware cost considerations.
Disclosure of Invention
Embodiments described herein relate to mechanisms for creating bit-accurate "time travel" trace recordings using hardware assistance by a processor. These mechanisms are based on using at least two tiers or layers of processor caches to trace the effects of execution across multiple processing units. One mechanism modifies a processor's hardware and/or microcode so that when it detects an influx (i.e., a cache miss) to an inner or "lower-level" processor cache based on activity of a traced processing unit, it checks one or more outer or "upper-level" shared processor caches to determine whether the data of that influx has already been logged on behalf of another traced processing unit. Another mechanism modifies a processor's hardware and/or microcode so that one or more cache layers are configured to receive logging requests from lower cache layer(s), and to use their knowledge of logged cache lines to determine how an influx to a lower cache layer should be logged (if at all). Either mechanism may enable the influx to be logged by reference to a previous log entry, and each mechanism may be extended to "N" levels of caches. Recording trace files using either mechanism may require only modest processor modifications and, compared to prior trace recording approaches, may reduce both the performance impact of trace recording and trace file size by orders of magnitude.
A first embodiment relates to computing device(s) comprising a plurality of processing units, a plurality of N-level caches, and an (N+i)-level cache. The (N+i)-level cache is associated with two or more of the plurality of N-level caches, and is configured as a backing store for those N-level caches. In these embodiments, the computing device(s) include control logic that configures the computing device(s) to detect, based on activity of a first processing unit, an influx to a first N-level cache of the plurality of N-level caches, the influx comprising data stored at a memory location. The control logic also configures the computing device(s) to check the (N+i)-level cache to determine whether the data for the memory location has been previously logged on behalf of a second processing unit. The control logic also configures the computing device(s) to perform, based on this check, one of: (i) causing the data for the memory location to be logged on behalf of the first processing unit by reference to log data previously logged on behalf of the second processing unit (i.e., when the data for the memory location has been previously logged on behalf of the second processing unit), or (ii) causing the data for the memory location to be logged by value on behalf of the first processing unit (i.e., when the data for the memory location has not been previously logged on behalf of the second processing unit).
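The first embodiment's logging decision can be sketched in a few lines of Python. This is an illustrative model only, not the patent's implementation: the accounting structure (`upper_cache_log_index`), the trace format, and all names are assumptions standing in for the (N+i)-level cache's knowledge of previously logged lines.

```python
# Hypothetical sketch: on an influx to an N-level cache on behalf of one
# processing unit, consult the shared upper-level cache's accounting to
# decide between logging by reference and logging by value.

def log_influx(address, value, unit, upper_cache_log_index, trace):
    """upper_cache_log_index maps address -> (unit, log_entry_id) for
    data already logged on behalf of some traced processing unit."""
    prior = upper_cache_log_index.get(address)
    if prior is not None and prior[0] != unit:
        # Already logged for another unit: emit a compact reference entry.
        trace.append((unit, address, "by_ref", prior[1]))
    else:
        # Not previously logged: log the full value, and record that fact.
        entry_id = len(trace)
        trace.append((unit, address, "by_value", value))
        upper_cache_log_index[address] = (unit, entry_id)

trace = []
index = {}
log_influx(0x1000, b"\x2a" * 64, "P2", index, trace)  # first influx: by value
log_influx(0x1000, b"\x2a" * 64, "P1", index, trace)  # other unit: by reference
```

The "by reference" entries are what make this scheme compact: the second unit's trace records only a pointer to the first unit's log entry rather than repeating the 64-byte cache line value.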
A second embodiment relates to computing device(s) comprising a plurality of processing units and a plurality of caches arranged into a plurality of cache layers. The plurality of caches includes a plurality of first caches within a first cache layer, and one or more second caches within a second cache layer. A particular second cache in the second cache layer serves as a backing store for at least a particular first cache in the first cache layer. In these embodiments, the computing device(s) include control logic that configures at least the particular second cache to receive, from the particular first cache, a logging request referencing a particular memory address. Based on the request, the particular second cache determines whether a cache line corresponding to the memory address is present in the particular second cache. When the cache line is not present in the particular second cache, then (i) when there is no third cache that participates in logging and serves as a backing store for at least the particular second cache, the second cache causes the cache line to be logged; or (ii) when such a third cache does exist, the second cache forwards the request to the third cache.
When the cache line is present in the particular second cache, then (i) the second cache causes the cache line to be logged when the particular second cache has not determined the cache line to be logged, or has determined it to be logged but has not determined that the first cache is aware of the current value stored in the particular second cache's cache line; or (ii) the second cache determines that the cache line need not be logged when the particular second cache has determined the cache line to be logged and has determined that the first cache is aware of the current value stored in the particular second cache's cache line.
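The second embodiment's decision tree (forward upward, log, or do nothing) can be sketched as follows. This is a simplified model under stated assumptions: the class, its attribute names, and the way "the requester knows the current value" is tracked are all illustrative, not taken from the patent.

```python
# Illustrative sketch of an upper cache layer handling a logging request
# from a lower layer, per the second embodiment described above.

class LoggingCache:
    """One cache layer that participates in logging (all names assumed)."""

    def __init__(self, parent=None):
        self.parent = parent      # next-higher logging cache layer, if any
        self.lines = {}           # address -> cached value
        self.logged = set()       # addresses this layer knows are logged
        self.lower_has_current = set()  # addresses whose current value the requester holds

    def handle_logging_request(self, address, log):
        if address not in self.lines:
            if self.parent is None:
                log.append(address)        # no higher logging layer: log here
                self.logged.add(address)
            else:
                # A higher logging layer exists: forward the request upward.
                self.parent.handle_logging_request(address, log)
        elif address not in self.logged or address not in self.lower_has_current:
            log.append(address)            # unlogged, or requester lacks current value
            self.logged.add(address)
            self.lower_has_current.add(address)
        # else: logged, and requester knows the current value: nothing to do
```

For example, an L2 instance with an L3 parent forwards requests for lines it does not hold, logs lines it holds but has not yet logged, and suppresses redundant logging for lines it knows are both logged and current at the requester.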
Any embodiments described herein may also be implemented as method(s) performed by computing device(s) (e.g., such as a microprocessor) and/or computer-executable instructions (e.g., processor microcode) stored on a hardware storage device and executable to perform the method(s).
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
FIG. 1 illustrates an example computing environment that facilitates recording "bit-accurate" execution tracing across multiple processing units using at least two tiers or layers of processor cache;
FIG. 2A illustrates an example computing environment including multiple levels of caching;
FIG. 2B illustrates one example of a cache;
FIG. 3 illustrates a flowchart of an example method for trace recording in which an influx to a lower-level cache is logged by reference to prior log data, based on knowledge of one or more upper-level caches;
FIG. 4A illustrates an example shared cache, wherein each cache line includes one or more additional accounting bits;
FIG. 4B illustrates one example of a shared cache including one or more reserved cache lines for storing accounting bits applied to regular cache lines;
FIG. 5 illustrates an example of a set-associative mapping between system memory and cache;
FIG. 6 illustrates a flowchart of an example method in which an upper cache layer determines how an influx to a lower cache layer should be logged, based on a logging request from the lower cache layer;
FIG. 7 illustrates a flow diagram of an example method for managing a logging state of a cache line when a processing unit transitions between enabling logging and disabling logging;
FIG. 8 illustrates a flowchart of an example method for managing the logging state of a cache line when a processing unit for which logging is disabled receives the cache line exclusively from a parent cache for writing; and
FIG. 9 illustrates a flowchart of an example method for managing the logging state of a cache line when a processing unit writes to a cache line that the processing unit holds in an "owned" cache coherence protocol state.
Detailed Description
Embodiments described herein relate to mechanisms for creating bit-accurate "time travel" trace recordings using hardware assistance by a processor. These mechanisms are based on using at least two tiers or layers of processor caches to trace the effects of execution across multiple processing units. One mechanism modifies a processor's hardware and/or microcode so that when it detects an influx (i.e., a cache miss) to an inner or "lower-level" processor cache based on activity of a traced processing unit, it checks one or more outer or "upper-level" shared processor caches to determine whether the data of that influx has already been logged on behalf of another traced processing unit. Another mechanism modifies a processor's hardware and/or microcode so that one or more cache layers are configured to receive logging requests from lower cache layer(s), and to use their knowledge of logged cache lines to determine how an influx to a lower cache layer should be logged (if at all). Either mechanism may enable the influx to be logged by reference to a previous log entry, and each mechanism may be extended to "N" levels of caches. Recording trace files using either mechanism may require only modest processor modifications and, compared to prior trace recording approaches, may reduce both the performance impact of trace recording and trace file size by orders of magnitude.
FIG. 1 illustrates an example computing environment 100 that facilitates recording "bit-accurate" execution traces across multiple processing units using at least two tiers or layers of processor cache. As depicted, embodiments may comprise or utilize a special-purpose or general-purpose computer system 101 that includes computer hardware, such as, for example, one or more processors 102, system memory 103, one or more data stores 104, and/or input/output hardware 105.
Embodiments within the scope of the present invention include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by computer system 101. Computer-readable media storing computer-executable instructions and/or data structures are computer storage devices. Computer-readable media bearing computer-executable instructions and/or data structures are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can include at least two distinct categories of computer-readable media: computer storage devices and transmission media.
A computer storage device is a physical hardware device that stores computer-executable instructions and/or data structures. Computer storage devices include various computer hardware, such as RAM, ROM, EEPROM, solid state drives ("SSD"), flash memory, phase change memory ("PCM"), optical disk storage, magnetic disk storage or other magnetic storage devices, or any other hardware device(s) that can be used to store program code in the form of computer-executable instructions or data structures and which can be accessed and executed by computer system 101 to implement the disclosed functionality. Thus, for example, a computer storage device may include the depicted system memory 103, the depicted data store 104, or other storage (such as on-processor storage) as discussed later that may store computer-executable instructions and/or data structures.
Transmission media can include a network and/or data links which can be used to carry program code in the form of computer-executable instructions or data structures, and which can be accessed by computer system 101. A "network" is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired and wireless) to a computer system, the computer system may view the connection as a transmission medium. Combinations of the above should also be included within the scope of computer-readable media. For example, the input/output hardware 105 may comprise hardware (e.g., a network interface module, such as a "NIC") that connects a network and/or data link that can be used to carry program code in the form of computer-executable instructions or data structures.
In addition, program code in the form of computer-executable instructions or data structures may be transferred automatically from transmission media to computer storage devices (and vice versa) upon reaching various computer system components. For example, computer-executable instructions or data structures received over a network or data link may be buffered in RAM within a NIC (e.g., input/output hardware 105) and then ultimately transferred to system memory 103 and/or a less volatile data storage device (e.g., data store 104) at computer system 101. Thus, it should be understood that computer storage devices can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise instructions and data that, for example, when executed at processor(s) 102, cause computer system 101 to perform a particular function or group of functions (group). The computer-executable instructions may be, for example, binaries, intermediate format instructions (e.g., assembly language), or even source code.
Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including: personal computers, desktop computers, laptop computers, message processors, handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. Thus, in a distributed system environment, a computer system may include multiple component computer systems. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Those skilled in the art will also appreciate that the present invention may be practiced in cloud computing environments. A cloud computing environment may be distributed, although this is not required. When distributed, a cloud computing environment may be distributed internationally within an organization and/or have components owned across multiple organizations. In this specification and the appended claims, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services). The definition of "cloud computing" is not limited to any of the numerous other advantages that can be obtained from such a model when properly deployed.
The cloud computing model may be composed of various characteristics, such as on-demand self-service, wide network access, resource pooling, fast elasticity, measured services, and the like. The cloud computing model may also take the form of various service models, such as, for example, software as a service ("SaaS"), platform as a service ("PaaS"), and infrastructure as a service ("IaaS"). The cloud computing model may also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so on.
Some embodiments (e.g., cloud computing environments) may comprise a system having one or more hosts that are each capable of running one or more virtual machines. During operation, virtual machines emulate an operational computing system, supporting an operating system and perhaps one or more other applications as well. In some embodiments, each host includes a hypervisor that emulates virtual resources for the virtual machines using physical resources that are abstracted from the view of the virtual machines. The hypervisor also provides proper isolation between the virtual machines. Thus, from the perspective of any given virtual machine, the hypervisor provides the illusion that the virtual machine is interfacing with a physical resource, even though the virtual machine only interfaces with the appearance of a physical resource (e.g., a virtual resource). Examples of physical resources include processing capacity, memory, disk space, network bandwidth, media drives, and so forth.
Fig. 1 includes a simplified representation of the internal hardware components of the processor(s) 102. As illustrated, each processor 102 includes a plurality of processing units 102a. Each processing unit may be physical (i.e., a physical processor core) and/or logical (i.e., a logical core presented by a physical core that supports hyper-threading, in which more than one application thread executes at the physical core). Thus, for example, even though the processor 102 may in some embodiments include only a single physical processing unit (core), it could include two or more logical processing units 102a presented by that single physical processing unit.
Each processing unit 102a executes processor instructions that are defined by applications (e.g., tracker 104a, debugger 104b, operating system kernel 104c, application 104d, etc.), and the instructions are selected from among a predefined processor Instruction Set Architecture (ISA). The particular ISA of each processor 102 varies based on processor manufacturer and processor model. Common ISAs include the IA-64 and IA-32 architectures from INTEL, INC., the AMD64 architecture from ADVANCED MICRO DEVICES, INC., and the various ADVANCED RISC Machine ("ARM") architectures from ARM HOLDINGS, PLC, although a great number of other ISAs exist and may be used by the present invention. In general, an "instruction" is the smallest externally visible (i.e., external to the processor) unit of code that is executable by a processor.
Each processing unit 102a obtains processor instructions from one or more processor caches 102b, and executes the processor instructions based on data in the cache(s) 102b, based on data in registers 102d, and/or without input data. In general, each cache 102b is a small amount (i.e., small relative to the typical amount of system memory 103) of random-access memory that stores on-processor copies of portions of a backing store, such as the system memory 103 and/or another cache in the cache(s) 102b. For example, when executing the application code 103a, one or more of the caches 102b contain portions of the application runtime data 103b. If the processing unit(s) 102a request data not already stored in a particular cache 102b, then a "cache miss" occurs, and that data is fetched from the system memory 103 or another cache, potentially "evicting" some other data from that cache 102b.
Often, the processor cache(s) 102b are divided into separate tiers, layers, or levels, such as layer 1 (L1), layer 2 (L2), layer 3 (L3), etc. Depending on processor implementation, these layers may be part of the processor 102 itself (e.g., L1 and L2) and/or may be separate from the processor 102 (e.g., L3). Thus, the cache(s) 102b of FIG. 1 may comprise one of these layers (L1), or may comprise a plurality of these layers (e.g., L1 and L2, and even L3). To understand these concepts further, FIG. 2A illustrates an example environment 200 demonstrating multi-layer caches. In FIG. 2A, there are two processors 201a and 201b (e.g., each corresponding to a different processor 102 of FIG. 1) and a system memory 202 (e.g., corresponding to the system memory 103 of FIG. 1). In the example environment 200, each processor 201 includes four physical processing units (i.e., units A1-A4 for processor 201a, and units B1-B4 for processor 201b).
The example environment 200 also includes a three-layer cache hierarchy within each processor 201. Environment 200 is only one example layout, and is not limiting to the cache hierarchies in which the embodiments herein may operate. In environment 200, at the lowest or innermost layer, each processing unit is associated with its own dedicated L1 cache (e.g., L1 cache "L1-A1" in processor 201a for unit A1, L1 cache "L1-A2" in processor 201a for unit A2, etc.). Moving up a layer, each processor 201 includes two L2 caches (e.g., L2 cache "L2-A1" in processor 201a that serves as a backing store for L1 caches L1-A1 and L1-A2, L2 cache "L2-A2" in processor 201a that serves as a backing store for L1 caches L1-A3 and L1-A4, etc.). Finally, at the highest or outermost layer, each processor 201 includes a single L3 cache (e.g., L3 cache "L3-A" in processor 201a, which serves as a backing store for L2 caches L2-A1 and L2-A2, and L3 cache "L3-B" in processor 201b, which serves as a backing store for L2 caches L2-B1 and L2-B2). As shown, the system memory 202 serves as a backing store for the L3 caches L3-A and L3-B.
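The FIG. 2A topology can be written down as plain data, which makes the backing-store relationships explicit. This sketch only encodes the layout described above; the dictionary representation and helper function are illustrative conveniences, not part of the described embodiments.

```python
# The FIG. 2A cache topology: each cache names its backing store, and a
# small helper walks the chain a cache miss may traverse on its way to
# system memory.

backing_store = {
    "L1-A1": "L2-A1", "L1-A2": "L2-A1",
    "L1-A3": "L2-A2", "L1-A4": "L2-A2",
    "L2-A1": "L3-A",  "L2-A2": "L3-A",
    "L1-B1": "L2-B1", "L1-B2": "L2-B1",
    "L1-B3": "L2-B2", "L1-B4": "L2-B2",
    "L2-B1": "L3-B",  "L2-B2": "L3-B",
    "L3-A": "memory", "L3-B": "memory",
}

def path_to_memory(cache):
    """Chain of backing stores from a given cache out to system memory."""
    chain = [cache]
    while chain[-1] != "memory":
        chain.append(backing_store[chain[-1]])
    return chain
```

For instance, a miss originating at L1-A3 may be served, in turn, by L2-A2, then L3-A, then the system memory 202.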
As shown in FIG. 2A, when multiple cache layers are used, the processing unit(s) 102a typically interact directly with the lowest layer (L1). In most cases, data flows between the layers (e.g., on a read, the L3 cache interacts with the system memory 103 and serves data to the L2 cache, and the L2 cache in turn serves data to the L1 cache). When a processing unit 102a performs a write, the caches coordinate to ensure that those caches that had a copy of the affected data shared among the processing unit(s) 102a no longer have it. This coordination is performed using a cache coherency protocol (CCP).
The caches in environment 200 may thus be viewed as "shared" caches. For example, each L2 and L3 cache serves multiple processing units within a given processor 201 and is thus shared by those processing units. The L1 caches within a given processor 201, even though each corresponds to a single processing unit, may collectively be considered shared, since the individual L1 caches may coordinate with each other (i.e., via a CCP) to ensure coherency (i.e., so that each cached memory location is viewed consistently across all the L1 caches). The L2 caches within each processor 201 may similarly coordinate via a CCP. Additionally, if a processor 201 supports hyper-threading, each individual L1 cache may be viewed as being shared by two or more logical processing units, and is thus "shared" even at an individual level.
Typically, each cache comprises a plurality of "cache lines". Each cache line stores a chunk of memory from its backing store (e.g., the system memory 202 or a higher-layer cache). For example, FIG. 2B illustrates an example of at least a portion of a cache 203, which includes a plurality of cache lines 206, each of which comprises at least an address portion 204 and a value portion 205. The address portion 204 of each cache line 206 is configured to store an address in the system memory 202 to which the cache line corresponds, and the value portion 205 initially stores a value received from the system memory 202. The value portion 205 may be modified by the processing units, and is eventually evicted back to the backing store. As indicated by the ellipses, the cache 203 may include a large number of cache lines. For example, a contemporary 64-bit INTEL processor may contain individual L1 caches comprising 512 or more cache lines. In such a cache, each cache line is typically usable to store a 64-byte (512-bit) value in reference to a 6-byte (48-bit) to 8-byte (64-bit) memory address. As was illustrated visually in FIG. 2A, cache sizes typically increase with each layer (i.e., L2 caches are typically larger than L1 caches, L3 caches are typically larger than L2 caches, etc.).
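The cache-line behavior described above (address/value pairs, cache misses, and evictions back to the backing store) can be sketched with a toy cache. This is a deliberately simplified model: the two-line capacity, the LRU eviction policy, and the unconditional write-back on eviction are assumptions for illustration, not details from the patent.

```python
# A toy cache modeled after FIG. 2B: each line holds an (address, value)
# pair; a miss fetches from the backing store, evicting an existing line
# (least recently used) when the cache is full.

from collections import OrderedDict

class SimpleCache:
    def __init__(self, num_lines, backing):
        self.num_lines = num_lines
        self.backing = backing       # dict: address -> memory block
        self.lines = OrderedDict()   # address portion -> value portion

    def read(self, address):
        if address in self.lines:              # cache hit
            self.lines.move_to_end(address)
            return self.lines[address]
        # Cache miss: evict the LRU line if full, then fetch the data.
        if len(self.lines) >= self.num_lines:
            evicted_addr, evicted_val = self.lines.popitem(last=False)
            self.backing[evicted_addr] = evicted_val  # write back on eviction
        value = self.backing[address]
        self.lines[address] = value
        return value
```

With a two-line cache, reading three distinct addresses evicts the first line back to the backing store, mirroring the miss-and-evict cycle described above.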
The address stored in the address portion 204 of each cache line 206 may be a physical address, such as the actual memory address in the system memory 202. Alternatively, the address stored in the address portion 204 may be a virtual address, which is an address that is mapped to a physical address to provide an abstraction (e.g., using operating-system-managed page tables). Such abstractions can be used, for example, to facilitate memory isolation between different processes executing at the processor(s) 102, including isolation between user-mode processes and kernel-mode processes associated with the operating system kernel 104c. When virtual addresses are used, the processor 102 may include a Translation Lookaside Buffer (TLB) 102f (usually part of a Memory Management Unit (MMU)), which maintains recently used mappings between physical and virtual memory addresses.
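The TLB's role can be sketched as a small cache of recently used virtual-to-physical mappings, with a miss falling back to a page-table lookup. The 4 KiB page size, the dict-based page table, and the function names here are assumptions for illustration, not details of the described processor 102.

```python
# A minimal sketch of TLB behavior: recently used virtual-to-physical
# mappings are cached; a miss consults the page table and caches the
# resulting mapping for subsequent translations.

PAGE_SIZE = 4096  # assumed page size for illustration

def translate(vaddr, tlb, page_table):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                   # TLB hit: fast path
        return tlb[vpn] * PAGE_SIZE + offset
    frame = page_table[vpn]          # TLB miss: look up the page table
    tlb[vpn] = frame                 # cache the mapping for next time
    return frame * PAGE_SIZE + offset
```

A second translation of an address on the same page then hits the TLB and skips the page-table lookup entirely.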
The cache(s) 102b may include a code cache portion and a data cache portion. When executing the application code 103a, the code portion(s) of the cache(s) 102b may store at least a portion of the processor instructions stored in the application code 103a, and the data portion(s) of the cache(s) 102b may store at least a portion of the data structures of the application runtime data 103b. In addition, caches can be inclusive, exclusive, or include both inclusive and exclusive behaviors. For example, in an inclusive cache, an L3 layer stores a superset of the data in the L2 layers below it, and the L2 layers store a superset of the data in the L1 caches below them. In an exclusive cache, the layers may be disjoint, e.g., if the L3 cache holds data that an L1 cache needs, they may swap information, such as data, addresses, and the like.
Returning to FIG. 1, each processor 102 also includes microcode 102c, which comprises control logic (i.e., executable instructions) that controls operation of the processor 102, and which generally functions as an interpreter between the hardware of the processor and the processor ISA exposed by the processor 102 to executing applications. The microcode 102c is typically embodied in on-processor storage, such as ROM, EEPROM, or the like.
Registers 102d are hardware-based storage locations that are defined based on the ISA of the processor(s) 102 and that are read from and/or written to by processor instructions. For example, registers 102d are commonly used to store values fetched from the cache(s) 102b for use by instructions, to store the results of executing instructions, and/or to store status or state, such as some of the side effects of executing instructions (e.g., the sign of a value changing, a value reaching zero, the occurrence of a carry, etc.), a processor cycle count, and so on. Thus, some registers 102d may comprise "flags" used to signal some state change caused by executing processor instructions. In some embodiments, the processors 102 may also include control registers, which are used to control different aspects of processor operation. Although FIG. 1 depicts the registers 102d as a single box, it will be appreciated that each processing unit 102a typically includes one or more corresponding sets of registers 102d that are specific to that processing unit.
In some embodiments, the processor(s) 102 may include one or more buffers 102e. As will be discussed below, the buffer(s) 102e may be used as a temporary storage location for trace data. Thus, for example, the processor(s) 102 may store portions of trace data in the buffer(s) 102e, and flush that data to the trace data store 104e at appropriate times (e.g., when there is available memory-bus bandwidth and/or free processor cycles).
As alluded to above, the processor operates on the cache(s) 102b according to one or more CCPs. In general, a CCP defines how coherency is maintained among data in the various caches 102b as the various processing units 102a read from and write to data in the various caches 102b, and how to ensure that the various processing units 102a always read valid data from a given location in the cache(s) 102b. CCPs are related to, and enable, a memory model defined by the processor 102's ISA.
Examples of common CCPs include the MSI protocol (i.e., Modified, Shared, and Invalid), the MESI protocol (i.e., Modified, Exclusive, Shared, and Invalid), and the MOESI protocol (i.e., Modified, Owned, Exclusive, Shared, and Invalid). Each of these protocols defines states for individual locations (e.g., lines) in the cache(s) 102b. A "modified" cache location contains data that has been modified in the cache(s) 102b, and is therefore potentially inconsistent with the corresponding data in the backing store (e.g., the system memory 103 or another cache). When a location having the "modified" state is evicted from the cache(s) 102b, common CCPs require the cache to guarantee that its data is written back to the backing store, or that another cache takes over this responsibility. A "shared" cache location contains data that is unmodified from the data in the backing store, exists in a read-only state, and is shared by the processing unit(s) 102a. The cache(s) 102b can evict this data without writing it to the backing store. An "invalid" cache location contains no valid data, and can be considered empty and usable to store data from a cache miss. An "exclusive" cache location contains data that matches the backing store, and is used by only a single processing unit 102a. It may be changed to the "shared" state at any time (i.e., in response to a read request), or may be changed to the "modified" state when written to. An "owned" cache location is shared by two or more processing units 102a, but one of the processing units has the exclusive right to make changes to it. When that processing unit makes changes, it notifies the other processing units, since the notified processing units may need to invalidate or update their own caches, based on the CCP implementation.
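The MESI states described above can be illustrated by tracking the state of one cache line across several caches. This sketch is deliberately reductive: it models only the state transitions for reads and writes, ignores the Owned state and many real-protocol details (snooping, write-backs, bus transactions), and all names are illustrative.

```python
# A simplified sketch of MESI state transitions for a single cache line,
# where `states` maps cache name -> current MESI state of that line.

def read(states, cache):
    """A cache reads the line: any other holder drops to Shared."""
    others_hold = any(s in ("Modified", "Exclusive", "Shared")
                      for c, s in states.items() if c != cache)
    if others_hold:
        for c, s in states.items():
            if s in ("Modified", "Exclusive"):
                states[c] = "Shared"   # sole holder must share (after write-back)
        states[cache] = "Shared"
    else:
        states[cache] = "Exclusive"    # no other copies: read straight from memory

def write(states, cache):
    """A cache writes the line: all other copies are invalidated."""
    for c in states:
        if c != cache:
            states[c] = "Invalid"
    states[cache] = "Modified"
```

For example, a first read takes the line Exclusive, a read by a second cache demotes both copies to Shared, and a subsequent write by either cache invalidates the other copy and marks the writer Modified.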
The data store 104 may store computer-executable instructions representing application programs such as, for example, a tracker 104a, a debugger 104b, an operating system kernel 104c, and an application 104d (e.g., the application that is the subject of tracing by the tracker 104a). While these programs are executing (e.g., using the processor(s) 102), the system memory 103 may store corresponding runtime data, such as runtime data structures, computer-executable instructions, and so forth. Thus, fig. 1 illustrates the system memory 103 as including application code 103a and application runtime data 103b (e.g., each corresponding to the application 104d). The data store 104 may further store data structures, such as trace data stored within one or more trace data stores 104e. As indicated by the ellipses 104f, the data store 104 may also store other computer-executable instructions and/or data structures.
The tracker 104a may be used to record a bit-accurate trace of the execution of one or more entities (e.g., one or more threads of the application 104d or of the kernel 104c) and to store the trace data in the trace data store 104e. In some embodiments, the tracker 104a is a stand-alone application, while in other embodiments the tracker 104a is integrated into another software component (such as the kernel 104c, a hypervisor, a cloud fabric, etc.). Although the trace data store 104e is depicted as part of the data store 104, the trace data store 104e may also be embodied, at least in part, in the system memory 103, in the cache(s) 102b, in the buffer(s) 102e, or at some other storage device.
As mentioned, the tracker 104a records a bit-accurate trace of the execution of one or more entities. As used herein, a "bit-accurate" trace is a trace that includes data sufficient to enable code that was previously executed at one or more processing units 102a to be replayed such that it executes, at replay, in substantially the same manner as it did during tracing. There are a variety of approaches the tracker 104a might use to record bit-accurate traces, each with various benefits and drawbacks (e.g., in terms of tracing overhead, trace file size, required processor modifications, etc.). Some particular embodiments for recording such data are discussed later in connection with figs. 3-9.
Regardless of the recording approach used by the tracker 104a, it may record the trace data into one or more trace data stores 104e. By way of example, the trace data store 104e may include one or more trace files, one or more regions of system memory 103, one or more regions of a processor cache 102b (e.g., an L2 or L3 cache), a buffer 102e in the processor 102, or any combination thereof. The trace data store 104e may include one or more trace data streams. In some embodiments, for example, multiple entities (e.g., processes, threads, etc.) may each be traced to a separate trace file, or to a separate trace data stream within a given trace file. Alternatively, data packets corresponding to each entity may be tagged such that they are identified as corresponding to that entity. If multiple related entities (e.g., multiple threads of the same process) are to be traced, the trace data for each entity may be traced independently (enabling them to be replayed independently), but any events that can be ordered across the entities (e.g., accesses to shared memory) may be identified with a sequence number (e.g., a monotonically incrementing number) that is global across the independent traces. The trace data store 104e may be configured for flexible management, modification, and/or creation of trace data streams. For example, modification of an existing trace data stream may involve modification of an existing trace file, replacement of sections of trace data within an existing file, and/or creation of a new trace file that includes the modifications.
In some implementations, the tracker 104a may continuously append to the trace data stream(s), such that the trace data grows continuously during tracing. In other implementations, however, the trace data streams may be implemented as one or more ring buffers. In such an implementation, the oldest trace data is removed from the data stream(s) as new trace data is added to the trace data store 104e. Thus, when the trace data streams are implemented as buffer(s), they contain a rolling trace of the most recent execution at the traced process(es). The use of ring buffers may enable the tracker 104a to engage in tracing at all times, even on production systems. In some implementations, tracing may be enabled and disabled at practically any time, such as by setting or clearing one or more bits in one or more control registers. Thus, whether tracing to a ring buffer or appending to a conventional trace data stream, the trace data may include gaps between the periods during which tracing was enabled for one or more of the processing units 102a.
The debugger 104b may be used to consume (e.g., replay) the trace data generated by the tracker 104a into the trace data store 104e, in order to assist a user in performing debugging actions on the trace data (or derivatives thereof). For example, the debugger 104b could present one or more debugging interfaces (e.g., user interfaces and/or application programming interfaces), replay prior execution of one or more portions of the application 104d, set breakpoints/watchpoints (including reverse breakpoints/watchpoints), enable queries/searches over the trace data, and so forth.
Returning to the tracker 104a, in embodiments herein the tracker 104a utilizes the cache(s) 102b of the processor 102 to efficiently record a bit-accurate trace of the execution of the application 104d and/or the operating system kernel 104c. These embodiments are built upon the inventors' observation that the processor 102 (including the cache(s) 102b) forms a semi- or quasi-closed system. For example, once portions of data for a process (i.e., code data and runtime application data) are loaded into the cache(s) 102b, the processor 102 can run by itself, as a semi- or quasi-closed system, for bursts of time without any input. In particular, once the cache(s) 102b are loaded with data, the one or more processing units 102a execute instructions from the code portion(s) of the cache(s) 102b, using runtime data stored in the data portion(s) of the cache(s) 102b and using the registers 102d.
When a processing unit 102a needs some inflow of information (e.g., because an instruction it is executing, is about to execute, or may execute accesses code or runtime data that is not already in the cache(s) 102b), a "cache miss" occurs, and that information is brought into the cache(s) 102b from the system memory 103. For example, if a data cache miss occurs when an executed instruction performs a memory operation at a memory address within the application runtime data 103b, data from that memory address is brought into one of the cache lines of the data portion of the cache(s) 102b. Similarly, if a code cache miss occurs when an instruction performs a memory operation at a memory address within the application code 103a stored in the system memory 103, code from that memory address is brought into one of the cache lines of the code portion(s) of the cache(s) 102b. The processing unit 102a then continues execution using the new information in the cache(s) 102b until new information is again brought into the cache(s) 102b (e.g., due to another cache miss or an un-cached read).
The inventors have also observed that, in order to record a bit-accurate representation of the execution of an application, the tracker 104a may record data sufficient to reproduce the inflows of information into the cache(s) 102b as the processing units execute the application's thread(s). For example, one approach to recording these inflows operates on a per-processing-unit basis, at the innermost cache layer (e.g., L1). This approach may involve recording, for each processing unit being traced, all cache misses and un-cached reads (i.e., reads from hardware components and non-cacheable memory) associated with that processing unit's L1 cache, along with the time during execution at which each piece of data was brought into that processing unit's L1 cache (e.g., using a count of executed instructions or some other counter). If there are events that can be ordered across the processing units (e.g., accesses to shared memory), these events can be logged across the resulting data streams (e.g., by using a monotonically incrementing (or decrementing) number (MIN) that is global across the data streams).
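A minimal sketch of this per-processing-unit recording approach might look like the following. The class and field names are invented for illustration, and a real implementation would record into hardware buffers and trace files rather than Python lists.

```python
import itertools

# Hypothetical sketch of per-processing-unit inflow logging at the L1
# layer. Each traced unit gets its own trace stream; events that can be
# ordered across units (e.g., shared-memory accesses) additionally take
# a monotonically incrementing number (MIN) global across all streams.
_global_min = itertools.count()

class UnitTrace:
    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.events = []         # stand-in for this unit's trace stream
        self.instr_count = 0     # count of executed instructions

    def log_inflow(self, address, value, ordered=False):
        # Record the cache miss / un-cached read: which address flowed
        # in, its value, and when (instruction count) it arrived.
        event = {"addr": address, "value": value, "at": self.instr_count}
        if ordered:
            event["min"] = next(_global_min)  # cross-stream ordering
        self.events.append(event)

t0, t1 = UnitTrace(0), UnitTrace(1)
t0.log_inflow(0x1000, 42, ordered=True)
t1.log_inflow(0x1000, 42, ordered=True)  # same data logged twice
```

The last two lines illustrate the duplication problem discussed next: two units reading the same location each log the full address and value.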
However, because the L1 cache layer may include multiple distinct L1 caches, each associated with a different physical processing unit (e.g., as shown in fig. 2A), recording in this manner may log duplicate data, and may thus require more data than is strictly necessary for a "full-fidelity" trace. For example, if multiple physical processing units read from the same memory location (which may happen frequently in multi-threaded applications), this approach may log a cache miss for the same memory location and data for each of those multiple physical processing units. Notably, as used herein, a "full-fidelity" trace is any trace that contains sufficient information to enable a full replay of a traced entity (even though a particular "full-fidelity" trace may actually contain less data encapsulating the same information than could be recorded using alternative tracing techniques).
To further reduce trace file size, the inventors have developed improved recording techniques that utilize one or more upper cache layers to avoid recording at least a portion of this duplicate data. Instead, these improved techniques can log by reference to previously logged data or, in many circumstances, avoid logging altogether.
Logging cache misses at a lower cache level based on a processor checking knowledge of one or more upper cache levels
In first embodiments, a processor detects an inflow (i.e., cache miss) to an inner or "lower-level" processor cache (e.g., L1) based on activity by a first processing unit (such as a read from a particular memory address), and then checks one or more outer or "upper-level" shared processor caches to determine whether an inflow of the same data (i.e., the same memory address and the same value that was read by the first processing unit) has already been logged on behalf of a second, traced processing unit. If so, the processor can, when possible, log the later inflow by the first processing unit by reference to the prior inflow by the second processing unit.
To understand these embodiments, note that in most environments an upper-level cache is larger than the lower-level cache(s) below it, and it is frequently a backing store for multiple lower-level caches. For example, in the example environment of fig. 2A, each L2 cache is a backing store for two L1 caches, and each L3 cache is a backing store for two L2 caches (and, by extension, for four L1 caches). Thus, an upper-level cache can retain knowledge about multiple lower-level caches (e.g., in fig. 2A, L2 cache L2-A1 can retain knowledge about L1 caches L1-A1 and L1-A2, L2 cache L2-A2 can retain knowledge about L1 caches L1-A3 and L1-A4, and L3 cache L3-A can retain knowledge about L2 caches L2-A1 and L2-A2, as well as about L1 caches L1-A1, L1-A2, L1-A3, and L1-A4). By leveraging the knowledge of one or more upper cache layers, the embodiments herein enable many opportunities to log inflows caused by one processing unit by reference to inflows that were already logged on behalf of other processing units.
In accordance with these first embodiments, fig. 3 illustrates an example of a method 300 for trace recording, in which inflows to a lower-level cache are logged by reference to prior log data, based on knowledge of one or more upper-level caches. Fig. 3 is now described in the context of figs. 1 and 2.
In particular, fig. 3 operates in an environment, such as processor 102 or 201a, that includes a plurality of processing units, a plurality of N-level caches, and an (N+i)-level cache that is associated with two or more of the plurality of N-level caches and that is configured as a backing store for the plurality of N-level caches. In method 300 (and in the claims), N and i are positive integers, i.e., N ≧ 1, such that N equals 1, 2, 3, etc.; and i ≧ 1, such that i equals 1, 2, 3, etc. For example, referring to the processor 201a of fig. 2A, the processor includes a plurality of processing units A1, A2, etc. The processor 201a also includes a plurality of N-level caches L1-A1, L1-A2, etc. (i.e., where N equals 1). The processor 201a also includes an (N+i)-level cache that is associated with two or more of the plurality of N-level caches and that is configured as a backing store for the plurality of N-level caches. For example, the processor 201a includes an (N+i)-level cache L2-A1 (i.e., where N equals 1 and i equals 1) that is a backing store for N-level caches L1-A1 and L1-A2. In another example, the processor 201a includes an (N+i)-level cache L3-A (i.e., where N equals 1 and i equals 2) that is a backing store for N-level caches L1-A1, L1-A2, etc. The processor 102/201a operates method 300 based on control logic, such as microcode 102c and/or circuit logic.
As shown, method 300 includes an act 301 of, during execution at a first processing unit, detecting an inflow to an N-level cache. In some embodiments, act 301 comprises detecting an inflow to a first N-level cache of the plurality of N-level caches, the inflow comprising data stored at a memory location. For example, based on activity by processing unit A1, such as a requested memory access to the system memory 202 (e.g., resulting from normal or speculative execution of a first thread of the application 104d), a cache miss may occur in cache L1-A1 (i.e., where N equals 1). As such, a line of cache L1-A1 obtains an inflow of data, including the then-current value of the requested memory location. Depending on cache attributes (e.g., which upper cache layers exist, whether the cache architecture is inclusive or exclusive, etc.) and current cache state, the inflow may be sourced from the system memory 202 or from an upper-level cache (e.g., L2-A1 and/or L3-A).
Method 300 also includes an act 302 of, based on execution at a second processing unit, checking the (N+i)-level cache to determine whether the data of the inflow has already been logged. In some embodiments, act 302 comprises, based on detecting the inflow to the first N-level cache, checking the (N+i)-level cache to determine whether the data for the memory location has been previously logged on behalf of a second processing unit. For example, if i equals 1, such that the (N+i)-level cache comprises an (N+1)-level cache, the processor 201 might check an L2 cache, such as L2-A1 (which has knowledge of cache L1-A2 and processing unit A2). This check can be used to determine whether the data for the memory location was previously logged on behalf of processing unit A2. This data may have been previously logged, for example, based on prior execution of a second thread of the application 104d at processing unit A2 that caused a cache miss in cache L1-A2. In an alternative example, if i equals 2, such that the (N+i)-level cache comprises an (N+2)-level cache, the processor 201 might check an L3 cache, such as cache L3-A (which has knowledge of all the other caches in the processor 201). This check can be used to determine whether the data for the memory location was previously logged on behalf of any of processing units A2-A4 (e.g., based on prior execution of one or more other threads of the application 104d at one or more of processing units A2-A4 that caused cache miss(es) in caches L1-A2, L1-A3, and/or L1-A4). Note that in this second example, the L2 caches could be skipped in the check.
As shown, act 302 can be repeated any number of times, incrementing the value of i each time. While i would typically be incremented by 1 each time, there could be embodiments in which it is incremented by a positive integer greater than 1. The effect of repeating act 302 while incrementing i is to check multiple upper cache layers. For example, if i = 1, the processor 201 may check the L2 cache layer (e.g., L2-A1 and/or L2-A2) when act 302 is initially run. Then, if sufficient knowledge about the applicable memory location is not found in the L2 cache layer, the processor 201 may repeat act 302 with i = 2, checking the L3 cache layer (e.g., L3-A). This can continue for as many cache layers as the computing environment provides. If i is ever incremented by a value greater than 1, one or more cache layers could be skipped along the way. It will be appreciated that it may be beneficial to check multiple cache layers in architectures that provide exclusive caches, or caches that exhibit hybrid inclusive/exclusive behaviors, because in such architectures an outer cache layer may not be guaranteed to contain a full superset of the data in the inner cache layer(s).
In view of the foregoing, it will be appreciated that method 300 can operate in environments, such as processor 102 or 201a, in which i equals 1, such that the (N+i)-level cache comprises an (N+1)-level cache, and in which the processor further comprises an (N+2)-level cache that is configured as a backing store for the (N+1)-level cache. In these environments, checking the (N+1)-level cache to determine whether the data for the memory location has been previously logged on behalf of the second processing unit (i.e., act 302) can comprise determining that no cache line in the (N+1)-level cache corresponds to the memory location, and then checking the (N+2)-level cache to determine whether the data for the memory location has been previously logged on behalf of the second processing unit.
As shown, based on the results of act 302, the method includes act 303: logging the inflow by reference when the data has been logged; or act 304: when data has not been logged, the inflow is logged by value.
In some embodiments, act 303 comprises, when the data for the memory location has been previously logged on behalf of the second processing unit, causing the data for the memory location to be logged on behalf of the first processing unit by reference to the log data previously logged on behalf of the second processing unit. Continuing the examples above, for instance, if checking (N+1)-level cache L2-A1 and/or checking (N+2)-level cache L3-A results in a determination that the data/memory location was already logged on behalf of processing unit A2 (based on an inflow to cache L1-A2), then the processor 201a can cause the inflow to cache L1-A1 to be logged on behalf of processing unit A1 by reference to the log entry made for processing unit A2. Examples of how to log by reference are given later.
Turning to the alternative outcome of act 302, in some embodiments act 304 comprises, when the data for the memory location has not been previously logged on behalf of the second processing unit, causing the data for the memory location to be logged by value on behalf of the first processing unit. For example, if checking (N+1)-level cache L2-A1 and/or checking (N+2)-level cache L3-A results in a determination that the data/memory location has not been logged on behalf of another processing unit, then the processor 201a can cause the inflow to cache L1-A1 to be logged by value on behalf of processing unit A1. Logging by value can include, for example, logging the memory address and the memory value in a data packet for processing unit A1. Note that logging by value can include any number of compression techniques to reduce the number of bits needed to accomplish the actual logging.
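Acts 301-304 can be summarized in a short sketch. The data structures here are hypothetical stand-ins for processor state, and the per-layer knowledge lookup abstracts over the concrete mechanisms (accounting bits, way-locking, CCP state) that the patent describes for implementing the check.

```python
# Sketch of method 300: on an inflow to an N-level cache, check upper
# cache layers for knowledge that the same data was already logged for
# another processing unit, then log by reference or by value.
def handle_inflow(unit, address, value, cache_layers, trace):
    """cache_layers[i] maps address -> the unit that previously logged
    it (a stand-in for the (N+i)-level caches' logging knowledge)."""
    for upper in cache_layers:              # act 302, repeated with i++
        logged_by = upper.get(address)
        if logged_by is not None and logged_by != unit:
            # act 303: log by reference to the earlier log entry
            trace.append((unit, "ref", address, logged_by))
            return "by_reference"
    # act 304: no upper layer knows of a prior log entry
    trace.append((unit, "value", address, value))
    cache_layers[0][address] = unit         # record the new knowledge
    return "by_value"

trace = []
layers = [{}, {}]                           # e.g., L2 and L3 knowledge
handle_inflow(2, 0x2000, 7, layers, trace)  # first inflow: by value
handle_inflow(1, 0x2000, 7, layers, trace)  # same data: by reference
```

The second call records only a small reference packet instead of repeating the address and value, which is the trace-size saving these embodiments target.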
As was described in connection with fig. 1, the processor(s) 102 can include buffer(s) 102e that can be used for temporarily storing trace data. Thus, in method 300, "causing" different types of data to be logged could comprise the processor 102 storing such data into the buffer(s) 102e. Additionally or alternatively, it could include the processor 102 communicating such data to the tracker 104a, writing such data to the trace data store 104e, and/or notifying the tracker 104a that the data is available in the buffer(s) 102e. In some embodiments, the buffer(s) 102e could comprise one or more reserved portions of the cache(s) 102b. Thus, when using the buffer(s) 102e, causing the data for the memory location to be logged (by reference or by value) on behalf of the first processing unit in acts 303/304 could comprise deferring the logging based on availability of resources such as processor cycles, memory locations, bus bandwidth, etc. In embodiments in which the buffer(s) 102e comprise one or more reserved portions of the cache(s) 102b, deferred logging could include invalidating a cache line (in the N-level cache and/or in the (N+i)-level cache), rather than evicting it, in order to retain the data for the memory location for purposes of the deferred logging.
The description of method 300 has been directed to an upper level cache having "knowledge" about a lower level cache. The particular form of "knowledge" that the upper level cache retains about the lower level cache may vary, as exemplified below.
In its most basic form, this "knowledge" could be the mere presence, in the upper-level cache, of a cache line that corresponds to cache line(s) in the lower-level cache(s) (i.e., cache lines corresponding to the same memory location and memory data). As mentioned above, in inclusive caches the upper layer(s) store a superset of the data in the layer(s) below them. For example, suppose the caches of fig. 2A are inclusive. In that case, when activity by processing unit A2 causes a location from the system memory 202 to be imported into cache L1-A2, that same memory location is also cached in caches L2-A1 and L3-A. If the activity of processing unit A2 is being traced, embodiments may cause the memory location and its value to be logged on behalf of processing unit A2. Later, if activity by processing unit A1 causes that same location from the system memory 202 to be imported into cache L1-A1, and the location still stores the same data, the location is served from cache L2-A1, since cache L2-A1 already has the data. Prior techniques might again log this data for processing unit A1 based on this inflow to cache L1-A1. Embodiments herein, by contrast, can recognize that the memory location and its value already existed in cache L2-A1, and thus already existed in cache L1-A2. Because processing unit A2 is being logged, embodiments can recognize that the memory location and its value were already logged on behalf of processing unit A2, and can therefore cause this new activity by processing unit A1 to be logged by reference to the log data previously logged on behalf of processing unit A2.
More elaborate forms of "knowledge" of the upper level cache are also possible. For example, embodiments may extend a cache line in one or more cache layers with additional "accounting" (or logging) bits that enable processor 102 to identify, for each cache line implementing the accounting bits, whether the cache line has been logged (possibly with the identity of the processing unit(s) that logged the cache line). To understand these concepts, fig. 4A illustrates an example shared cache 400a similar to shared cache 203 of fig. 2B, where each of cache lines 404 includes one or more additional accounting bits 401. Thus, each cache line 404 includes accounting bit(s) 401, regular address bits 402, and value bits 403.
Alternatively, fig. 4B illustrates an example of a shared cache 400b that includes conventional cache lines 405, which store memory addresses 402 and values 403, as well as one or more reserved cache line(s) 406 for storing accounting bits that apply to the conventional cache lines 405. The bits of the reserved cache line(s) 406 are allocated into different groups of accounting bits, each group corresponding to a different one of the conventional cache lines 405.
In a variation of the example of fig. 4B, the reserved cache line(s) 406 could be reserved as one (or more) way(s) in each index of a set-associative cache (which is discussed in more detail later). For example, in an 8-way set-associative cache, one way in a set could be reserved for accounting bits that apply to the other seven ways in the set. This can decrease the complexity of implementing reserved cache lines, and can speed access to the reserved cache lines, since all ways in a given set are typically read in parallel by most processors.
Regardless of how the accounting bits are actually stored, each cache line's accounting bit(s) 401 could comprise one or more bits that function as a flag (i.e., on or off), used by the processor(s) 102 to indicate whether the present value in the cache line was logged on behalf of a processing unit (or, alternatively, whether it was consumed by a processing unit that participates in logging). Thus, the check in act 302 can include using this flag to determine whether the cache line has been logged by a processing unit that participates in logging.
Alternatively, the accounting bits 401 for each cache line may comprise a plurality of bits. The multiple bits may be used in several ways. Using one approach, referred to herein as "unit bits," the accounting bits 401 for each cache line may include a number of unit bits equal to the number of processing units 102a of the processor 102 (e.g., the number of logical processing units if the processor 102 supports hyper-threading, or the number of physical processing units if hyper-threading is not supported). These unit bits may be used by processor 102 to track which particular processing unit or units have logged a cache line (if any). Thus, for example, a cache shared by two processing units 102a may associate two unit bits with each cache line.
In another approach that uses multiple accounting bits 401, referred to herein as "index bits," each cache line's accounting bits 401 could comprise a plurality of index bits sufficient to represent an index for each of the processing units 102a of the processor(s) 102 of the computer system 101 that participate in logging, potentially along with a "reserved" value (e.g., -1). For example, if the processor 102 includes 128 processing units 102a, these processing units could be identified by an index value (e.g., 0-127) using only seven index bits per cache line. In some embodiments, one index value is reserved (e.g., "invalid") to indicate that no processor has logged the cache line. Thus, the seven index bits would actually be able to represent 127 processing units 102a, plus the reserved value. For example, binary values 0000000-1111110 might correspond to index locations 0-126 (decimal), while binary value 1111111 (e.g., -1 or 127 decimal, depending on interpretation) might correspond to "invalid." Thus, the index bits can be used by the processor 102 both to indicate whether the cache line has been logged (e.g., any value other than "invalid") and as an index to the particular processing unit that logged the cache line (e.g., the processing unit that most recently consumed it). This second approach to using multiple accounting bits 401 has the advantage of supporting a great number of processing units with little overhead in the cache 102b, at the disadvantage of less granularity than the first approach (i.e., only one processing unit is identified at a time).
In view of the foregoing, it will be appreciated that, in act 302, checking the (N+i)-level cache to determine whether the data for the memory location has been previously logged on behalf of the second processing unit can comprise determining whether the cache line corresponding to the memory location in the (N+i)-level cache has one or more accounting bits set.
Another mechanism that can be used to determine whether a cache line has been logged is to utilize set-associative caches and way-locking. Since a processor's cache 102b is generally much smaller than the system memory 103 (often by orders of magnitude), there are usually far more memory locations in the system memory 103 than there are lines in any given layer of the cache 102b. As such, some processors define mechanisms for mapping multiple memory locations of system memory to the line(s) of one or more cache layers. Processors generally employ one of two general techniques: direct mapping and associative (or set-associative) mapping. Under direct mapping, different memory locations in the system memory 103 are mapped to just one line in a cache layer, such that each memory location can only be cached into one particular line in that layer.
Under set-associative mapping, on the other hand, different locations in the system memory 103 can be cached to one of multiple lines in a cache layer. Fig. 5 illustrates an example 500 of set-associative mapping between system memory and a cache. Here, cache lines 504 of a cache layer 502 are logically partitioned into different sets of two cache lines each, including a first set of two cache lines 504a and 504b (identified as index 0) and a second set of two cache lines 504c and 504d (identified as index 1). Each cache line in a set is identified as a different "way," such that cache line 504a is identified by index 0, way 0, cache line 504b is identified by index 0, way 1, and so on. As further depicted, memory locations 503a, 503c, 503e, and 503g (memory indices 0, 2, 4, and 6) are mapped to index 0. As such, each of these locations in system memory can be cached to either cache line within the set at index 0 (i.e., cache lines 504a and 504b). The particular pattern of the depicted mappings is for illustrative and conceptual purposes only, and should not be interpreted as the only way in which memory indices could be mapped to cache lines.
A set-associative cache is generally referred to as an N-way set-associative cache, where N is the number of "ways" in each set. Thus, the cache 500 of fig. 5 could be referred to as a 2-way set-associative cache. Processors commonly implement N-way caches where N is a power of two (e.g., 2, 4, 8, etc.), with N values of 4 and 8 being commonly chosen (though the embodiments herein are not limited to any particular N value or subset of N values). Notably, a 1-way set-associative cache is generally equivalent to a direct-mapped cache, since each set contains only one cache line. Additionally, if N equals the number of lines in the cache, the cache is referred to as a fully associative cache, since it comprises a single set containing all the lines in the cache. In a fully associative cache, any memory location can be cached to any line in the cache.
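The lookup behavior of the 2-way cache of fig. 5 can be sketched as follows. This is a simplified model that maps one memory index per line and omits tags and line offsets; the function names are invented for illustration.

```python
# Sketch of set-associative lookup for the 2-way cache of FIG. 5:
# 4 cache lines, 2 sets (indices 0 and 1), 2 ways per set.
NUM_SETS, NUM_WAYS = 2, 2

def set_index(memory_index):
    # Even memory indices (0, 2, 4, 6) map to set 0 and odd ones to
    # set 1, matching the mapping depicted in FIG. 5.
    return memory_index % NUM_SETS

# cache[set][way] holds (memory_index, value) or None when empty
cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

def lookup(memory_index):
    """Return (set_index, way) on a cache hit, or None on a miss."""
    s = set_index(memory_index)
    for way, line in enumerate(cache[s]):
        if line is not None and line[0] == memory_index:
            return s, way
    return None

cache[0][1] = (4, "data@4")   # memory index 4 cached at index 0, way 1
```

A memory index can land in any way of its set, so the lookup probes every way of one set rather than the whole cache, which is what allows all ways of a set to be read in parallel in hardware.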
Note that fig. 5 represents a simplified view of the system memory and cache in order to illustrate the general principles. For example, although FIG. 5 maps individual memory locations to cache lines, it should be understood that each line in the cache may store data in system memory relating to multiple addressable locations. Thus, in FIG. 5, each location (503a-503h) in system memory (501) may actually represent a plurality of addressable memory locations. Additionally, the mapping may be between actual physical addresses in system memory 501 and lines in cache 502, or an intermediate layer of virtual addresses may be used.
Set-associative caches can be used for determining whether a cache line has been logged through the use of way-locking. Way-locking locks or reserves one or more ways in a cache for some purpose. In particular, the embodiments herein can utilize way-locking to reserve one or more ways for a processing unit that is being traced, such that the locked/reserved ways are used exclusively for storing cache misses relating to execution of that unit. Thus, referring back to fig. 5, if "way 0" were locked for a traced processing unit, then cache lines 504a and 504c (i.e., index 0, way 0 and index 1, way 0) would be used exclusively for cache misses relating to execution of that unit, and the remaining cache lines would be used for all other cache misses. Thus, in order to determine whether a particular cache line has been logged, the processor 102 need only determine whether the cache line stored in the (N+i)-level cache layer is part of a way that was reserved for a traced processing unit.
In view of the foregoing, it will be appreciated that, in act 302, checking the (N+i)-level cache to determine whether the data for the memory location has been previously logged on behalf of the second processing unit may comprise determining whether a cache line in the (N+i)-level cache that corresponds to the memory location is stored in a way that corresponds to a logged processing unit.
As explained earlier, the caches operate according to a CCP, which defines how coherency is maintained among the various caches as the processing units read from and write to cache data, and how to ensure that the processing units always read valid data from a given location in the cache. Thus, in connection with operating the caches, processor 102 maintains and stores CCP state data. The granularity with which different processors and/or different CCPs track cache coherency state, and make that cache coherency data available to tracker 104a, can vary. For example, at one end of the spectrum, some processors/CCPs track cache coherency per cache line and per processing unit. These processors/CCPs can therefore track the state of each cache line as it relates to each processing unit; that is, a single cache line can carry state information with respect to every processing unit 102a. Other processors/CCPs track at a coarser granularity, tracking cache coherency only at the level of the cache line (and lacking per-processing-unit information). At this other end of the spectrum, processor manufacturers may choose to track cache coherency only at the level of the cache line for efficiency, since only one processing unit can hold a line exclusively (exclusive, modified, etc.) at a time. As an example of intermediate granularity, a processor/CCP could track cache coherency per cache line, along with an index to the processing unit that has the current cache line state (e.g., indexes 0, 1, 2, and 3 for a four-processing-unit processor).
Regardless of the granularity with which CCP state data is maintained at a given processor, this CCP state data can be included in the "knowledge" that the (N+i)-level cache has about cached data. In particular, the CCP state data associated with a given cache line in the (N+i)-level cache can be used to determine whether that cache line has been logged by one of the processing units. For example, if the CCP state data indicates that a particular processing unit has taken a given cache line as "shared," this data can, in turn, be used to determine that the processing unit has logged a read from the cache line. Thus, it will be appreciated that, in act 302, checking the (N+i)-level cache to determine whether the data for the memory location has been previously logged on behalf of the second processing unit could comprise determining whether a cache line in the (N+i)-level cache that corresponds to the memory location has associated CCP state data that can be used to determine that the cache line has been logged.
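The "shared implies a logged read" inference above can be sketched with a toy per-unit CCP state table. The states, granularity, and address values here are illustrative assumptions, not the patent's CCP:

```python
# Assumed per-line, per-unit CCP state (finest granularity described above).
ccp_state = {
    0x1000: {"A1": "shared", "A2": "shared"},  # both units have read the line
    0x2000: {"A2": "modified"},                # A2 holds the line exclusively
}

def previously_logged_read(line_addr, unit):
    """If the CCP shows `unit` holding the line as 'shared', infer that the
    unit has logged a read from that cache line."""
    return ccp_state.get(line_addr, {}).get(unit) == "shared"
```

A check like this is what allows act 302 to answer "has this data already been logged on behalf of that unit?" without consulting the trace itself.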
In act 303, the inflow of data can be logged by reference to previously-logged data (typically data logged on behalf of a processing unit other than the one that caused the current inflow). Logging by reference can be accomplished using one or more of a variety of methods (including combinations thereof), some of which are described below.
A first method logs by reference to a previously-logged memory address. For example, suppose processing unit A2 in FIG. 2A has logged data representing a particular memory address (i.e., in system memory 202) and particular data stored at that memory address. Later, if that particular memory address/particular data is an inflow for processing unit A1, A1 could store a log entry identifying (i) the particular memory address, and (ii) processing unit A2. Here, processing unit A1 avoids re-logging the actual data stored at the memory address (which could be of considerable size). Some variants of this first method could also store ordering data, such as a monotonically incrementing number (MIN) from a series that increments across processing units A1 and A2. This MIN can later be used to order this inflow at processing unit A1 relative to one or more events at processing unit A2 (e.g., events that are also associated with MINs from the same series). Thus, in act 303, causing the data for the memory location to be logged on behalf of the first processing unit, by reference to log data that was previously logged on behalf of the second processing unit, could comprise one or more of logging the address of the memory location, or logging the address of the memory location together with ordering data (such as a MIN).
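A sketch of this first method follows. The entry fields and the shared MIN counter are illustrative assumptions; the point is that A1's by-reference entry carries no data payload, only the address, the referenced unit, and an ordering number:

```python
import itertools

min_counter = itertools.count(1)  # shared series of monotonically
                                  # incrementing numbers (MINs)

def log_by_value(unit, address, data):
    """Full log entry: address plus the (potentially large) data value."""
    return {"unit": unit, "address": address, "data": data,
            "min": next(min_counter)}

def log_by_reference(unit, address, referenced_unit):
    """By-reference entry: identifies the address and the unit whose log
    already holds the value, avoiding re-logging the data itself."""
    return {"unit": unit, "address": address, "ref": referenced_unit,
            "min": next(min_counter)}

a2_entry = log_by_value("A2", 0x1000, b"\x2a" * 64)   # A2 logs the inflow
a1_entry = log_by_reference("A1", 0x1000, "A2")       # A1 references A2
```

At replay, A1's entry is resolved by finding A2's earlier entry for the same address; the two MINs establish that A2's logging happened first.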
A second method logs by reference to the previous owner of a cache line storing the data. For example, suppose that processing unit A2 in FIG. 2A has logged a first inflow of data. Suppose also that this first inflow caused the data to be cached in a cache line of an (N+i)-level cache (e.g., cache L2-A1), with processing unit A2 being identified as the owner of the cache line. Later, if processing unit A1 causes a second inflow of the same data, processing unit A1 could become the owner of this cache line in the (N+i)-level cache. Processing unit A1 could then store a log entry that identifies the prior owner of the cache line (i.e., processing unit A2), so that A2's log entry can later be used to obtain the data. This means that logging by reference could involve recording the identity of a cache line along with the prior owner of the cache line (potentially avoiding recording memory addresses and memory values). Thus, in act 303, causing the data for the memory location to be logged on behalf of the first processing unit, by reference to log data that was previously logged on behalf of the second processing unit, could comprise logging the second processing unit as a previous owner of a cache line corresponding to the memory location.
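This ownership-transfer method might be sketched as below. The ownership table and the (set, way) line identity are assumptions for illustration:

```python
line_owner = {}  # (set_index, way) -> processing unit that owns the line

def take_ownership(line, new_owner):
    """Transfer ownership of a cache line. If there was a prior owner,
    return a by-reference log entry naming it (that owner's log can later
    supply the data); a first inflow must be logged some other way."""
    previous = line_owner.get(line)
    line_owner[line] = new_owner
    if previous is None:
        return None
    return {"line": line, "previous_owner": previous}

take_ownership((0, 1), "A2")           # A2's first inflow (logged by value)
entry = take_ownership((0, 1), "A1")   # A1 logs by reference to A2
```

Note that the by-reference entry records only the line identity and the prior owner, not the memory address or value.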
A third method logs by reference to CCP data. For example, as mentioned, a CCP can store cache coherency state as different processing units read from and write to each cache line. The granularity of this data can vary by processor implementation, but it could, for example, track the cache coherency state of each cache line as it relates to each processing unit, or track the cache coherency state of each cache line along with an index (e.g., 0, 1, 2, 3, etc.) to the processing unit that owns the current cache line state, and so forth. The third method utilizes available CCP data to track which processing unit(s) previously owned the cache coherency state of a cache line, which cache coherency state can then be used to identify which processing unit(s) have logged the value of the cache line. This means that logging by reference could involve recording CCP data for a cache line (again, potentially avoiding recording memory addresses and memory values). Thus, in act 303, causing the data for the memory location to be logged on behalf of the first processing unit, by reference to log data that was previously logged on behalf of the second processing unit, could comprise logging CCP data referencing the second processing unit.
A fourth method logs by reference to cache way. As mentioned, a set associative cache can be used, through use of way locking, to determine whether a cache line has been logged. For example, suppose way locking is used to reserve one or more ways for processing unit P2, and that P2 logs a first inflow of data. This first inflow also causes the (N+i)-level cache (e.g., cache L2-A1) to store the first inflow's data in a cache line associated with the reserved way. When another processing unit (e.g., P1) has a second inflow of the same data, the presence of this cache line in the (N+i)-level cache indicates that P2 has already logged the data. Embodiments could log a reference to P2's log data based on noting the way in which the cache line is stored, again potentially avoiding recording memory addresses and memory values. This embodiment could also be used in connection with recording ordering information (e.g., MINs) to order events between P1 and P2. Thus, in act 303, causing the data for the memory location to be logged on behalf of the first processing unit, by reference to log data that was previously logged on behalf of the second processing unit, could comprise one or more of logging a reference to a cache way, or logging a reference to a cache way together with ordering data.
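A hedged sketch of this fourth method: the per-unit way reservations and log-entry shapes are assumptions, but they show how the way a line sits in identifies whose log already holds the value, with a MIN carried for cross-unit ordering:

```python
# Assumed way-lock reservations: each traced unit gets one dedicated way.
reserved_way_for = {"P2": 0, "P1": 1}

def log_inflow(requesting_unit, line_way, min_value):
    """If the inflow's cache line sits in a way reserved for some *other*
    unit, that unit has already logged the data, so log only a reference
    to the way (plus ordering data); otherwise fall back to by-value."""
    for unit, way in reserved_way_for.items():
        if unit != requesting_unit and way == line_way:
            return {"ref_way": way, "ref_unit": unit, "min": min_value}
    return {"by_value": True, "min": min_value}
```

Here a P1 inflow that lands on a line in way 0 (P2's reserved way) is logged as a reference to that way, while an inflow with no such match is logged by value.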
In addition to logging an inflow at a first processing unit based on a prior inflow by a second processing unit, embodiments also include optimizations that reduce (and can even eliminate) logging when a single processing unit has multiple inflows of the same data. For example, referring to FIG. 2A, processing unit A1 could cause a cache miss in an N-level cache (e.g., the L1-A1 cache) for particular data at a memory location. In response, the cache hierarchy can import that data into the L1-A1 cache, and potentially also into an (N+i)-level cache (e.g., the L2-A1 cache and/or the L3-A cache). In addition, the inflow can be logged by value on behalf of processing unit A1. Later, the data could be evicted from the L1-A1 cache. In typical cache environments, this could result in the data also being proactively evicted from the L2-A1 and/or L3-A caches. However, rather than causing eviction(s) in the L2-A1 and/or L3-A caches, embodiments could instead retain the appropriate cache line(s) in one or more of these (N+i)-level caches. Thus, method 300 could comprise evicting a first cache line in the first N-level cache that corresponds to the memory location, while retaining a second cache line in the (N+i)-level cache that corresponds to the memory location.
Later, if processing unit A1 causes a subsequent cache miss in the L1-A1 cache for the same data, the retained cache line(s) in the (N+i)-level cache(s) (e.g., the L2-A1 and/or L3-A caches) can be used to determine that this data was already logged on behalf of processing unit A1. Thus, in some embodiments, this subsequent cache miss is logged on behalf of processing unit A1 by reference to the prior log entry. In other embodiments, a log entry for this subsequent cache miss is omitted entirely, since processing unit A1 already has the data in its trace. Thus, method 300 could comprise, based on detecting a subsequent inflow to the first N-level cache that also includes the data stored at the memory location, causing the subsequent inflow to be logged by reference based on the presence of the second cache line. Additionally, or alternatively, method 300 could comprise (i) detecting a subsequent inflow to the first N-level cache, based on additional code execution at the first processing unit, the subsequent inflow also including the data stored at the memory location, and (ii) based at least on detecting the subsequent inflow to the first N-level cache, and based at least on the presence of the second cache line, determining that the subsequent inflow need not be logged.
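The eviction optimization above might be sketched as follows (the two dictionaries standing in for the L1 and (N+i)-level caches are assumptions):

```python
l1 = {}            # address -> data in the N-level (L1) cache
l2_logged = set()  # addresses whose (N+i)-level line is retained as logged

def miss(address, data):
    """Handle an L1 cache miss; return True iff a log entry is needed."""
    l1[address] = data
    if address in l2_logged:
        # The retained (N+i)-level line shows this data is already in the
        # unit's trace, so the log entry can be omitted entirely.
        return False
    l2_logged.add(address)
    return True   # first inflow: log (e.g., by value)

def evict_from_l1(address):
    """Evict from L1 while *retaining* the logged (N+i)-level line."""
    del l1[address]
```

The first miss on an address is logged; after an L1 eviction, a repeat miss on the same address produces no new log entry.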
It will be appreciated that the first embodiment of logging at a lower cache level based on a processor examining one or more upper cache levels may be implemented as processor control logic (e.g., circuitry and/or microcode) implementing the method 300 of FIG. 3. As such, the processor 102 implementing this embodiment may include processor control logic that detects the inflow to a lower level (e.g., L1) cache, and then checks one or more upper level caches (potentially step-by-step) to determine whether the inflow may be logged by reference, or even whether the inflow need not be logged at all, as described in method 300.
Logging cache misses at a lower cache layer based on the lower cache layer sending logging request(s) to upper cache layer(s)
In a second embodiment, based on activity by a first processing unit (such as a read from a particular memory address), the processor detects an inflow (i.e., cache miss) to a lower-layer processor cache (e.g., L1), and that lower-layer processor cache requests that an upper-layer cache log the inflow and/or inform the lower-layer cache how to log the inflow. The upper-layer cache then determines whether and how the inflow needs to be logged (e.g., by value or by reference) and/or, if it lacks the knowledge necessary to service the request, passes the request along to yet another upper-layer cache (if one exists). This can continue through any number of cache layers.
A processor 102 implementing this second embodiment can potentially do so by implementing common (or at least very similar) control logic at all of the upper cache layer(s), or at least at all of the upper cache layer(s) that participate in logging. In some implementations, the control logic needed to implement the second embodiment may be less extensive than the control logic needed to implement the first embodiment, while still providing many (or all) of the same advantages that derive from logging inflows at a lower cache layer by leveraging the knowledge of the upper cache layer(s). Additionally, since the cache layers already pass CCP messages between one another in most processors, the control logic needed to implement the second embodiment can potentially be implemented as extensions to existing control logic.
In accordance with this second embodiment, FIG. 6 illustrates a flowchart of an example method 600 for an upper cache layer to determine how to log an inflow by a lower cache layer, based on a logging request from that lower cache layer. Like method 300, method 600 may be implemented in a microprocessor environment, such as the example environment of FIG. 2A, which depicts a processor 201a that includes a plurality of processing units (e.g., two or more of processing units A1-A4) and a plurality of caches arranged into a plurality of cache layers. These caches can include a plurality of first caches within a first cache layer (e.g., two or more of caches L1-A1 through L1-A4), and one or more second caches within a second cache layer (e.g., one or more of caches L2-A1 or L2-A2, or cache L3-A). These caches can include a particular second cache in the second cache layer (e.g., L2-A1 or L3-A) that serves as a backing store for at least a particular first cache in the first cache layer (e.g., L1-A1). For simplicity, method 600 refers to the particular first cache as the "first cache" and the particular second cache as the "second cache." The microprocessor environment can include control logic (e.g., microcode 102c and/or circuitry) for performing the method. In some embodiments, such control logic is implemented at one or more upper cache layers (e.g., cache layers L2 and/or L3 in FIG. 2A).
Method 600 is performed at the second cache mentioned above, which participates in logging, and begins at act 601, where the second cache receives a logging request from an inner cache layer. In some embodiments, act 601 can comprise the second cache receiving, from the first cache, a logging request referencing a particular memory address. For example, cache L2-A1 in the L2 cache layer (or cache L3-A, if method 600 is being performed at the L3 cache layer) could receive a logging request from cache L1-A1 in the L1 cache layer. This logging request could be based on activity by processing unit A1 (such as a read from a particular memory address, e.g., in system memory 202) that causes an inflow of data to the first cache L1-A1. In the environment of FIG. 2A, the data in this inflow could be served from cache L2-A2, from cache L3-A, or from system memory 202.
Based on this request, method 600 proceeds to act 602, in which the second cache determines whether a cache line for the memory address exists at this cache layer. In some embodiments, act 602 can comprise, based on the request, determining whether a cache line corresponding to the memory address exists in the second cache. For example, based on receiving the request, cache L2-A1 in the L2 cache layer (or cache L3-A, if method 600 is being performed at the L3 cache layer) could determine whether it contains a cache line caching the particular memory address from the logging request. While such a cache line will typically exist when the cache hierarchy is inclusive (e.g., when the second cache stores a superset of the data in the cache(s) in the first cache layer below it), it will be appreciated that a cache line might not exist if the cache hierarchy is exclusive, or exhibits some exclusive behaviors.
Following the "no" branch from act 602 (i.e., when no cache line exists in the second cache), method 600 reaches act 603, in which the second cache may determine whether it is the outermost logging cache layer. As will be discussed, based on the outcome of act 603, method 600 can include (i) the second cache causing the cache line to be logged (i.e., following the path to act 608) when no third cache (e.g., within a third cache layer) exists that participates in logging and serves as a backing store for at least the second cache, or (ii) the second cache forwarding the request to a third cache (i.e., following the path to act 606) when such a third cache does exist.
For example, if the second cache is cache L2-A1, then at act 603 the second cache could determine whether cache L3-A exists and participates in logging (in which case L2-A1 is not the outermost logging cache layer). Notably, it will be understood that, in some implementations, a cache might participate in logging at one point in time and not participate in logging at another point in time, depending on a current configuration of the processor. In another example, if the second cache is cache L3-A, then at act 603 the second cache could determine that no outer cache layer exists, and thus that it is the outermost logging cache layer. Note that there could be intervening non-logging cache layers between logging cache layers. For example, if act 603 were performed by cache L2-A1, and if some L4 cache layer were to exist, the L3 cache layer could be non-logging while the L4 cache layer is logging.
If the determination from act 603 is that the second cache is not the outermost logging cache layer (i.e., the "no" branch from act 603), then method 600 proceeds to act 606, in which the second cache forwards the logging request to the next logging cache layer. Method 600 is then repeated at a cache in that layer. For example, if the second cache is cache L2-A1, it could forward the request to cache L3-A, which then repeats method 600. This can extend through however many logging cache layers exist. In some implementations, rather than forwarding the logging request to the next logging cache layer upon reaching act 606, the second cache could instead send one or more reply messages to the first cache, instructing it to send its logging request directly to the next logging cache layer.
On the other hand, if the determination from act 603 is that the second cache is the outermost logging cache layer (i.e., the "yes" branch from act 603), then method 600 proceeds to act 608, in which the second cache causes the inflow to be logged. As will be discussed later, the logging at act 608 could be performed by value or by reference (as appropriate), and the actual logging at act 608 could be performed at the current cache layer and/or at a lower cache layer.
Note that act 603 could be an optional act, depending on the computing environment in which method 600 is performed, as indicated by the broken lines in the decision block of act 603. For example, if the cache hierarchy includes only one upper cache layer that participates in logging (and that performs method 600), then that cache layer will always be the "outermost" logging cache layer, and act 603 may be unnecessary in such environments. Additionally, even when multiple logging cache layers exist, the outermost logging cache layer could have inherent knowledge that it is the outermost layer. In either case, a "no" determination at act 602 can simply proceed directly to act 608.
Returning to act 602, and following the "yes" branch (i.e., when a cache line exists in the second cache), method 600 reaches act 604, in which the second cache determines whether the cache line is logged. This determination can include determining whether the cache line was logged by the second cache itself, or whether the cache line was logged by some other cache with the second cache being aware of that logging. The manner in which the second cache determines that the cache line has been logged (and potentially by which processing unit(s)) can rely on any of the mechanisms described in connection with the first cache-logging embodiment (including, for example, the embodiments described in connection with FIGS. 4A, 4B, and 5). For example, the second cache could store accounting bits (i.e., flag bits, unit bits, and/or index bits) as described in connection with FIGS. 4A and 4B, the second cache could utilize way locking as described in connection with FIG. 5, and/or the second cache could store and rely on CCP data.
As will be discussed, if the cache line is not determined by the second cache to be logged, method 600 can include the second cache forwarding the request to the next logging cache layer (i.e., following the path to act 606) and/or causing the cache line to be logged (i.e., following the path to act 608). On the other hand, if the cache line is determined by the second cache to be logged, method 600 can include the second cache causing the cache line to be logged (i.e., following the path to act 608) when the second cache cannot determine whether the first cache knows the current value stored in the second cache's cache line, or the second cache determining that the cache line need not be logged (i.e., following the path to act 609) when it determines that the requesting processing unit does know the current value stored in the second cache's cache line.
For example, if the determination from act 604 is that the cache line is not determined by the second cache to be logged (i.e., the "no" branch from act 604), then at act 608 the second cache can cause the inflow to be logged, while potentially notifying any outer logging cache layer (if one exists) of the logging that occurred (i.e., act 607). If act 607 is performed, then causing the cache line to be logged at act 608, when the cache line is not determined by the second cache to be logged, can include determining that a third cache layer exists and notifying the third cache that the cache line has been logged by value by the second cache. Note that, in method 600, acts 607 and 608 can be performed in any order with respect to one another (including being performed in parallel). Note also that act 607 can cause method 600 to be performed at the next logging cache layer.
Alternatively, FIG. 6 shows that if the determination from act 604 is that the cache line is not determined by the second cache to be logged, the second cache could, at act 603, determine whether it is the outermost logging cache layer and, based on the outcome, either forward the request to the next logging cache layer at act 606 or log the inflow at act 608. In essence, these alternative paths convey that, on a "no" determination at act 604, the second cache can (i) log the inflow and notify the next logging cache layer (if one exists), so that the next layer is aware of the logging event for later use, and/or (ii) forward the request to the next logging cache layer, since an even higher layer may possess knowledge that allows the inflow to be logged by reference, or not logged at all.
Returning to act 604: if the second cache is aware that the cache line has been logged (i.e., the "yes" branch from act 604), then at act 605 the second cache determines whether the processing unit that caused the logging request has the current value of the cache line. It will be appreciated that the second cache could have a more recent value for the requested memory address than the first cache currently possesses. For example, if the first cache is cache L1-A1 and the second cache is cache L2-A1, it could be that cache L2-A1 has a newer value for the particular memory address (e.g., due to activity by processing unit A2) than the value whose read by processing unit A1 caused the logging request in act 601. If it can be positively known at the second cache that the first cache has the cache line's current value (i.e., the "yes" branch from act 605), then the second cache can choose at act 609 not to log anything (since the current value has already been logged). On the other hand, if it cannot be positively known at the second cache that the first cache has the cache line's current value (i.e., the "no" branch from act 605), then the second cache can cause the inflow to be logged at act 608, while potentially notifying any outer logging cache layer (if one exists) of the logging that occurred (i.e., act 607). Again, acts 607 and 608 can be performed in any order with respect to one another (including being performed in parallel).
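The decision flow of acts 602 through 609 traversed above might be sketched compactly as follows. This is a hedged illustration only: the `Cache`/`Line` structures, boolean predicates, and string return values are assumptions, not the patent's control logic:

```python
from dataclasses import dataclass

@dataclass
class Line:
    known_logged: bool = False                # act 604's question
    requestor_has_current_value: bool = False # act 605's question

class Cache:
    def __init__(self, lines, outermost):
        self.lines = lines                    # address -> Line
        self.is_outermost_logging_layer = outermost

    def handle_logging_request(self, address):
        line = self.lines.get(address)        # act 602
        if line is None:                      # "no" branch of act 602
            if self.is_outermost_logging_layer:   # act 603
                return "log"                  # act 608
            return "forward"                  # act 606
        if not line.known_logged:             # "no" branch of act 604
            return "log"                      # act 608 (and possibly 607)
        if line.requestor_has_current_value:  # "yes" branch of act 605
            return "no_log"                   # act 609
        return "log"                          # act 608 (and possibly 607)
```

A non-outermost cache forwards requests for unknown lines, logs lines it cannot confirm as logged, and skips logging only when it knows the requestor already has the logged current value.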
As mentioned, logging an inflow at act 608 can be performed by value or by reference, as appropriate. In general, an inflow is logged by value if its value cannot be obtained based on the trace (e.g., by replaying logged processor activity, or from a previously-logged cache line). An inflow can be logged by reference if its value can be obtained by replaying logged processor activity, or if its value is stored in a previously-logged cache line. It is noted that logging an inflow by value is always legal, even in situations in which logging by reference would also be legal. A determination could be made to log by value, for example, to save processing time during tracing, to create a trace that replays more easily, and so forth. Thus, it will be appreciated that act 608 could include causing the cache line to be logged based on logging the value at the particular memory address directly, and/or logging a reference to a previous log entry for the particular memory address.
One situation in which logging could be performed by reference in method 600 is when act 608 is reached via act 605 (i.e., the second cache knows the value has been logged, but it cannot be positively known that the requesting processing unit has the current value). Here, the logging at act 608 could be performed on behalf of the first cache by reference to the value that the second cache knows to be logged. For example, cache L1-A2 may have logged the current value, and cache L2-A1 may know this, so the inflow could be logged on behalf of cache L1-A1 by reference to L1-A2's logging.
In these situations, it could also make sense to log nothing at all, if the value could otherwise be recovered by replay code. For example, the current inflow caused by processing unit A1 may already have been logged in connection with prior activity by processing unit A2, making it possible at act 608 to log the current inflow by reference to A2's log. However, even if method 600 declines to log anything for A1 at this point, the trace can still be correct. Doing so reduces trace size, trading off against the need, during replay, to locate the value of this prior log entry within the other processing unit's trace.
The task of locating a previously-logged value at replay relies on being able to reproduce at least a partial ordering of events among the different logged processing units. Some content could be included in the trace to assist in locating previously-logged values. For example, logging cache evictions helps determine, at replay, that the value needed to satisfy A1's read is not available in cache L1-A1 (i.e., because it has been evicted); the value can then be searched for in the trace(s) of other processing unit(s). In another example, logging CCP data could likewise help determine that the value needed to satisfy A1's read is not available, or not current, in cache L1-A1, so the value can then be searched for in the trace(s) of other processing unit(s); note that CCP data can potentially indicate where to look for the current value. In another example, knowledge of the cache geometry could help locate the needed log entry. For example, it may be known that processing units A1 and A2 share the same L2 cache (i.e., L2-A1); it therefore could make sense to search A2's trace first for the needed log entry, as opposed to the traces of A3 and A4.
Another situation in which logging could be performed by reference is when act 608 is reached while method 600 is being performed at the current cache layer because a lower cache layer sent a notification to the current cache layer at act 607. Here, the lower cache layer will have logged the inflow (whether by value or by reference), so the current cache layer can log by reference to the lower cache layer's log.
As also mentioned, the logging at act 608 could be performed at the current cache layer or at a lower cache layer. For example, in some implementations, rather than the second cache layer performing the logging itself upon reaching act 608, it could send one or more reply messages back down to the first cache, instructing the first cache that the inflow should be logged and how to log it (i.e., by value or by reference and, if by reference, where the referenced log entry exists). The reply message(s) could also instruct the first cache how to set accounting bits, how to save CCP data, and so forth. Similarly, at act 609, the second cache layer could send a reply message to the first cache layer informing it that no logging is necessary. If the original logging request propagated through more than one logging cache layer, these reply messages could be propagated back down through those layers, or could be sent directly to the original requestor. In view of the foregoing, it will be appreciated that act 608 could include causing the cache line to be logged based on instructing the first cache to record the value at the particular memory address directly, or instructing the first cache to record a reference to a previous log entry for the particular memory address.
Regardless of how the logging is performed, act 608 can include the second cache layer setting any appropriate logging accounting bits (e.g., flag bits, unit bits, or index bits), or saving any appropriate CCP messages, so as to document the fact that the inflow was logged. As such, it will be appreciated that causing the cache line to be logged in act 608 can include marking the cache line as logged within the second cache (e.g., by appropriately setting accounting bits associated with the cache line).
In some embodiments, the logging at act 608 could include the second cache proactively notifying one or more lower cache layers that it has logged the cache line, and potentially how it logged the cache line. For example, if cache L2-A1 has performed logging at act 608, it could send one or more messages to one or more of caches L1-A2 through L1-A4 (i.e., L1 caches other than the one that initiated the logging request) informing them that it has logged the cache line. This information could include whether cache L2-A1 logged the cache line by value or by reference; if the cache line was logged by reference, cache L2-A1 could even send information about where the original log data exists. In response, these caches (e.g., one or more of L1-A2 through L1-A4) could store information recording the fact that cache L2-A1 logged the cache line (potentially including how it logged the cache line). This information could be stored, for example, in extra accounting bits within these L1 caches. That way, if one of these L1 caches later determines that it needs the cache line to be logged, it already knows the line has been logged, and can avoid sending a logging request to the upper cache layer(s) or asking the upper layer(s) how to log it.
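The downward notification just described might be sketched as follows (the notification table, message contents, and cache names here are illustrative assumptions):

```python
# Per-L1 record of lines known to be logged by an upper cache:
# L1 cache name -> {address: where the original log data exists}
known_logged = {}

def notify_logged(sibling_l1_caches, address, log_location):
    """Upper cache (e.g., L2-A1) proactively tells sibling L1 caches that it
    has logged a cache line, and where the log data lives."""
    for name in sibling_l1_caches:
        known_logged.setdefault(name, {})[address] = log_location

def needs_logging_request(l1_cache, address):
    """An L1 cache can skip sending an upward logging request if it already
    knows the line has been logged."""
    return address not in known_logged.get(l1_cache, {})

notify_logged(["L1-A2", "L1-A3", "L1-A4"], 0x1000, "logged by value at L2-A1")
```

After the notification, a later inflow of address 0x1000 at L1-A2 through L1-A4 requires no upward request, whereas other addresses (or other caches) still do.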
It is noted that any of the techniques discussed above in connection with the first embodiment for accomplishing delayed logging using dedicated portions of the buffer(s) 102e and/or the cache(s) 102b are also applicable to this second embodiment. As such, it will be appreciated that causing the cache line to be logged in act 608 could include logging the cache line within a trace buffer, such as a portion of the buffer(s) 102e and/or the cache(s) 102b.
While method 600 focuses on actions performed at an upper logging cache layer, FIGS. 7-9 illustrate some example methods that may be performed at a lower cache layer (e.g., the L1 cache layer that initiated the original logging request). In particular, while method 600 focuses on performing logging (including setting a logging state on a cache line to indicate that the cache line's value has been logged (e.g., setting an accounting bit associated with the cache line, storing CCP data, etc.)), these methods relate to clearing that logging state when the cache line's value is no longer logged at a later time.
A cache may contain a cache line whose logging state is set because the processing unit using the cache performed a memory read while logging was enabled for that processing unit. The cache may also receive a cache line from an upper-level cache with the logging state already set on the cache line. As mentioned above, a cache line may also have its logging state set because an upper-level cache has proactively notified the cache that the cache line has been logged. Typically, when a processing unit performs a write to a cache line while logging is disabled for that processing unit, the cache line's logging state is cleared.
First, FIG. 7 illustrates a flowchart of an example method 700 for managing the logging state of a cache line when a processing unit transitions between having logging enabled and having logging disabled. As with method 600, method 700 may be implemented in a microprocessor environment, such as the example environment of FIG. 2A. Generally, method 700 operates after a processing unit (e.g., A1) has operated with logging enabled, such that the cache it uses (e.g., L1-A1) now includes one or more cache lines that have been logged. If the processing unit writes to one of these logged cache lines, method 700 preserves or clears the logging state depending on whether logging is currently enabled or disabled for the processing unit.
Method 700 begins at act 701, in which a write to a cache line marked as logged is detected. In some embodiments, act 701 may include detecting a write to a cache line in the first cache that has its logging state set. For example, the first cache may be cache L1-A1. That cache may have a cache line that was previously marked as logged (e.g., by setting its accounting bits appropriately) based on a memory read by processing unit A1. For example, the cache line may correspond to the particular memory address discussed above in connection with method 600.
Next, method 700 includes an act 702, in which it is determined whether logging is enabled. In this context, act 702 determines whether logging is enabled for the processing unit associated with the first cache. In some embodiments, act 702 may include, based on detecting the write, determining whether logging is enabled for the particular processing unit. For example, control logic for cache L1-A1 may determine whether processing unit A1 has logging enabled. If logging is enabled (i.e., the "YES" branch from act 702), then the logging state for the cache line may be preserved at act 703. Thus, in some embodiments, act 703 may include preserving the logging state for the cache line based at least on logging being enabled for the particular processing unit.
Alternatively, if logging is disabled (i.e., the "NO" branch from act 702), the logging state for the cache line may be cleared at act 704. Thus, in some embodiments, act 704 may include clearing the logging state for the cache line based at least on logging being disabled for the particular processing unit. For example, cache L1-A1 may clear the accounting bits for the cache line as appropriate.
As shown, in addition to clearing the logging state, method 700 also includes an act 705 of notifying the next logging cache layer. In some embodiments, act 705 may include notifying at least one of the one or more second caches that the logging state for its copy of the cache line should be cleared, based at least on logging being disabled for the particular processing unit. For example, one of the second caches may be cache L2-A1, and cache L1-A1 may thus notify cache L2-A1 to clear the logging state on its copy of the cache line. Note that acts 704 and 705 may be performed in any order relative to each other (including in parallel).
Although not depicted in FIG. 6, method 600 may correspondingly include the particular second cache receiving a message from the first cache indicating that another cache line in the first cache that also corresponds to the memory address is marked as not logged within the first cache. Method 600 may also include marking the cache line as not logged within the particular second cache based on the message.
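The decision flow of method 700 (acts 701-705) can be sketched as follows. This is a minimal software illustration; the class and attribute names (`ParentCache`, `ChildCache`, `on_write`) are assumptions for exposition, not from the disclosure.

```python
# Illustrative sketch of method 700: on a write to a cache line whose logging
# state is set, the L1 cache preserves the state if logging is enabled for the
# processing unit (act 703); otherwise it clears the state (act 704) and tells
# the parent cache to clear its own state for the line (act 705).

class ParentCache:
    def __init__(self):
        self.logged = {}  # address -> logging state (bool)

    def clear_logging_state(self, address):
        self.logged[address] = False

class ChildCache:
    def __init__(self, parent):
        self.parent = parent
        self.logged = {}  # address -> logging state (bool)

    def on_write(self, address, logging_enabled):
        if not self.logged.get(address, False):
            return  # line not marked as logged; nothing to manage
        if logging_enabled:
            pass  # act 703: preserve the logging state
        else:
            self.logged[address] = False               # act 704: clear locally
            self.parent.clear_logging_state(address)   # act 705: notify parent

# Hypothetical usage: one write with logging enabled, one with it disabled.
parent = ParentCache()
child = ChildCache(parent)
parent.logged[0x80] = True
child.logged[0x80] = True
child.on_write(0x80, logging_enabled=True)
state_after_enabled = child.logged[0x80]   # preserved
child.on_write(0x80, logging_enabled=False)  # cleared in child and parent
```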
FIG. 8 illustrates a flowchart of an example method 800 for managing the logging state of a cache line when, with logging disabled, a cache receives the cache line exclusively (i.e., for writing) from a parent cache. As with methods 600 and 700, method 800 may be implemented in a microprocessor environment (such as the example environment of FIG. 2A). Generally, method 800 operates when a processing unit (e.g., A1) operates with logging disabled and its cache (e.g., L1-A1) obtains, for writing, a cache line from parent cache(s) (e.g., L2-A1 and/or L3-A) that may have the logging state set on that cache line.
Method 800 begins at act 801, in which a cache requests a cache line from an upper-level cache for writing, with logging disabled. For example, a cache miss may occur in cache L1-A1 based on a request from processing unit A1 to write to a particular memory address. As a result, cache L1-A1 may request a copy of the appropriate cache line, for writing, from cache L2-A1 or cache L3-A.
In some cases, cache L1-A1 may receive the cache line with its logging state cleared. As such, method 800 may include an act 802, in which the cache receives the cache line exclusively from the upper-level cache with the logging state cleared. The logging state may be cleared in the received cache line, for example, because (i) the logging state was not set in the upper-level cache, or (ii) the upper-level cache knows that logging is disabled at processing unit A1, so it clears the logging state when providing the cache line to cache L1-A1. Correspondingly, method 600 may include: receiving, from the first cache, a message requesting the cache line for writing, and sending the cache line to the first cache, the cache line being marked as not logged based at least on logging being disabled for the first cache.
In other cases, cache L1-A1 may receive the cache line with its logging state set. As such, method 800 may include an act 803, in which the cache receives the cache line exclusively from the upper-level cache with the logging state set. For example, the logging state may be set in the received cache line because the logging state was set in the upper-level cache.
Next, method 800 may include an act 804, in which the cache performs a write to the cache line while logging is still disabled. For example, cache L1-A1 may complete the original write request from processing unit A1 by writing the appropriate value to the cache line. Next, method 800 may include acts 805 and 806: in act 805, the logging state for the cache line is cleared, and in act 806, the upper-level cache is notified to clear its logging state for the cache line. Although act 805 is depicted separately from act 804, it is noted that clearing the logging state for the cache line may be a natural part of performing the write at act 804. For example, when logging is disabled, any write may cause the logging state of the written cache line to be cleared. As such, the arrow between acts 804 and 805 is shown as a dashed line to indicate that act 805 may be optional in nature. Act 806 may operate in a manner similar to act 705, discussed above in connection with method 700.
Similar to act 705 of method 700, when act 806 is performed, method 600 may correspondingly include the particular second cache receiving a message from the first cache indicating that another cache line in the first cache that also corresponds to the memory address is marked as not logged within the first cache. Method 600 may also include marking the cache line as not logged within the particular second cache based on the message.
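A minimal sketch of method 800's flow (acts 801-806) follows. The function and variable names are hypothetical illustrations; the dictionaries stand in for the per-line logging state of an L1 cache and its parent.

```python
# Illustrative sketch of method 800: an L1 cache obtains a cache line for
# writing while logging is disabled. Whether the line arrives with its logging
# state set (act 803) or cleared (act 802), performing the write with logging
# disabled leaves the state cleared (acts 804/805), and the upper-level cache
# is notified to clear its own state for the line (act 806).

def write_with_logging_disabled(l1_logged, parent_logged, address, incoming_logged):
    """l1_logged / parent_logged map address -> logging state (bool)."""
    # Acts 802/803: install the line as received from the parent.
    l1_logged[address] = incoming_logged
    # Acts 804/805: the write with logging disabled clears the logging state,
    # regardless of how the line arrived.
    l1_logged[address] = False
    # Act 806: notify the upper-level cache to clear its logging state too.
    parent_logged[address] = False

# Hypothetical usage: one line arrives logged, another arrives not logged;
# both end up not logged in the L1 cache and in the parent.
l1, parent = {}, {0x10: True}
write_with_logging_disabled(l1, parent, 0x10, incoming_logged=True)
write_with_logging_disabled(l1, parent, 0x18, incoming_logged=False)
```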
FIG. 9 illustrates a flowchart of an example method 900 for managing the logging state of a cache line when a processing unit writes to a cache line that the processing unit has taken in an "owned" CCP state. As with methods 600-800, method 900 may be implemented in a microprocessor environment, such as the example environment of FIG. 2A. Generally, method 900 operates when the CCP provides a state in which, during a period in which one processing unit has taken a cache line for writing, other processing unit(s) may still request the current value of the cache line. An example of such a state is the "owned" state in the MOESI CCP described earlier.
Method 900 begins at act 901, in which a cache modifies a cache line in an "owned" state, with logging disabled. For example, cache L1-A1 may hold a cache line that processing unit A1 has taken as "owned". During this time, processing unit A1 may perform a write to the cache line. As discussed in connection with FIG. 8, clearing the logging state for a cache line when logging is disabled may be a natural part of performing the write. As such, method 900 does not depict any express action for clearing the logging state, although some implementations may include one.
Based on act 901, method 900 illustrates that action(s) may be taken to communicate that the logging state of the cache line should also be cleared in other cache(s). In act 902, upon request, the cache notifies a peer cache to clear its logging state for the cache line. For example, after the write to the cache line owned by cache L1-A1 has been performed, cache L1-A1 may receive a request (e.g., a CCP message) from a peer cache, such as cache L1-A2, requesting the current value of the cache line. As a result of this request, cache L1-A1 may notify cache L1-A2 that the logging state on its corresponding cache line, if set, should be cleared. The notification may be sent with the CCP message conveying the current value of the cache line in cache L1-A1, or as part of a separate message.
In act 903, based on modifying the cache line, the cache notifies one or more peer caches to clear their logging state for the cache line. For example, after performing a write to an owned cache line in cache L1-A1, cache L1-A1 may broadcast a notification to its peer caches (e.g., L1-A2 through L1-A4) letting them know that the logging state for that cache line, if the line is present in those caches with the logging state set, should be cleared. Thus, while act 902 notifies a peer cache reactively, act 903 notifies the peer caches proactively.
In act 904, based on modifying the cache line, the cache notifies one or more upper-level cache layers to clear their logging state for the cache line. For example, after performing a write to an owned cache line in cache L1-A1, cache L1-A1 may send a notification to its parent cache(s) (such as cache L2-A1 and/or L3-A) so that they know that the logging state for the cache line, if the line is present in those caches with the logging state set, should be cleared. Thus, similar to act 903, act 904 performs a proactive notification, but this time to the parent cache(s) rather than the peer cache(s).
Notably, some implementations may perform more than one of acts 902-904. For example, one implementation may proactively notify the upper level cache(s) when a write is performed (i.e., act 904), but only reactively notify the peer cache(s) (i.e., act 902). In another example, an implementation may proactively notify the upper level cache(s) (i.e., act 904) and peer cache(s) (i.e., act 903).
Additionally, similar to act 705 of method 700 and act 806 of method 800, when act 903 is performed, method 600 may correspondingly include the particular second cache receiving a message from the first cache indicating that another cache line in the first cache corresponding to the memory address is marked as not logged within the first cache. Method 600 may also include marking the cache line as not logged within the particular second cache based on the message.
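The mixed notification policy of acts 902-904 can be sketched as follows. This is a hedged illustration only (`SimpleCache`, `OwningCache`, and the method names are assumptions): it proactively notifies parents on the write (act 904) but clears a peer's state only when that peer requests the value (act 902); act 903 would instead broadcast to peers at write time.

```python
# Illustrative sketch: after writing to an "owned" cache line with logging
# disabled, the owning cache clears parents' logging state proactively
# (act 904) and clears a peer's logging state reactively, when the peer
# requests the line's current value (act 902).

class SimpleCache:
    def __init__(self, name):
        self.name = name
        self.logged = {}  # address -> logging state (bool)

    def clear_logging_state(self, address):
        if address in self.logged:
            self.logged[address] = False

class OwningCache(SimpleCache):
    def __init__(self, name, peers, parents):
        super().__init__(name)
        self.peers = peers
        self.parents = parents
        self.values = {}

    def write_owned(self, address, value):
        self.values[address] = value
        self.logged[address] = False              # natural effect of the write
        for parent in self.parents:               # act 904: proactive, parents
            parent.clear_logging_state(address)

    def on_peer_read(self, peer, address):
        peer.clear_logging_state(address)         # act 902: reactive, on request
        return self.values[address]

# Hypothetical usage mirroring the example above (L1-A1 owns the line).
peer = SimpleCache("L1-A2")
parent = SimpleCache("L2-A1")
peer.logged[0x20] = True
parent.logged[0x20] = True
owner = OwningCache("L1-A1", [peer], [parent])

owner.write_owned(0x20, 42)
peer_state_before_read = peer.logged[0x20]  # peer not yet notified
value = owner.on_peer_read(peer, 0x20)      # peer cleared upon its request
```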
It is noted that logging influxes at a lower cache layer (e.g., L1), based on knowledge of upper-level cache(s) (e.g., L2, L3, etc.), which enables logging by reference and in some cases avoiding logging altogether, may provide several advantages. For example, the lower layer initiates the logging process only if data from a cache miss was actually consumed by the processing unit. This may avoid, for example, logging cache misses caused by speculative execution. Additionally, the lower layer may perform logging concurrently with the retirement of the instruction that caused the cache activity. This may result in traces that capture higher-precision timing. Finally, when logging at the lower layer, the logging can be based on virtual memory addressing rather than physical memory addressing, if desired. It is noted that if logging is based on virtual memory addressing, there may be situations in which multiple virtual addresses map to the same physical address. In these cases, the cache may not treat accesses to the same physical address through different virtual addresses as cache misses. If this occurs, the tracker 104a may log data from the TLB 102f. In some implementations, virtual or physical addresses may be further distinguished by additional identifiers (e.g., a virtual processor ID, a security setting for the memory address, etc.). In at least some of these implementations, the cache may treat accesses to the same address with different additional identifiers (or with higher, lower, or different security levels) as cache misses.
Thus, the embodiments herein provide mechanisms for recording bit-accurate "time travel" trace recordings based on using at least two tiers or layers of processor cache to trace the effects of execution across multiple processing units. Recording trace files in this manner may require only modest processor modifications, and, when compared to existing trace recording approaches, it may reduce both the performance impact of trace recording and the size of trace files by orders of magnitude.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims (15)
1. A microprocessor, comprising:
a plurality of processing units;
a plurality of caches arranged into a plurality of cache tiers, the plurality of caches including a plurality of first caches within a first cache tier and one or more second caches within a second cache tier, a particular second cache in the second cache tier serving as a backing store for at least a particular first cache in the first cache tier; and
control logic to configure at least the particular second cache to perform at least the following:
receiving a logging request referencing a particular memory address from the particular first cache; and
determining, based on the request, whether a cacheline corresponding to the memory address is present in the particular second cache, and
when the cacheline is not present in the particular second cache, performing one of:
causing the cache line to be logged when there is no third cache participating in logging and serving as a backing store for at least the particular second cache; or
forwarding the request to the third cache when the third cache does exist; or
When the cacheline is present in the particular second cache, performing at least one of:
causing the cache line to be logged when (i) the cache line is not determined by the particular second cache to be logged, or (ii) the cache line is determined by the particular second cache to be logged, but the particular second cache has not determined that the first cache knows the current value stored in the cache line of the particular second cache; or
determining that the cache line does not need to be logged when (i) the cache line is determined by the particular second cache to be logged and (ii) the first cache is determined to know the current value stored in the cache line of the particular second cache.
2. The microprocessor of claim 1, wherein causing the cache line to be logged comprises:
logging the cache line in a trace buffer; and
marking the cache line as being logged within the particular second cache.
3. The microprocessor of claim 1, wherein causing the cache line to be logged comprises one of:
instructing the first cache to directly log a value of the particular memory address; or
instructing the first cache to log the particular memory address by reference to a previous log entry.
4. The microprocessor of claim 1, wherein causing the cache line to be logged when the cache line is not determined by the particular second cache to be logged comprises:
determining that the third cache exists, and
notifying the third cache that the cache line has been logged by value by the particular second cache.
5. The microprocessor of claim 1, wherein the first cache layer comprises an L1 cache layer, and wherein the second cache layer comprises an L2 cache layer or an L3 cache layer.
6. The microprocessor of claim 1, the control logic further to configure at least the particular second cache to:
receiving a message from the first cache indicating that another cache line in the first cache that also corresponds to the memory address is marked as not logged within the first cache; and
marking, based on the message, the cacheline as not being logged within the particular second cache.
7. The microprocessor of claim 6, wherein the other cache line is marked as not being logged within the first cache based on the other cache line being written by a processing unit corresponding to the first cache with logging disabled for the processing unit.
8. The microprocessor of claim 1, the control logic further to configure at least the particular second cache to:
receiving a message from the first cache requesting the cache line to be written; and
sending the cache line to the first cache, the cache line marked as not logged based at least on disabling logging for the first cache.
9. A method of an upper cache layer determining how a lower cache layer logs inflow based on logging requests of the lower cache layer, the method implemented at a computing device comprising (i) a plurality of processing units, (ii) a plurality of caches arranged into a plurality of cache layers, the plurality of caches comprising a plurality of first caches within a first cache layer and one or more second caches within a second cache layer, a particular second cache in the second cache layer serving as a backing store for at least a particular first cache in the first cache layer, the method comprising:
receiving a logging request referencing a particular memory address from the particular first cache; and
determining, based on the request, whether a cacheline corresponding to the memory address is present in the particular second cache, and
when the cacheline is not present in the particular second cache, performing one of:
causing the cache line to be logged when there is no third cache participating in logging and serving as a backing store for at least the particular second cache; or
forwarding the request to the third cache when the third cache does exist; or
When the cacheline is present in the particular second cache, performing at least one of:
causing the cache line to be logged when (i) the cache line is not determined by the particular second cache to be logged, or (ii) the cache line is determined by the particular second cache to be logged, but the particular second cache has not determined that the first cache knows the current value stored in the cache line of the particular second cache; or
determining that the cache line does not need to be logged when (i) the cache line is determined by the particular second cache to be logged and (ii) the first cache is determined to know the current value stored in the cache line of the particular second cache.
10. The method of claim 9, wherein causing the cache line to be logged comprises:
logging the cache line in a trace buffer; and
marking the cache line as being logged within the particular second cache.
11. The method of claim 9, wherein causing the cache line to be logged comprises one of:
instructing the first cache to directly log a value of the particular memory address; or
instructing the first cache to log the particular memory address by reference to a previous log entry.
12. The method of claim 9, wherein causing the cache line to be logged when the cache line is not determined by the particular second cache to be logged comprises:
determining that the third cache exists, and
causing the cache line to be logged by reference based on knowledge of the third cache.
13. The method of claim 9, wherein causing the cache line to be logged when the cache line is not determined by the particular second cache to be logged comprises:
determining that the third cache exists, and
notifying the third cache that the cache line has been logged by value by the particular second cache.
14. The method of claim 9, wherein the first cache layer comprises an L1 cache layer, and wherein the second cache layer comprises an L2 cache layer or an L3 cache layer.
15. The method of claim 9, further comprising the particular second cache:
receiving a message from the first cache indicating that another cache line in the first cache that also corresponds to the memory address is marked as not logged within the first cache; and
marking, based on the message, the cacheline as not being logged within the particular second cache.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/904,072 | 2018-02-23 | ||
| US15/947,699 | 2018-04-06 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40033747A true HK40033747A (en) | 2021-04-16 |
| HK40033747B HK40033747B (en) | 2025-07-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111742301B (en) | Log cache inflows to higher level caches by request | |
| CN111742302B (en) | Track inflows to lower caches by logging entries in upper caches | |
| US10459824B2 (en) | Cache-based trace recording using cache coherence protocol data | |
| US10558572B2 (en) | Decoupling trace data streams using cache coherence protocol data | |
| CN111727425B (en) | Tracking by logging the inflow of upper-level shared caches and cache coherence protocol conversions between lower-level caches | |
| US12332809B2 (en) | Logging cache line lifetime hints when recording bit-accurate trace | |
| HK40033747A (en) | Logging cache influxes by request to a higher-level cache | |
| HK40031699A (en) | Trace recording by logging influxes to a lower-layer cache based on entries in an upper-layer cache | |
| CN116868172A (en) | Treating main memory as a set of tag cache lines tracking log records | |
| HK40019954A (en) | Cache-based trace recording using cache coherence protocol data |