US9268591B2 - Systems and methods for detecting system exceptions in guest operating systems - Google Patents
Systems and methods for detecting system exceptions in guest operating systems
- Publication number
- US9268591B2 (application US13/655,139)
- Authority
- US
- United States
- Prior art keywords
- guest
- system exception
- detector module
- module
- virtual machine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
Definitions
- Virtual machines, which are an abstraction of physical computing resources, may include a guest operating system that operates therein.
- Guest operating systems, like operating systems in non-virtualized environments, are susceptible to system exceptions, or crashes.
- Datacenters may include thousands, tens of thousands, or more virtual machines that are operating concurrently.
- Known methods for determining the status of a guest operating system include viewing the console of the virtual machine, analyzing operating system files, and executing user-space software tools within the guest operating system. Using such methods, administrators of large datacenters may be unable to continuously monitor the status of each operating system to ensure that operating systems are operating as expected. Accordingly, there is a need for improved methods and systems for detecting guest operating system crashes and reporting such crashes to administrators.
- a module is provided for each guest operating system and is capable of intercepting system exceptions within the guest operating system.
- the module communicates with a hypervisor to provide a current status of the guest operating system.
- the module may collect system exception information, such as a memory dump.
- FIG. 1 is an exemplary virtual infrastructure having a plurality of virtual machines and a plurality of guest operating systems.
- FIG. 2 is a swimlane diagram of an exemplary method for detecting system exceptions in the guest operating systems of FIG. 1 .
- Embodiments provided herein enable system exceptions to be caught within guest operating systems and reported to virtual infrastructure administrators.
- system exception includes any operating system exception, fault, error, false assertion or other condition which may result in the termination of any or all operating system processes, functionality, and/or interactivity, whether caused by hardware, software, or otherwise.
- system exceptions include kernel panics and the “blue screen of death.”
- FIG. 1 is an exemplary virtual infrastructure 100 having a plurality of virtual machines (VMs) 105 on physical computer systems, or hosts, 110 and 114 , collectively known as a cluster 116 .
- Each VM 105 provides a virtual environment wherein a guest operating system 118 may reside and operate.
- Each physical computer 110 and 114 includes hardware 120 , a virtualization software or manager 122 running on hardware 120 , and one or more VMs 105 executing on the hardware 120 by way of virtualization software 122 .
- the virtualization software 122 is therefore logically interposed between, and interfaces with, the hardware 120 and the VMs 105 .
- the virtualization software 122 may be implemented wholly or in part in hardware, e.g., as a system-on-a-chip, firmware, field programmable gate array (FPGA), etc.
- the hardware 120 includes at least one processor (not shown), wherein each processor is an execution unit, or “core,” on a microprocessor chip.
- the hardware 120 also includes a system memory (not shown), which is a general volatile random access memory (RAM), a network interface card (NIC) (not shown), a storage system (not shown), and other devices.
- the virtualization software 122 is sometimes referred to as a hypervisor, and includes software components for managing hardware resources and software components for virtualizing or emulating physical devices to provide virtual devices, such as virtual disks, virtual processors, virtual network interfaces, etc., for each VM 105 .
- each VM 105 is an abstraction of a physical computer system and may include an operating system (OS) 118 , such as Microsoft Windows®, and applications, which are referred to as the “guest OS” 118 and “guest applications,” respectively, wherein the term “guest” indicates that each is a software entity that resides within the VM.
- a Virtual Machine Management Server (VMMS) 125 provides a software interface 127 that, among other things, allows users and other programs to control the lifecycle of VMs 105 running on physical computers 110 and 114 that are managed by VMMS 125 .
- VMMS 125 may provide other VM management and manipulations than those specifically mentioned here.
- VMMS 125 may include products such as vCenter and VMware Service Manager, both available from VMware, Inc. of Palo Alto, Calif.
- the virtualization software 122 includes a query module 130 that may be implemented as a kernel-level module.
- the query module 130 is configured to communicate with each guest OS 118 associated with the virtualization software 122 . More particularly, the query module 130 is configured to communicate with a virtual machine crash detector module 135 , or detector module, that may be associated with each guest OS 118 .
- the detector module 135 may be implemented as a kernel-level module.
- the query module 130 is configured to determine the status of a guest OS 118 by communicating with the detector module 135 associated with that guest OS 118 .
- the query module 130 generates queries and transmits the queries to the detector module 135 .
- the queries may include a request for guest OS status and a request for crash information, among other things.
- an “ISALIVE” query may be transmitted by the query module 130 to request a current status of the guest OS 118 .
- a “GETCRASHINFO” query may be transmitted by the query module 130 to request information available from the guest OS 118 about a system exception.
- the query module 130 may transmit guest OS status requests at regular intervals, such as every 5, 10, 15, 30, or 60 seconds, or on demand.
- the query module 130 may transmit guest OS status requests to all or some of the guest OSes 118 associated with the virtualization software 122 .
- the query module 130 may transmit guest OS status requests using round-robin scheduling to two or more guest OSes 118 .
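The interval-based, round-robin polling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `round_robin_schedule`, the guest identifiers, and the choice to send exactly one "ISALIVE" query per interval are all assumptions made for the example.

```python
from itertools import cycle, islice


def round_robin_schedule(guest_ids, interval_s, n_queries):
    """Return (send_time_s, guest_id) pairs for the first n_queries ISALIVE queries.

    Assumption for this sketch: one query is sent every interval_s seconds,
    and guests are visited in a fixed cyclic order.
    """
    order = islice(cycle(guest_ids), n_queries)
    return [(i * interval_s, guest) for i, guest in enumerate(order)]


# Hypothetical guest names; a 5-second interval is one of the values in the text.
schedule = round_robin_schedule(["vm-a", "vm-b", "vm-c"], interval_s=5, n_queries=5)
for send_time, guest in schedule:
    print(f"t={send_time:>2}s  ISALIVE -> {guest}")
```

With three guests and a 5-second interval, each individual guest is polled every 15 seconds, which is the trade-off round-robin scheduling makes against polling every guest at every tick.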
- Each guest OS status may be transmitted to VMMS 125 to provide the status of guest OSes 118 within virtual infrastructure 100 .
- VMMS 125 may be used to cause the query module 130 to initiate a guest OS status request.
- VMMS 125 may be configured to store and update the current status of guest OSes 118 such that a user of VMMS 125 may determine which guest OSes 118 are not operational due to a system exception.
- VMMS 125 may be configured to alert the user, e.g., with a displayed message, an audible indicator, an email, etc., when a guest OS 118 has reported a system exception.
- the detector module 135 is configured to detect system exceptions. More particularly, the detector module 135 is configured to intercept or otherwise handle system exceptions. For example, the detector module 135 may intercept calls to a system-wide exception handler. In the exemplary embodiment, the detector module 135 is still operable after a system exception has been raised and can respond to queries from the query module 130 . Alternatively, if the detector module 135 is unable to respond to queries after a system exception, the query module 130 may interpret unanswered queries as an indication that the guest OS 118 has experienced a system exception. Unanswered queries may include queries that have not received a response in a pre-determined period of time, such as 500 ms.
- the detector module 135 may respond with an acknowledgement, such as “ACK”, to indicate that no system exception has occurred, or with a system exception indicator message, such as “CRASH”, to indicate that a system exception has occurred.
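One way the query side might interpret replies, including the unanswered-query case, is sketched below. The "ACK" and "CRASH" strings and the 500 ms timeout come from the text; the `GuestStatus` enum and the `interpret_reply` function are hypothetical names introduced only for illustration.

```python
from enum import Enum
from typing import Optional


class GuestStatus(Enum):
    ALIVE = "alive"
    CRASHED = "crashed"
    PRESUMED_CRASHED = "presumed crashed (query unanswered)"


TIMEOUT_S = 0.5  # the 500 ms example period given in the text


def interpret_reply(reply: Optional[str], elapsed_s: float,
                    timeout_s: float = TIMEOUT_S) -> GuestStatus:
    """Map a detector reply (or its absence) to a guest OS status."""
    # No reply, or a reply that arrived after the deadline, counts as unanswered.
    if reply is None or elapsed_s > timeout_s:
        return GuestStatus.PRESUMED_CRASHED
    if reply == "ACK":
        return GuestStatus.ALIVE
    if reply == "CRASH":
        return GuestStatus.CRASHED
    raise ValueError(f"unexpected reply: {reply!r}")


print(interpret_reply("ACK", 0.02))
print(interpret_reply("CRASH", 0.10))
print(interpret_reply(None, 0.50))
```

Keeping "presumed crashed" distinct from "crashed" lets a caller tell an explicit report from the detector apart from an inference drawn from silence.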
- the query module 130 may request system exception information from the detector module 135 .
- the detector module 135 may collect system exception information, such as memory dumps, system logs, stack traces, etc. After the system exception information has been collected by the detector module 135 , the detector module 135 may respond to the system exception information request with a message, such as “SENDCRASHINFO”, and the collected system exception information.
- the detector module 135 may pass the system exception back to the guest OS for processing. For example, the detector module 135 may invoke a system exception handler for routine processing of the system exception. By intercepting the system exception and not allowing the system exception handler to be invoked, the detector module 135 may be able to operate within the guest OS 118 even after a system exception has caused the guest OS 118 to halt execution of one or more processes.
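The intercept-then-forward pattern can be illustrated with a user-space analogy. The sketch below is not kernel code and is not the detector module itself: it uses Python's `sys.excepthook` as a stand-in for a system-wide exception handler, and the `ExceptionInterceptor` class and the list-appending "system handler" are assumptions made for the example.

```python
import sys


class ExceptionInterceptor:
    """Holds an exception before the original top-level handler sees it."""

    def __init__(self):
        self.pending = None         # (type, value, traceback) being held, if any
        self.crashed = False        # stays True once an exception was intercepted
        self._original_hook = None

    def install(self):
        self._original_hook = sys.excepthook
        sys.excepthook = self._intercept

    def uninstall(self):
        sys.excepthook = self._original_hook

    def _intercept(self, exc_type, exc, tb):
        # Do not invoke the original handler yet; just remember the exception.
        self.pending = (exc_type, exc, tb)
        self.crashed = True

    def status(self):
        return "CRASH" if self.crashed else "ACK"

    def pass_to_original_handler(self):
        """Forward the held exception for routine processing."""
        if self.pending is not None:
            pending, self.pending = self.pending, None
            self._original_hook(*pending)


# Demo with a stand-in "system handler" that only records what it receives.
handled = []
previous_hook = sys.excepthook
sys.excepthook = lambda exc_type, exc, tb: handled.append(str(exc))

interceptor = ExceptionInterceptor()
interceptor.install()
try:
    raise RuntimeError("simulated kernel panic")
except RuntimeError:
    sys.excepthook(*sys.exc_info())  # simulate the exception reaching the hook

status_while_held = interceptor.status()
handled_while_held = list(handled)   # the stand-in handler has not run yet
interceptor.pass_to_original_handler()
interceptor.uninstall()
sys.excepthook = previous_hook       # restore the interpreter's real hook
```

While the exception is held, the interceptor can still answer status queries; the original handler runs only when `pass_to_original_handler` is called, mirroring the ordering described above.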
- a communication channel 140 exists between the query module 130 and the detector module 135 to enable communication between the virtualization software 122 and the guest OS 118 .
- the communication channel 140 enables direct communication that may continue even after the guest OS 118 experiences a system exception.
- the communication channel 140 may be implemented as an application programming interface (API) that provides calls and/or protocols for exchanging information between the virtualization software 122 and the guest OS 118 .
- the communication channel 140 may be implemented using a Virtual Machine Communication Interface (VMCI) or using VMCI Sockets, both available from VMware, Inc. of Palo Alto, Calif.
- VMCI provides a communications API similar to Berkeley UNIX sockets and Windows sockets for transmitting datagrams and/or sharing memory.
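Because the text compares VMCI to Berkeley sockets carrying datagrams, a query/response exchange can be sketched with an ordinary local datagram socket pair. This does not use VMCI or any VMware API; `socket.socketpair` with `AF_UNIX` and `SOCK_DGRAM` is a stand-in that assumes a Unix-like host, and the helper names are hypothetical.

```python
import socket

# Stand-in for communication channel 140: a connected pair of datagram sockets.
query_end, detector_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
query_end.settimeout(0.5)     # 500 ms, matching the example timeout in the text
detector_end.settimeout(0.5)


def detector_serve_once(sock, crashed):
    """Answer a single ISALIVE datagram on the detector side."""
    request = sock.recv(1024)
    if request == b"ISALIVE":
        sock.send(b"CRASH" if crashed else b"ACK")


def exchange(crashed):
    """Send one ISALIVE query and return the reply, or None if unanswered."""
    query_end.send(b"ISALIVE")
    detector_serve_once(detector_end, crashed)
    try:
        return query_end.recv(1024).decode()
    except socket.timeout:
        return None


replies = [exchange(crashed=False), exchange(crashed=True)]
print(replies)

query_end.close()
detector_end.close()
```

Both ends run in one thread here only because the kernel buffers each datagram; in a real deployment the two ends would live in the hypervisor and the guest respectively.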
- UNIX is a registered trademark of The Open Group.
- the query module 130 transmits a guest OS status request message to the detector module 135 .
- the detector module 135 responds with a guest OS status message, which is generally an acknowledgement that indicates that no system exception has occurred in the guest OS 118 .
- the query module 130 continues to transmit guest OS status request messages to the detector module 135 as long as acknowledgements are being returned.
- the detector module 135 intercepts the system exception.
- the detector module 135 waits until the query module 130 transmits the next guest OS status request message, at which time the detector module 135 transmits a system exception indicator message.
- the detector module 135 may transmit the system exception indicator message to the query module 130 regardless of whether a guest OS status request message has been received. In other words, rather than waiting to be polled, the detector module 135 may push a system exception indicator message to the query module 130 .
- the query module 130 transmits a system exception information request to the detector module 135 .
- the detector module 135 , in response to the system exception information request, causes system exception information to be collected. Once the information is at least partially collected, the detector module 135 transmits a response to the system exception information request.
- the response may include a message and the collected system exception information.
- the detector module 135 passes the system exception to the system exception handler for routine processing. Once the system exception handler has received the system exception, the guest OS 118 may halt one or more processes and may become unresponsive.
- the query module 130 transmits a message to VMMS 125 indicating that a system exception has been detected. The system exception information may also be transmitted to VMMS 125 .
- the query module 130 may store the system exception information in a data store 145 that is accessible to VMMS 125 and transmit the location of the system exception information with the system exception notification message.
- the data store 145 may be a network attached storage device, a network resource shared by the virtualization software 122 , a virtual storage device in a guest OS, a database, etc.
- VMMS 125 alerts a user of the system exception and the availability of system exception information, if applicable. The user may then use VMMS 125 to restart the VM 105 associated with the guest OS 118 that experienced a system exception.
- the virtual infrastructure 100 is capable of detecting and reporting system exceptions, or crashes, within guest OSes 118 .
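The detector-side sequence described above (acknowledge while healthy, intercept, report on the next poll, collect information on request, then release the exception to the handler) can be sketched as a small state machine. The message strings come from the text; the `Detector` class, its method names, and the injected `collect_info` and `forward_to_handler` callables are hypothetical.

```python
class Detector:
    """Toy model of the detector module's message handling."""

    def __init__(self, collect_info, forward_to_handler):
        self.exception = None            # intercepted system exception, if any
        self._collect = collect_info     # gathers dumps, logs, traces, etc.
        self._forward = forward_to_handler

    def on_system_exception(self, description):
        # Intercept: hold the exception rather than letting the handler run.
        self.exception = description

    def handle(self, query):
        """Return a (message, payload) reply to a query from the query module."""
        if query == "ISALIVE":
            return ("ACK", None) if self.exception is None else ("CRASH", None)
        if query == "GETCRASHINFO" and self.exception is not None:
            return ("SENDCRASHINFO", self._collect(self.exception))
        raise ValueError(f"cannot answer {query!r} in the current state")

    def release(self):
        # Called after the reply has been sent; the guest may halt afterwards.
        self._forward(self.exception)


handled = []
detector = Detector(collect_info=lambda exc: {"reason": exc},
                    forward_to_handler=handled.append)

trace = [detector.handle("ISALIVE")]
detector.on_system_exception("kernel panic")
trace.append(detector.handle("ISALIVE"))
trace.append(detector.handle("GETCRASHINFO"))
detector.release()

for message, payload in trace:
    print(message, payload)
```

Separating `handle` from `release` keeps the ordering explicit: the crash information leaves the guest before the system exception handler is allowed to run.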
- FIG. 2 is a swimlane diagram of an exemplary method 200 for detecting system exceptions in the guest OSes 118 in the virtual infrastructure 100 shown in FIG. 1 .
- the virtualization software 122 may provide the detector module 135 to the guest OS 118 .
- the detector module 135 may be provided to the guest OS 118 as a kernel module that may be inserted into a running kernel in the guest OS 118 .
- the virtualization software 122 may provide a detector module that is compatible with the guest OS 118 .
- the detector module 135 may be provided as part of a collection of software, drivers, modules, and other tools for providing additional functionality to the guest OS 118 within a virtual environment, such as virtual machine 105 (shown in FIG. 1 ).
- the query module 130 transmits a guest OS status request message to the detector module 135 .
- the guest OS status request message, and other messages and data, may be transmitted to and from the detector module 135 using the communication channel 140 (shown in FIG. 1 ).
- the detector module 135 responds to the query module 130 with an acknowledgement that indicates that no system exception has occurred in the guest OS 118 . Operation 206 and operation 209 may be repeated together any number of times until a system exception occurs in the guest OS 118 .
- the guest OS 118 experiences a system exception and raises the system exception, which is intercepted by the detector module 135 .
- the detector module 135 receives, or catches, the system exception from the guest OS 118 .
- the query module 130 transmits a guest OS status request message to the detector module 135 .
- the detector module 135 , in operation 218 , responds with a system exception indicator message that indicates that the guest OS 118 has experienced a system exception.
- the query module 130 may, in operation 221 , transmit a system exception information request message.
- the detector module 135 may collect system exception information, which may include memory dumps, system logs, stack traces, etc., from the guest OS 118 and/or the VM environment.
- the detector module 135 transmits a system exception information message and the collected system exception information.
- the detector module 135 may, in operation 230 , pass the system exception back to the guest OS 118 for routine processing. For example, the detector module 135 may invoke, within the guest OS 118 , a system exception handler that would have received the system exception if the detector module 135 had not intercepted the system exception.
- the query module 130 , having received the system exception indicator message and/or the system exception information, may, in operation 233 , directly or through operation of the virtualization software 122 , transmit a system exception notification message to VMMS 125 .
- the query module 130 may store the collected system exception information in the data store 145 (shown in FIG. 1 ).
- the system exception notification message may include details about the system exception, including the name of the associated VM, the guest OS type, the contents of the system exception information, the location of the system exception information (in the case where the query module 130 stored the system exception information in the data store 145 ), and other details relating to the system exception and/or the guest OS 118 .
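A notification that carries the location of stored system exception information, rather than the information itself, might look like the sketch below. The field names, the JSON file format, the file naming scheme, and the use of a temporary directory as a stand-in for data store 145 are all assumptions made for illustration.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SystemExceptionNotification:
    vm_name: str
    guest_os_type: str
    info_location: Optional[str] = None   # where the collected info was stored
    info: Optional[dict] = None           # or the info itself, if sent inline


def store_and_notify(vm_name, guest_os_type, info, data_store_dir):
    """Write the collected info to the data store and build the notification."""
    path = Path(data_store_dir) / f"{vm_name}-crashinfo.json"
    path.write_text(json.dumps(info))
    return SystemExceptionNotification(vm_name, guest_os_type,
                                       info_location=str(path))


# A temporary directory stands in for a shared data store in this demo.
with tempfile.TemporaryDirectory() as data_store:
    note = store_and_notify("vm-a", "Linux", {"reason": "kernel panic"}, data_store)
    stored = json.loads(Path(note.info_location).read_text())

print(note.vm_name, note.guest_os_type, stored)
```

Sending a location keeps the notification small when the collected information includes something as large as a memory dump.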
- VMMS 125 may alert a user of VMMS 125 in operation 236 .
- the alert may be displayed as a message within a console or other user interface of VMMS 125 , such as software interface 127 .
- the alert may indicate which guest OS 118 generated the system exception, the presence or absence of system exception information, the host associated with the guest OS 118 , and/or the type of guest OS.
- the alert may also indicate a virtual machine name associated with the guest OS 118 , and/or other information about the guest OS 118 available from VMMS 125 .
- a computer or computing device may include one or more processors or processing units, system memory, and some form of computer readable media.
- Exemplary computer readable media include flash memory drives, digital versatile discs (DVDs), compact discs (CDs), floppy disks, and tape cassettes.
- computer readable media comprise computer storage media and communication media.
- Computer storage media store information such as computer readable instructions, data structures, program modules, or other data.
- Communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
- embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations.
- Examples of well known computing systems, environments, and/or configurations that may be suitable for use with aspects of the invention include, but are not limited to, mobile computing devices, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, gaming consoles, microprocessor-based systems, set top boxes, programmable consumer electronics, mobile telephones, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
- Embodiments of the invention may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices.
- the computer-executable instructions may be organized into one or more computer-executable components or modules.
- program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types.
- aspects of the invention may be implemented with any number and organization of such components or modules. For example, aspects of the invention are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other embodiments of the invention may include different computer-executable instructions or components having more or less functionality than illustrated and described herein.
- aspects of the invention transform a general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Claims (18)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/655,139 US9268591B2 (en) | 2012-10-18 | 2012-10-18 | Systems and methods for detecting system exceptions in guest operating systems |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140115575A1 (en) | 2014-04-24 |
US9268591B2 (en) | 2016-02-23 |
Family
ID=50486580
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/655,139 Active 2033-07-04 US9268591B2 (en) | 2012-10-18 | 2012-10-18 | Systems and methods for detecting system exceptions in guest operating systems |
Country Status (1)
Country | Link |
---|---|
US (1) | US9268591B2 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103995759B (en) * | 2014-05-21 | 2015-04-29 | 中国人民解放军国防科学技术大学 | High-availability computer system failure handling method and device based on core internal-external synergy |
CN106155762A (en) * | 2015-04-14 | 2016-11-23 | 中兴通讯股份有限公司 | A kind of method, device and virtual management center managing virtual machine state |
CN107480033B (en) * | 2016-06-08 | 2020-12-11 | 阿里巴巴集团控股有限公司 | Virtual machine blue screen detection method and equipment |
CN108874610A (en) * | 2017-05-12 | 2018-11-23 | 精英电脑股份有限公司 | Automatic image monitoring method |
US10430261B2 (en) * | 2017-08-15 | 2019-10-01 | Vmware, Inc. | Detecting a guest operating system crash on a virtual computing instance |
CN112835781B (en) * | 2019-11-25 | 2024-09-13 | 上海哔哩哔哩科技有限公司 | Abnormality detection method and device for operation function |
CN111782431B (en) * | 2020-06-22 | 2025-03-11 | 深圳乐信软件技术有限公司 | A method, device, terminal and storage medium for handling abnormalities |
CN113849251B (en) * | 2020-06-28 | 2024-10-18 | 中兴通讯股份有限公司 | Virtual cloud desktop monitoring method, client, server and storage medium |
US11966767B2 (en) * | 2020-10-28 | 2024-04-23 | Dell Products, L.P. | Enabling dial home service requests from an application executing in an embedded environment |
JP7451438B2 (en) * | 2021-01-22 | 2024-03-18 | 株式会社東芝 | Communication devices, communication systems, notification methods and programs |
US12388765B2 (en) * | 2022-06-29 | 2025-08-12 | Microsoft Technology Licensing, Llc | Transmit side scaling and alignment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080270825A1 (en) * | 2007-04-30 | 2008-10-30 | Garth Richard Goodson | System and method for failover of guest operating systems in a virtual machine environment |
US20110205585A1 (en) * | 2010-02-22 | 2011-08-25 | Canon Kabushiki Kaisha | Image processing system, image processing system control method, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
US20140115575A1 (en) | 2014-04-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9268591B2 (en) | Systems and methods for detecting system exceptions in guest operating systems | |
US11068309B2 (en) | Per request computer system instances | |
US10353725B2 (en) | Request processing techniques | |
US9098609B2 (en) | Health monitoring of applications in a guest partition | |
US11093270B2 (en) | Fast-booting application image | |
EP2733611B1 (en) | Internal fault handling method, device and system for virtual machine | |
US8365020B2 (en) | Mechanism for saving crash dump files of a virtual machine on a designated disk | |
US10061631B2 (en) | Detecting unresponsiveness of a process | |
US8595564B2 (en) | Artifact-based software failure detection | |
US10430261B2 (en) | Detecting a guest operating system crash on a virtual computing instance | |
TWI544328B (en) | Method and system for probe insertion via background virtual machine | |
US12019505B2 (en) | Event-based diagnostic information collection | |
US20160188361A1 (en) | Systems and methods for determining desktop readiness using interactive measures | |
US11050768B1 (en) | Detecting compute resource anomalies in a group of computing resources | |
US20250061187A1 (en) | Continual backup verification for ransomware detection and recovery | |
US12130695B2 (en) | Collecting crash-related information for a secure workspace |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: VMWARE, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: CHAND, AVAKASH PREM; NAGARAJ, LAXMISHA; REEL/FRAME: 029154/0296. Effective date: 20121018 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |
AS | Assignment | Owner name: VMWARE LLC, CALIFORNIA. Free format text: CHANGE OF NAME; ASSIGNOR: VMWARE, INC.; REEL/FRAME: 067102/0395. Effective date: 20231121 |