Index: linux-2.6.16-rc5/Documentation/vmi_spec.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/vmi_spec.txt	2006-03-09 23:33:29.000000000 -0800
+++ linux-2.6.16-rc5/Documentation/vmi_spec.txt	2006-03-10 12:55:29.000000000 -0800
@@ -0,0 +1,2197 @@
+
+ Paravirtualization API Version 2.0
+
+ Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam
+ Copyright (C) 2005, 2006, VMware, Inc.
+ All rights reserved
+
+Revision history:
+ 1.0: Initial version
+ 1.1: arai 2005-11-15
+ Added SMP-related sections: AP startup and Local APIC support
+ 1.2: dhecht 2006-02-23
+ Added Time Interface section and Time related VMI calls
+
+Contents
+
+1) Motivations
+2) Overview
+ Initialization
+ Privilege model
+ Memory management
+ Segmentation
+ Interrupt and I/O subsystem
+ IDT management
+ Transparent Paravirtualization
+ 3rd Party Extensions
+ AP Startup
+ State Synchronization in SMP systems
+ Local APIC Support
+ Time Interface
+3) Architectural Differences from Native Hardware
+4) ROM Implementation
+ Detection
+ Data layout
+ Call convention
+ PCI implementation
+
+Appendix A - VMI ROM low level ABI
+Appendix B - VMI C prototypes
+Appendix C - Sensitive x86 instructions
+
+
+1) Motivations
+
+ There are several high level goals which must be balanced in designing
+ an API for paravirtualization. The most general concerns are:
+
+ Portability - it should be easy to port a guest OS to use the API
+ High performance - the API must not obstruct a high performance
+ hypervisor implementation
+ Maintainability - it should be easy to maintain and upgrade the guest
+ OS
+ Extensibility - it should be possible for future expansion of the
+ API
+
+ Portability.
+
+ The general approach to paravirtualization rather than full
+ virtualization is to modify the guest operating system. This means
+ there is implicitly some code cost to port a guest OS to run in a
+ paravirtual environment. The closer the API resembles a native
+ platform which the OS supports, the lower the cost of porting.
+ Rather than provide an alternative, high level interface for this
+ API, the approach is to provide a low level interface which
+ encapsulates the sensitive and performance critical parts of the
+ system. Thus, we have direct parallels to most privileged
+ instructions, and the process of converting a guest OS to use these
+ instructions is in many cases a simple replacement of one function
+ for another. Although this is sufficient for CPU virtualization,
+ performance concerns have forced us to add additional calls for
+ memory management, and notifications about updates to certain CPU
+ data structures. Support for this in the Linux operating system has
+ proved to be very minimal in cost because of the already somewhat
+ portable and modular design of the memory management layer.
+
+ High Performance.
+
+ Providing a low level API that closely resembles hardware does not
+ provide any support for compound operations; indeed, typical
+ compound operations on hardware can be updating of many page table
+ entries, flushing system TLBs, or providing floating point safety.
+ Since these operations may require several privileged or sensitive
+ operations, it becomes important to defer some of these operations
+ until explicit flushes are issued, or to provide higher level
+ operations around some of these functions. In order to keep with
+ the goal of portability, this has been done only when deemed
+ necessary for performance reasons, and we have tried to package
+ these compound operations into methods that are typically used in
+ guest operating systems. In the future, we envision that additional
+ higher level abstractions will be added as an adjunct to the
+ low-level API. These higher level abstractions will target large
+ bulk operations such as creation and destruction of address spaces,
+ context switches, thread creation and control.
+
+ Maintainability.
+
+ In the course of development with a virtualized environment, it is
+ not uncommon for support of new features or higher performance to
+ require radical changes to the operation of the system. If these
+ changes are visible to the guest OS in a paravirtualized system,
+ this will require updates to the guest kernel, which presents a
+ maintenance problem. In the Linux world, the rapid pace of
+ development on the kernel means new kernel versions are produced
+ every few months. This rapid pace is not always appropriate for end
+ users, so it is not uncommon to have dozens of different versions of
+ the Linux kernel in use that must be actively supported. To keep
+ this many versions in sync with potentially radical changes in the
+ paravirtualized system is not a scalable solution. To reduce the
+ maintenance burden as much as possible, while still allowing the
+ implementation to accommodate changes, the design provides a stable
+ ABI with semantic invariants. The underlying implementation of the
+ ABI and details of what data or how it communicates with the
+ hypervisor are not visible to the guest OS. As a result, in most
+ cases, the guest OS need not even be recompiled to work with a newer
+ hypervisor. This allows performance optimizations, bug fixes,
+ debugging, or statistical instrumentation to be added to the API
+ implementation without any impact on the guest kernel. This is
+ achieved by publishing a block of code from the hypervisor in the
+ form of a ROM. The guest OS makes calls into this ROM to perform
+ privileged or sensitive actions in the system.
+
+ Extensibility.
+
+ In order to provide a vehicle for new features, new device support,
+ and general evolution, the API uses feature compartmentalization
+ with controlled versioning. The API is split into sections, with
+ each section having independent versions. Each section has a top
+ level version which is incremented for each major revision, with a
+ minor version indicating incremental level. Version compatibility
+ is based on matching the major version field, and changes of the
+ major version are assumed to break compatibility. This allows
+ accurate matching of compatibility. In the event of incompatible
+ API changes, multiple APIs may be advertised by the hypervisor if it
+ wishes to support older versions of guest kernels. This provides
+ the most general forward / backward compatibility possible.
+ Currently, the API has a core section for CPU / MMU virtualization
+ support, with additional sections provided for each supported device
+ class.
+
+2) Overview
+
+ Initialization.
+
+ Initialization is done with a bootstrap loader that creates
+ the "start of day" state. This is a known state, running 32-bit
+ protected mode code with paging enabled. The guest has all the
+ standard structures in memory that are provided by a native ROM
+ boot environment, including a memory map and ACPI tables. For
+ the native hardware, this bootstrap loader can be run before
+ the kernel code proper, and this environment can be created
+ readily from within the hypervisor for the virtual case. At
+ some point, the bootstrap loader or the kernel itself invokes
+ the initialization call to enter paravirtualized mode.
+
+ Privilege Model.
+
+ The guest kernel must be modified to run at a dynamic privilege
+ level, since if entry to paravirtual mode is successful, the kernel
+ is no longer allowed to run at the highest hardware privilege level.
+ On the IA-32 architecture, this means the kernel will be running at
+ CPL 1-2, with the hypervisor running at CPL 0 and user code at
+ CPL 3. The IOPL will be lowered as well to avoid giving the guest
+ direct access to hardware ports and control of the interrupt flag.
+
+ This change causes certain IA-32 instructions to become "sensitive",
+ so additional support for clearing and setting the hardware
+ interrupt flag is present. Since the switch into paravirtual mode
+ may happen dynamically, the guest OS must not rely on testing for a
+ specific privilege level by checking the RPL field of segment
+ selectors, but should check for privileged execution by performing
+ an (RPL != 3 && !EFLAGS_VM) comparison. This means the DPL of kernel
+ ring descriptors in the GDT or LDT may be raised to match the CPL of
+ the kernel. This change is visible by inspecting the segment
+ registers while running in privileged code, and by using the LAR
+ instruction.
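+
+ As an illustration, a guest might replace a hard-coded RPL test with
+ a check along these lines (a minimal sketch; the helper name and the
+ flag constant are illustrative, not part of this API):
+
+     #define EFLAGS_VM 0x00020000
+
+     static inline int vmi_in_kernel(VMI_UINT32 cs, VMI_UINT32 eflags)
+     {
+         /* Privileged if not user RPL and not virtual-8086 mode */
+         return ((cs & 3) != 3) && !(eflags & EFLAGS_VM);
+     }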
+
+ The system also cannot be allowed to write directly to the hardware
+ GDT, LDT, IDT, or TSS, so these data structures are maintained by the
+ hypervisor, and may be shadowed or guest visible structures. These
+ structures are required to be page aligned to support non-shadowed
+ operation.
+
+ Currently, the system only provides for two guest security domains,
+ kernel (which runs at the equivalent of virtual CPL-0), and user
+ (which runs at the equivalent of virtual CPL-3, with no hardware
+ access). Typically, this is not a problem, but if a guest OS relies
+ on using multiple hardware rings for privilege isolation, this
+ interface would need to be expanded to support that.
+
+ Memory Management.
+
+ Since a virtual machine typically does not have access to all the
+ physical memory on the machine, there is a need to redefine the
+ physical address space layout for the virtual machine. The
+ spectrum of possibilities ranges from presenting the guest with
+ a view of a physically contiguous memory of a boot-time determined
+ size, exactly what the guest would see when running on hardware, to
+ the opposite, which presents the guest with the actual machine pages
+ which the hypervisor has allocated for it. Using this approach
+ requires the guest to obtain information about the pages it has
+ from the hypervisor; this can be done by using the memory map which
+ would normally be passed to the guest by the BIOS.
+
+ The interface is designed to support either mode of operation.
+ This allows the implementation to use either direct page tables
+ or shadow page tables, or some combination of both. All writes to
+ page table entries are done through calls to the hypervisor
+ interface layer. The guest notifies the hypervisor about page
+ table updates, flushes, and invalidations through API calls.
+
+ The guest OS is also responsible for notifying the hypervisor about
+ which pages in its physical memory are going to be used to hold page
+ tables or page directories. Both PAE and non-PAE paging modes are
+ supported. When the guest is finished using pages as page tables, it
+ should release them promptly to allow the hypervisor to free the
+ page table shadows. Using a page as both a page table and a page
+ directory for linear page table access is possible, but currently
+ not supported by our implementation.
+
+ The hypervisor lives concurrently in the same address space as the
+ guest operating system. Although this is not strictly necessary on
+ IA-32 hardware, performance would be severely degraded if that were
+ not the case. The hypervisor must therefore reserve some portion of
+ linear address space for its own use. The implementation currently
+ reserves the top 64 megabytes of linear space for the hypervisor.
+ This requires the guest to relocate any data in high linear space
+ down by 64 megabytes. For non-paging mode guests, this means the
+ high 64 megabytes of physical memory should be reserved. Because
+ page tables are not sensitive to CPL, only to user/supervisor level,
+ the hypervisor must rely on segment protection as well to ensure
+ that the guest cannot access this 64 megabyte region.
+
+ An experimental patch is available to enable boot-time sizing of
+ the hypervisor hole.
+
+ Segmentation.
+
+ The IA-32 architecture provides segmented virtual memory, which can
+ be used as another form of privilege separation. Each segment
+ contains a base, limit, and properties. The base is added to the
+ virtual address to form a linear address. The limit determines the
+ length of linear space which is addressable through the segment.
+ The properties determine read/write, code and data size of the
+ region, as well as the direction in which segments grow. Segments
+ are loaded from descriptors in one of two system tables, the GDT or
+ the LDT, and the values loaded are cached until the next load of the
+ segment. This property, known as segment caching, allows the
+ machine to be put into a non-reversible state by writing over the
+ descriptor table entry from which a segment was loaded. There is no
+ efficient way to extract the base field of the segment after it is
+ loaded, as it is hidden by the processor. In a hypervisor
+ environment, the guest OS can be interrupted at any point in time by
+ interrupts and NMIs which must be serviced by the hypervisor. The
+ hypervisor must be able to recreate the original guest state when it
+ is done servicing the external event.
+
+ To avoid creating non-reversible segments, the hypervisor will
+ forcibly reload any live segment registers that are updated by
+ writes to the descriptor tables. *N.B - in the event that a segment
+ is put into an invalid or not present state by an update to the
+ descriptor table, the segment register must be forced to NULL so
+ that reloading it will not cause a general protection fault (#GP)
+ when restoring the guest state. This may require the guest to save
+ the segment register value before issuing a hypervisor API call
+ which will update the descriptor table.*
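+
+ For example, a guest that keeps a live value in %fs might bracket a
+ descriptor table update as sketched below (illustrative only; the
+ surrounding variables are assumed context, and VMI_WriteGDTEntry is
+ defined in Appendix A):
+
+     VMI_UINT16 saved_fs;
+
+     asm volatile("movw %%fs, %0" : "=rm" (saved_fs));
+     /* The update may force %fs to NULL if the descriptor becomes
+      * invalid or not present. */
+     VMI_WriteGDTEntry(gdt, entry, descLo, descHi);
+     asm volatile("movw %0, %%fs" : : "rm" (saved_fs));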
+
+ Because the hypervisor must protect its own memory space from
+ privileged code running in the guest at CPL1-2, descriptors may not
+ provide access to the 64 megabyte region of high linear space. To
+ achieve this, the hypervisor will truncate descriptors in the
+ descriptor tables. This means that attempts by the guest to access
+ memory through negative offsets from the segment base will fault, so
+ this is highly discouraged (some TLS implementations on Linux do this).
+ In addition, this causes the truncated length of the segment to
+ become visible to the guest through the LSL instruction.
+
+ Interrupt and I/O Subsystem.
+
+ For security reasons, the guest operating system is not given
+ control over the hardware interrupt flag. We provide a virtual
+ interrupt flag that is under guest control. The guest operating
+ system always runs with hardware interrupts enabled, but hardware
+ interrupts are transparent to the guest. The API provides calls for
+ all instructions which modify the interrupt flag.
+
+ The paravirtualization environment provides a legacy programmable
+ interrupt controller (PIC) to the virtual machine. Future releases
+ will provide a virtual interrupt controller (VIC) that provides
+ more advanced features.
+
+ In addition to a virtual interrupt flag, there is also a virtual
+ IOPL field which the guest can use to enable access to port I/O
+ from userspace for privileged applications.
+
+ Generic PCI based device probing is available to detect virtual
+ devices. The use of PCI is pragmatic, since it allows a vendor
+ ID, class ID, and device ID to identify the appropriate driver
+ for each virtual device.
+
+ IDT Management.
+
+ The paravirtual operating environment provides the traditional x86
+ interrupt descriptor table for handling external interrupts,
+ software interrupts, and exceptions. The interrupt descriptor table
+ provides the destination code selector and EIP for interruptions.
+ The current task state structure (TSS) provides the new stack
+ address to use for interruptions that result in a privilege level
+ change. The guest OS is responsible for notifying the hypervisor
+ when it updates the stack address in the TSS.
+
+ Two types of indirect control flow are of critical importance to the
+ performance of an operating system. These are system calls and page
+ faults. The guest is also responsible for calling out to the
+ hypervisor when it updates gates in the IDT. Making IDT and TSS
+ updates known to the hypervisor in this fashion allows efficient
+ delivery through these performance critical gates.
+
+ Transparent Paravirtualization.
+
+ The guest operating system may provide a compiled-in alternative
+ implementation of the VMI option ROM. This implementation should
+ provide implementations of the VMI calls that are suitable for
+ running on native x86 hardware. This code may be used by the guest
+ operating system while it is being loaded, and may also be used if
+ the operating system is loaded on hardware that does not support
+ paravirtualization.
+
+ When the guest detects that the VMI option rom is available, it
+ replaces the compiled-in version of the rom with the rom provided by
+ the platform. This can be accomplished by copying the rom contents,
+ or by remapping the virtual address containing the compiled-in rom
+ to point to the platform's ROM. When booting on a platform that
+ does not provide a VMI rom, the operating system can continue to use
+ the compiled-in version to run in a non-paravirtualized fashion.
+
+ 3rd Party Extensions.
+
+ If desired, it should be possible for 3rd party virtual machine
+ monitors to implement a paravirtualization environment that can run
+ guests written to this specification.
+
+ The general mechanism for providing customized features and
+ capabilities is to provide notification of these features through
+ the CPUID call, and to allow configuration of CPU features
+ through RDMSR / WRMSR instructions. This allows a hypervisor vendor
+ ID to be published, and the kernel may enable or disable specific
+ features based on this ID. This has the advantage of following
+ closely the boot time logic of many operating systems that enables
+ certain performance enhancements or bugfixes based on processor
+ revision, using exactly the same mechanism.
+
+ An exact formal specification of the new CPUID functions and which
+ functions are vendor specific is still needed.
+
+ AP Startup.
+
+ Application Processor startup in paravirtual SMP systems works a bit
+ differently than in a traditional x86 system.
+
+ APs will launch directly in paravirtual mode with initial state
+ provided by the BSP. Rather than the traditional init/startup
+ IPI sequence, the BSP must issue the init IPI, a set application
+ processor state hypercall, followed by the startup IPI.
+
+ The initial state contains the AP's control registers, general
+ purpose registers and segment registers, as well as the IDTR,
+ GDTR, LDTR and EFER. Any processor state not included in the initial
+ AP state (including x87 FPRs, SSE register state, and MSRs other than
+ EFER) is left in the power-on state.
+
+ The BSP must construct the initial GDT used by each AP. The segment
+ register hidden state will be loaded from the GDT specified in the
+ initial AP state. The IDT and (if used) LDT may either be constructed by
+ the BSP or by the AP.
+
+ Similarly, the initial page tables used by each AP must also be
+ constructed by the BSP.
+
+ If an AP's initial state is invalid, or no initial state is provided
+ before a start IPI is received by that AP, then the AP will fail to start.
+ It is therefore advisable to have a timeout for waiting for APs to start,
+ as is recommended for traditional x86 systems.
+
+ See VMI_SetInitialAPState in Appendix A for a description of the
+ VMI_SetInitialAPState hypercall and the associated APState data structure.
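+
+ A BSP startup path might therefore look roughly like the sketch
+ below (the IPI and setup helpers stand in for the guest's own code
+ and are not part of this API):
+
+     APState apState;  /* page-aligned, filled in by the BSP */
+
+     setup_ap_gdt_and_page_tables(&apState);  /* hypothetical helper */
+     send_init_ipi(apicID);                   /* hypothetical helper */
+     VMI_SetInitialAPState(&apState, apicID);
+     send_startup_ipi(apicID);                /* hypothetical helper */
+     wait_for_ap_with_timeout(apicID);        /* as recommended above */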
+
+ State Synchronization In SMP Systems.
+
+ Some in-memory data structures that may require no special synchronization
+ on a traditional x86 system need special handling when run on a
+ hypervisor. Two of particular note are the descriptor tables and page
+ tables.
+
+ Each processor in an SMP system should have its own GDT and LDT. Changes
+ to each processor's descriptor tables must be made on that processor
+ via the appropriate VMI calls. There is no VMI interface for updating
+ another CPU's descriptor tables (aside from VMI_SetInitialAPState),
+ and the results of memory writes to other processors' descriptor tables
+ are undefined.
+
+ Page tables have slightly different semantics than in a traditional x86
+ system. As in traditional x86 systems, page table writes may not be
+ respected by the current CPU until a TLB flush or invlpg is issued.
+ In a paravirtual system, the hypervisor implementation is free to
+ provide either shared or private caches of the guest's page tables.
+ Page table updates must therefore be propagated to the other CPUs
+ before they are guaranteed to be noticed.
+
+ In particular, when doing TLB shootdown, the initiating processor
+ must ensure that all deferred page table updates are flushed to the
+ hypervisor, to ensure that the receiving processor has the most up-to-date
+ mapping when it performs its invlpg.
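+
+ In pseudo-C, the initiator's side of a shootdown might look like
+ this (a sketch; flush_deferred_vmi_updates, send_invlpg_ipi, and
+ wait_for_acks are hypothetical names for guest-internal operations):
+
+     void shootdown(void *va)
+     {
+         /* Push any queued page table updates to the hypervisor
+          * first, so remote CPUs observe the new mappings. */
+         flush_deferred_vmi_updates();
+         send_invlpg_ipi(other_cpus, va);
+         wait_for_acks();
+     }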
+
+ Local APIC Support.
+
+ A traditional x86 local APIC is provided by the hypervisor. The local
+ APIC is enabled and its address is set via the IA32_APIC_BASE MSR, as
+ usual. APIC registers may be read and written via ordinary memory
+ operations.
+
+ For performance reasons, higher performance APIC read and write interfaces
+ are provided. If possible, these interfaces should be used to access
+ the local APIC.
+
+ The IO-APIC is not included in this spec, as it is typically not
+ performance critical, and used mainly for initial wiring of IRQ pins.
+ Currently, we implement a fully functional IO-APIC with all the
+ capabilities of real hardware. This may seem like an unnecessary burden,
+ but if the goal is transparent paravirtualization, the kernel must
+ provide fallback support for an IO-APIC anyway. In addition, the
+ hypervisor must support an IO-APIC for SMP non-paravirtualized guests.
+ The net result is less code on both sides, and an already well defined
+ interface between the two. This avoids the complexity burden of having
+ to support two different interfaces to achieve the same task.
+
+ One shortcut we have found most helpful is to simply disable NMI delivery
+ to the paravirtualized kernel. There is no reason NMIs can't be
+ supported, but typical uses for them are not as productive in a
+ virtualized environment. Watchdog NMIs are of limited use if the OS is
+ already correct and running on stable hardware; profiling NMIs are
+ similarly of less use, since this task is accomplished with more accuracy
+ in the VMM itself; and NMIs for machine check errors should be handled
+ outside of the VM. The addition of NMI support does create additional
+ complexity for the trap handling code in the VM, and although the task is
+ surmountable, the value proposition is debatable. Here, again, feedback
+ is desired.
+
+ Time Interface.
+
+ In a virtualized environment, virtual machines (VM) will time share
+ the system with each other and with other processes running on the
+ host system. Therefore, a VM's virtual CPUs (VCPUs) will be
+ executing on the host's physical CPUs (PCPUs) for only some portion
+ of time. This section of the VMI exposes a paravirtual view of
+ time to the guest operating systems so that they may operate more
+ effectively in a virtual environment. The interface also provides
+ a way for the VCPUs to set alarms in this paravirtual view of time.
+
+ Time Domains:
+
+ a) Wallclock Time:
+
+ Wallclock time exposed to the VM through this interface indicates
+ the number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO
+ 8601 date format). If the host's wallclock time changes (say, when
+ an error in the host's clock is corrected), so does the wallclock
+ time as viewed through this interface.
+
+ b) Real Time:
+
+ Another view of time accessible through this interface is real
+ time. Real time always progresses except for when the VM is
+ stopped or suspended. Real time is presented to the guest as a
+ counter which increments at a constant rate defined (and presented)
+ by the hypervisor. All the VCPUs of a VM share the same real time
+ counter.
+
+ The unit of the counter is called "cycles". The unit and initial
+ value (corresponding to the time the VM enters para-virtual mode)
+ are chosen by the hypervisor so that the real time counter will not
+ rollover in any practical length of time. It is expected that the
+ frequency (cycles per second) is chosen such that this clock
+ provides a "high-resolution" view of time. The unit can only
+ change when the VM (re)enters paravirtual mode.
+
+ c) Stolen time and Available time:
+
+ A VCPU is always in one of three states: running, halted, or ready.
+ The VCPU is in the 'running' state if it is executing. When the
+ VCPU executes the HLT interface, the VCPU enters the 'halted' state
+ and remains halted until there is some work pending for the VCPU
+ (e.g. an alarm expires, host I/O completes on behalf of virtual
+ I/O). At this point, the VCPU enters the 'ready' state (waiting
+ for the hypervisor to reschedule it). Finally, whenever the VCPU
+ is in neither the 'running' state nor the 'halted' state, it is
+ in the 'ready' state.
+
+ For example, consider the following sequence of events, with times
+ given in real time:
+
+ (Example 1)
+
+ At 0 ms, VCPU executing guest code.
+ At 1 ms, VCPU requests virtual I/O.
+ At 2 ms, Host performs I/O for virtual I/O.
+ At 3 ms, VCPU executes VMI_Halt.
+ At 4 ms, Host completes I/O for virtual I/O request.
+ At 5 ms, VCPU begins executing guest code, vectoring to the interrupt
+ handler for the device initiating the virtual I/O.
+ At 6 ms, VCPU preempted by hypervisor.
+ At 9 ms, VCPU begins executing guest code.
+
+ From 0 ms to 3 ms, VCPU is in the 'running' state. At 3 ms, VCPU
+ enters the 'halted' state and remains in this state until the 4 ms
+ mark. From 4 ms to 5 ms, the VCPU is in the 'ready' state. At 5
+ ms, the VCPU re-enters the 'running' state until it is preempted by
+ the hypervisor at the 6 ms mark. From 6 ms to 9 ms, VCPU is again
+ in the 'ready' state, and finally 'running' again after 9 ms.
+
+ Stolen time is defined per VCPU to progress at the rate of real
+ time when the VCPU is in the 'ready' state, and does not progress
+ otherwise. Available time is defined per VCPU to progress at the
+ rate of real time when the VCPU is in the 'running' and 'halted'
+ states, and does not progress when the VCPU is in the 'ready'
+ state.
+
+ So, for the above example, the following table indicates these time
+ values for the VCPU at each ms boundary:
+
+ Real time Stolen time Available time
+ 0 0 0
+ 1 0 1
+ 2 0 2
+ 3 0 3
+ 4 0 4
+ 5 1 4
+ 6 1 5
+ 7 2 5
+ 8 3 5
+ 9 4 5
+ 10 4 6
+
+ Notice that at any point:
+ real_time == stolen_time + available_time
+
+ Stolen time and available time are also presented as counters in
+ "cycles" units. The initial value of the stolen time counter is 0.
+ This implies the initial value of the available time counter is the
+ same as the real time counter.
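+
+ A guest scheduler could derive per-VCPU stolen time deltas along
+ these lines (a sketch; VMI_GetStolenCycles is a stand-in for the
+ actual counter-reading call, which this section does not define):
+
+     static VMI_UINT64 last_stolen;
+
+     VMI_UINT64 stolen_delta(void)
+     {
+         VMI_UINT64 now = VMI_GetStolenCycles();  /* hypothetical */
+         VMI_UINT64 delta = now - last_stolen;
+         last_stolen = now;
+         return delta;  /* in "cycles"; convert using the
+                           hypervisor-presented frequency */
+     }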
+
+ Alarms:
+
+ Alarms can be set (armed) against the real time counter or the
+ available time counter. Alarms can be programmed to expire once
+ (one-shot) or on a regular period (periodic). They are armed by
+ indicating an absolute counter value expiry, and in the case of a
+ periodic alarm, a non-zero relative period counter value. [TBD:
+ The method of wiring the alarms to an interrupt vector is dependent
+ upon the virtual interrupt controller portion of the interface.
+ Currently, the alarms may be wired as if they are attached to IRQ0
+ or the vector in the local APIC LVTT. This way, the alarms can be
+ used as drop in replacements for the PIT or local APIC timer.]
+
+ Alarms are per-vcpu mechanisms. An alarm set by vcpu0 will fire
+ only on vcpu0, while an alarm set by vcpu1 will only fire on vcpu1.
+ If an alarm is set relative to available time, its expiry is a
+ value relative to the available time counter of the vcpu that set
+ it.
+
+ The interface includes a method to cancel (disarm) an alarm. On
+ each vcpu, one alarm can be set against each of the two counters
+ (real time and available time). A vcpu in the 'halted' state
+ becomes 'ready' when the counter of any of its armed alarms
+ reaches that alarm's expiry.
+
+ An alarm "fires" by signaling the virtual interrupt controller. An
+ alarm will fire as soon as possible after the counter value is
+ greater than or equal to the alarm's current expiry. However, an
+ alarm can fire only when its vcpu is in the 'running' state.
+
+ If the alarm is periodic, a sequence of expiry values,
+
+ E(i) = e0 + p * i , i = 0, 1, 2, 3, ...
+
+ where 'e0' is the expiry specified when setting the alarm and 'p'
+ is the period of the alarm, is used to arm the alarm. Initially,
+ E(0) is used as the expiry. When the alarm fires, the next expiry
+ value in the sequence that is greater than the current value of the
+ counter is used as the alarm's new expiry.
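+
+ Equivalently, after a periodic alarm fires, the new expiry can be
+ computed directly (a sketch in C of the rule above):
+
+     /* Smallest E(i) = e0 + p * i strictly greater than 'counter' */
+     VMI_UINT64 next_expiry(VMI_UINT64 e0, VMI_UINT64 p,
+                            VMI_UINT64 counter)
+     {
+         if (counter < e0)
+             return e0;
+         return e0 + p * ((counter - e0) / p + 1);
+     }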
+
+ One-shot alarms have only one expiry. When a one-shot alarm fires,
+ it is automatically disarmed.
+
+ Suppose an alarm is set relative to real time with expiry at the 3
+ ms mark and a period of 2 ms. It will expire on these real time
+ marks: 3, 5, 7, 9. Note that even if the alarm does not fire
+ during the 5 ms to 7 ms interval, the alarm can fire at most once
+ during the 7 ms to 9 ms interval (unless, of course, it is
+ reprogrammed).
+
+ If an alarm is set relative to available time with expiry at the 1
+ ms mark (in available time) and with a period of 2 ms, then it will
+ expire on these available time marks: 1, 3, 5. In the scenario
+ described in example 1, those available time values correspond to
+ these values in real time: 1, 3, 6.
+
+3) Architectural Differences from Native Hardware.
+
+ For the sake of performance, some requirements are imposed on kernel
+ fault handlers which are not present on real hardware. Most modern
+ operating systems should have no trouble meeting these requirements.
+ Failure to meet these requirements may prevent the kernel from
+ working properly.
+
+ 1) The hardware flags on entry to a fault handler may not match
+ the EFLAGS image on the fault handler stack. The stack image
+ is correct, and will have the correct state of the interrupt
+ and arithmetic flags.
+
+ 2) The stack used for kernel traps must be flat - that is, zero base,
+ segment limit determined by the hypervisor.
+
+ 3) On entry to any fault handler, the stack must have sufficient space
+ to hold 32 bytes of data, or the guest may be terminated.
+
+ 4) When calling VMI functions, the kernel must be running on a
+ flat 32-bit stack and code segment.
+
+ 5) Most VMI functions require flat data and extra segment (DS and ES)
+ segments as well; notable exceptions are IRET and SYSEXIT.
+ XXXPara - may need to add STI and CLI to this list.
+
+ 6) Interrupts must always be enabled when running code in userspace.
+
+ 7) IOPL semantics for userspace are changed; although userspace may be
+ granted port access, it can not affect the interrupt flag.
+
+ 8) The EIPs at which faults may occur in VMI calls may not match the
+ original native instruction EIP; this is a bug in the system
+ today, as many guests do rely on lazy fault handling.
+
+ 9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.
+
+ 10) Todo - we would like to support these features, but they are not
+ fully tested and / or implemented:
+
+ Userspace 16-bit stack support
+ Proper handling of faulting IRETs
+
+4) ROM Implementation
+
+ Modularization
+
+ Originally, we envisioned modularizing the ROM API into several
+ subsections, but the close coupling between the initial layers
+ and the requirement to support native PCI bus devices has made
+ ROM components for network or block devices unnecessary to this
+ point in time.
+
+ VMI - the virtual machine interface. This is the core CPU, I/O
+ and MMU virtualization layer. I/O is currently limited
+ to port access to emulated devices.
+
+ Detection
+
+ The presence of hypervisor ROMs can be recognized by scanning the
+ upper region of the first megabyte of physical memory. Multiple
+ ROMs may be provided to support older API versions for legacy guest
+ OS support. ROM detection is done in the traditional manner, by
+ scanning the memory region from C8000h - DFFFFh in 2 kilobyte
+ increments. The romSignature bytes must be '0x55, 0xAA', and the
+ checksum of the region indicated by the romLength field must be zero.
+ The checksum is a simple 8-bit addition of all bytes in the ROM region.
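+
+ A detection loop consistent with these rules might look as follows
+ (a sketch only; it uses the HyperRomHeader layout from the Data
+ layout section below, and assumes the ROM region is identity-mapped
+ and readable at these linear addresses):
+
+     HyperRomHeader *find_vmi_rom(void)
+     {
+         VMI_UINT8 *p;
+
+         for (p = (VMI_UINT8 *)0xC8000;
+              p < (VMI_UINT8 *)0xE0000; p += 2048) {
+             HyperRomHeader *hdr = (HyperRomHeader *)p;
+             VMI_UINT32 len = hdr->romLength * 512;
+             VMI_UINT8 sum = 0;
+             VMI_UINT32 i;
+
+             if (hdr->romSignature != 0xAA55)
+                 continue;
+             for (i = 0; i < len; i++)
+                 sum += p[i];  /* 8-bit checksum must be zero */
+             /* 0x696D5663 is 'cVmi', little-endian; see Appendix A */
+             if (sum == 0 && hdr->hyperSignature == 0x696D5663)
+                 return hdr;
+         }
+         return NULL;
+     }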
+
+ Data layout
+
+ typedef struct HyperRomHeader {
+ uint16_t romSignature;
+ uint8_t romLength;
+ unsigned char romEntry[4];
+ uint8_t romPad0;
+ uint32_t hyperSignature;
+ uint8_t APIVersionMinor;
+ uint8_t APIVersionMajor;
+ uint8_t reserved0;
+ uint8_t reserved1;
+ uint32_t reserved2;
+ uint32_t reserved3;
+ uint16_t pciHeaderOffset;
+ uint16_t pnpHeaderOffset;
+ uint32_t romPad3;
+ char reserved[32];
+ char elfHeader[64];
+ } HyperRomHeader;
+
+ The first set of fields is defined by the BIOS:
+
+ romSignature - fixed 0xAA55, BIOS ROM signature
+ romLength - the length of the ROM, in 512 byte chunks.
+ Determines the area to be checksummed.
+ romEntry - 16-bit initialization code stub used by BIOS.
+ romPad0 - reserved
+
+ The next set of fields is defined by this API:
+
+ hyperSignature - a 4 byte signature providing recognition of the
+ device class represented by this ROM. Each
+ device class defines its own unique signature.
+ APIVersionMinor - the revision level of this device class' API.
+ This indicates incremental changes to the API.
+ APIVersionMajor - the major version. Used to indicate large
+ revisions or additions to the API which break
+ compatibility with the previous version.
+ reserved0,1,2,3 - for future expansion
+
+ The next set of fields is defined by the PCI / PnP BIOS spec:
+
+ pciHeaderOffset - relative offset to the PCI device header from
+ the start of this ROM.
+ pnpHeaderOffset - relative offset to the PnP boot header from the
+ start of this ROM.
+ romPad3 - reserved by PCI spec.
+
+ Finally, there is space for future header fields, and an area
+ reserved for an ELF header to point to symbol information.
+
+Appendix A - VMI ROM Low Level ABI
+
+ OS writers intending to port their OS to the paravirtualizable x86
+ processor being modeled by this hypervisor need to access the
+ hypervisor through the VMI layer. It is possible, although
+ currently unimplemented, to add or replace the functionality of
+ individual hypervisor calls by providing your own ROM images. This is
+ intended to allow third party customizations.
+
+ VMI compatible ROMs use the signature "cVmi" in the hyperSignature
+ field of the ROM header.
+
+ Many of these calls are compatible with the SVR4 C call ABI, using up
+ to three register arguments. Some calls are not, due to restrictions
+ of the native instruction set. Calls which diverge from this ABI are
+ noted. In GNU terms, this means most of the calls are compatible with
+ regparm(3) argument passing.
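+
+ In GCC, for example, the call annotation could be declared as shown
+ below (one plausible definition; the actual VMICALL definition
+ belongs to the guest's VMI header, not to this specification):
+
+     #define VMICALL __attribute__((regparm(3)))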
+
+ Most of these calls behave as standard C functions, and as such, may
+ clobber registers EAX, EDX, ECX, flags. Memory clobbers are noted
+ explicitly, since many of them may be inlined without a memory clobber.
+
+ Most of these calls require well defined segment conventions - that is,
+ flat full size 32-bit segments for all the general segments, CS, SS, DS,
+ ES. Exceptions in some cases are noted.
+
+ The net result of these choices is that most of the calls are very
+ easy to make from C-code, and calls that are likely to be required in
+ low level trap handling code are easy to call from assembler. Most
+ of these calls are also very easily implemented by the hypervisor
+ vendor in C code, and only the performance critical calls from
+ assembler paths require custom assembly implementations.
+
+ CORE INTERFACE CALLS
+
+ This set of calls provides the base functionality to establish running
+ the kernel in VMI mode.
+
+ The interface will be expanded to include feature negotiation, more
+ explicit control over call bundling and flushing, and hypervisor
+ notifications to allow inline code patching.
+
+ VMI_Init
+
+ VMICALL void VMI_Init(void);
+
+ Initializes the hypervisor environment. Returns zero on success,
+ or -1 if the hypervisor could not be initialized. Note that this
+ is a recoverable error if the guest provides the requisite native
+ code to support transparent paravirtualization.
+
+ Inputs: None
+ Outputs: EAX = result
+ Clobbers: Standard
+ Segments: Standard
+
+
+ PROCESSOR STATE CALLS
+
+ This set of calls controls the online status of the processor. It
+ includes interrupt control, reboot, halt, and shutdown functionality.
+ Future expansions may include deep sleep and hotplug CPU capabilities.
+
+ VMI_DisableInterrupts
+
+ VMICALL void VMI_DisableInterrupts(void);
+
+ Disable maskable interrupts on the processor.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Flags only
+ Segments: As this is both performance critical and likely to
+ be called from low level interrupt code, this call does not
+ require flat DS/ES segments, but uses the stack segment for
+ data access. Therefore only CS/SS must be well defined.
+
+ VMI_EnableInterrupts
+
+ VMICALL void VMI_EnableInterrupts(void);
+
+ Enable maskable interrupts on the processor. Note that the
+ current implementation always will deliver any pending interrupts
+ on a call which enables interrupts, for compatibility with kernel
+ code which expects this behavior. Whether this should be required
+ is open for debate.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_GetInterruptMask
+
+ VMICALL VMI_UINT VMI_GetInterruptMask(void);
+
+ Returns the current interrupt state mask of the processor. The
+ mask is defined to be 0x200 (matching processor flag IF) to indicate
+ interrupts are enabled.
+
+ Inputs: None
+ Outputs: EAX = mask
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_SetInterruptMask
+
+ VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+ Set the current interrupt state mask of the processor. Also
+ delivers any pending interrupts if the mask is set to allow
+ them.
+
+ Inputs: EAX = mask
+ Outputs: None
+ Clobbers: Flags only
+ Segments: CS/SS only
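+
+ Together, the mask calls support the usual save/restore idiom, as in
+ this sketch of how a guest might implement its local_irq_save and
+ local_irq_restore equivalents:
+
+     VMI_UINT flags;
+
+     flags = VMI_GetInterruptMask();  /* save */
+     VMI_DisableInterrupts();
+     /* ... critical section ... */
+     VMI_SetInterruptMask(flags);     /* restore; delivers pending
+                                         interrupts if re-enabled */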
+
+ VMI_DeliverInterrupts (For future debate)
+
+ Enable and deliver any pending interrupts. This would remove
+ the implicit delivery semantic from the SetInterruptMask and
+ EnableInterrupts calls.
+
+ VMI_Pause
+
+ VMICALL void VMI_Pause(void);
+
+ Pause the processor temporarily, to allow a hypertwin or remote
+ CPU to continue operation without lock or cache contention.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Halt
+
+ VMICALL void VMI_Halt(void);
+
+ Put the processor into interruptible halt mode. This is defined
+ to be a non-running mode where maskable interrupts are enabled,
+ not a deep low power sleep mode.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Shutdown
+
+ VMICALL void VMI_Shutdown(void);
+
+ Put the processor into non-interruptible halt mode. This is defined
+ to be a non-running mode where maskable interrupts are disabled; it
+ indicates a power-off event for this CPU.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Reboot:
+
+ VMICALL void VMI_Reboot(VMI_INT how);
+
+ Reboot the virtual machine, using a hard or soft reboot. A soft
+ reboot corresponds to the effects of an INIT IPI, and preserves
+ some APIC and CR state. A hard reboot corresponds to a hardware
+ reset.
+
+ Inputs: EAX = reboot mode
+ #define VMI_REBOOT_SOFT 0x0
+ #define VMI_REBOOT_HARD 0x1
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetInitialAPState:
+
+ VMICALL void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+ Sets the initial state of the application processor with local APIC ID
+ "apicID" to the state in apState. apState must be the page-aligned
+ linear address of the APState structure describing the initial state of
+ the specified application processor.
+
+ Control register CR0 must have both PE and PG set; the result of
+ either of these bits being cleared is undefined. It is recommended
+ that for best performance, all processors in the system have the same
+ setting of the CR4 PAE bit. LME and LMA in EFER are both currently
+ unsupported. The result of setting either of these bits is undefined.
+
+ Inputs: EAX = pointer to APState structure for new co-processor
+ EDX = APIC ID of processor to initialize
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+
+ DESCRIPTOR RELATED CALLS
+
+ VMI_SetGDT
+
+ VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+
+ Load the global descriptor table limit and base registers. In
+ addition to the straightforward load of the hardware registers, this
+ has the additional side effect of reloading all segment registers in a
+ virtual machine. The reason is that otherwise, the hidden part of
+ segment registers (the base field) may be put into a non-reversible
+ state. Non-reversible segments are problematic because they can not be
+ reloaded - any subsequent loads of the segment will load the new
+ descriptor state. In general, it is not possible to resume direct
+ execution of the virtual machine if certain segments become
+ non-reversible.
+
+ A load of the GDTR may cause the guest visible memory image of the GDT
+ to be changed. This allows the hypervisor to share the GDT pages with
+ the guest, but also continue to maintain appropriate protections on the
+ GDT page by transparently adjusting the DPL and RPL of descriptors in
+ the GDT.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetIDT
+
+ VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+
+ Load the interrupt descriptor table limit and base registers. The IDT
+ format is defined to be the same as native hardware.
+
+ A load of the IDTR may cause the guest visible memory image of the IDT
+ to be changed. This allows the hypervisor to rewrite the IDT pages in
+ a format more suitable to the hypervisor, which may include adjusting
+ the DPL and RPL of descriptors in the guest IDT.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetLDT
+
+ VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+
+ Load the local descriptor table. This has the additional side effect
+ of reloading all segment registers. See VMI_SetGDT for an
+ explanation of why this is required. A load of the LDT may cause the
+ guest visible memory image of the LDT to be changed, just as GDT and
+ IDT loads.
+
+ Inputs: EAX = GDT selector of LDT descriptor
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetTR
+
+ VMICALL void VMI_SetTR(VMI_SELECTOR trSel);
+
+ Load the task register. Functionally equivalent to the LTR
+ instruction.
+
+ Inputs: EAX = GDT selector of TR descriptor
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetGDT
+
+ VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+
+ Copy the GDT limit and base fields into the provided pointer. This is
+ equivalent to the SGDT instruction, which is non-virtualizable.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetIDT
+
+ VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+
+ Copy the IDT limit and base fields into the provided pointer. This is
+ equivalent to the SIDT instruction, which is non-virtualizable.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetLDT
+
+ VMICALL VMI_SELECTOR VMI_GetLDT(void);
+
+ Return the selector of the local descriptor table. Functionally
+ equivalent to the SLDT instruction, which is non-virtualizable.
+
+ Inputs: None
+ Outputs: EAX = selector of LDT descriptor
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetTR
+
+ VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+ Return the selector of the task register. Functionally equivalent
+ to the STR instruction, which is non-virtualizable.
+
+ Inputs: None
+ Outputs: EAX = selector of TR descriptor
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteGDTEntry
+
+ VMICALL void VMI_WriteGDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a GDT entry. Note that writes to the GDT itself
+ may be disallowed by the hypervisor, in which case this call must be
+ converted into a hypercall. In addition, since the descriptor may need
+ to be modified to change limits and / or permissions, the guest kernel
+ should not assume the update will be binary identical to the passed
+ input.
+
+ Inputs: EAX = pointer to GDT base
+ EDX = GDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteLDTEntry
+
+ VMICALL void VMI_WriteLDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a LDT entry. Note that writes to the LDT itself
+ may be disallowed by the hypervisor, in which case this call must be
+ converted into a hypercall. In addition, since the descriptor may need
+ to be modified to change limits and / or permissions, the guest kernel
+ should not assume the update will be binary identical to the passed
+ input.
+
+ Inputs: EAX = pointer to LDT base
+ EDX = LDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteIDTEntry
+
+ VMICALL void VMI_WriteIDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to an IDT entry. Since the descriptor may need to be
+ modified to change limits and / or permissions, the guest kernel should
+ not assume the update will be binary identical to the passed input.
+
+ Inputs: EAX = pointer to IDT base
+ EDX = IDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+
+ CPU CONTROL CALLS
+
+ These calls encapsulate the set of privileged instructions used to
+ manipulate the CPU control state. These instructions are all properly
+ virtualizable using trap and emulate, but for performance reasons, a
+ direct call may be more efficient. With hardware virtualization
+ capabilities, many of these calls can be left as IDENT translations, that
+ is, inline implementations of the native instructions, which are not
+ rewritten by the hypervisor. Some of these calls are performance critical
+ during context switch paths, and some are not, but they are all included
+ for completeness, with the exceptions of the obsoleted LMSW and SMSW
+ instructions.
+
+ VMI_WRMSR
+
+ VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+
+ Write to a model specific register. This functions identically to the
+ hardware WRMSR instruction. Note that a hypervisor may not implement
+ the full set of MSRs supported by native hardware, since many of them
+ are not useful in the context of a virtual machine.
+
+ Inputs: ECX = model specific register index
+ EAX = low word of register
+ EDX = high word of register
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_RDMSR
+
+ VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+ Read from a model specific register. This functions identically to the
+ hardware RDMSR instruction. Note that a hypervisor may not implement
+ the full set of MSRs supported by native hardware, since many of them
+ are not useful in the context of a virtual machine.
+
+ Inputs: ECX = machine specific register index
+ Outputs: EAX = low word of register
+ EDX = high word of register
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR0
+
+ VMICALL void VMI_SetCR0(VMI_UINT val);
+
+ Write to control register zero. This can cause TLB flush and FPU
+ handling side effects. The set of features available to the kernel
+ depends on the completeness of the hypervisor. An explicit list of
+ supported functionality or required settings may need to be negotiated
+ by the hypervisor and kernel during bootstrapping. This is likely to
+ be implementation or vendor specific, and the precise restrictions are
+ not yet worked out. Our implementation in general supports turning on
+ additional functionality - enabling protected mode, paging, page write
+ protections; however, once those features have been enabled, they may
+ not be disabled on the virtual hardware.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR2
+
+ VMICALL void VMI_SetCR2(VMI_UINT val);
+
+ Write to control register two. This has no side effects other than
+ updating the CR2 register value.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR3
+
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+
+ Write to control register three. This causes a TLB flush on the local
+ processor. In addition, this update may be queued as part of a lazy
+ call invocation, which allows multiple hypercalls to be issued during
+ the context switch path. The queuing convention is to be negotiated
+ with the hypervisor during bootstrapping, but the interfaces for this
+ negotiation are currently vendor specific.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+ Queue Class: MMU
+
+ VMI_SetCR4
+
+ VMICALL void VMI_SetCR4(VMI_UINT val);
+
+ Write to control register four. This can cause TLB flush and many
+ other CPU side effects. The set of features available to the kernel
+ depends on the completeness of the hypervisor. An explicit list of
+ supported functionality or required settings may need to be negotiated
+ by the hypervisor and kernel during bootstrapping. This is likely to
+ be implementation or vendor specific, and the precise restrictions are
+ not yet worked out. Our implementation in general supports turning on
+ additional MMU functionality - enabling global pages, large pages, PAE
+ mode, and other features - however, once those features have been
+ enabled, they may not be disabled on the virtual hardware. The
+ remaining CPU control bits of CR4 remain active and behave identically
+ to real hardware.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCR0
+ VMI_GetCR2
+ VMI_GetCR3
+ VMI_GetCR4
+
+ VMICALL VMI_UINT32 VMI_GetCR0(void);
+ VMICALL VMI_UINT32 VMI_GetCR2(void);
+ VMICALL VMI_UINT32 VMI_GetCR3(void);
+ VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+ Read the value of a control register into EAX. The register contents
+ are identical to the native hardware control registers; CR0 contains
+ the control bits and task switched flag, CR2 contains the last page
+ fault address, CR3 contains the page directory base pointer, and CR4
+ contains various feature control bits.
+
+ Inputs: None
+ Outputs: EAX = value of control register
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_CLTS
+
+ VMICALL void VMI_CLTS(void);
+
+ Used to clear the task switched (TS) flag in control register zero. A
+ replacement for the CLTS instruction.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetDR
+
+ VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+
+ Set the debug register to the given value. If a hypervisor
+ implementation supports debug registers, this functions equivalently to
+ native hardware move to DR instructions.
+
+ Inputs: EAX = debug register number
+ EDX = debug register value
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetDR
+
+ VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+ Read a debug register. If debug registers are not supported, the
+ implementation is free to return zero values.
+
+ Inputs: EAX = debug register number
+ Outputs: EAX = debug register value
+ Clobbers: Standard
+ Segments: Standard
+
+
+ PROCESSOR INFORMATION CALLS
+
+ These calls provide access to processor identification, performance and
+ cycle data, which may be inaccurate due to the nature of running on
+ virtual hardware. This information may be visible in a non-virtualizable
+ way to applications running outside of the kernel. As such, both RDTSC
+ and RDPMC should be disabled by kernels or hypervisors where information
+ leakage is a concern, and the accuracy of data retrieved by these functions
+ is up to the individual hypervisor vendor.
+
+ VMI_CPUID
+
+ /* Not expressible as a C function */
+
+ The CPUID instruction provides processor feature identification in a
+ vendor specific manner. The instruction itself is non-virtualizable
+ without hardware support, requiring a hypervisor assisted CPUID call
+ that emulates the effect of the native instruction, while masking any
+ unsupported CPU feature bits.
+
+ Inputs: EAX = CPUID number
+ ECX = sub-level query (nonstandard)
+ Outputs: EAX = CPUID dword 0
+ EBX = CPUID dword 1
+ ECX = CPUID dword 2
+ EDX = CPUID dword 3
+ Clobbers: Flags only
+ Segments: Standard
+
+ VMI_RDTSC
+
+ VMICALL VMI_UINT64 VMI_RDTSC(void);
+
+ The RDTSC instruction provides a cycles counter which may be made
+ visible to userspace. For better or worse, many applications have made
+ use of this feature to implement userspace timers, database indices, or
+ for micro-benchmarking of performance. This instruction is extremely
+ problematic for virtualization, because even though it is selectively
+ virtualizable using trap and emulate, it is much more expensive to
+ virtualize it in this fashion. On the other hand, if this instruction
+ is allowed to execute without trapping, the cycle counter provided
+ could be wrong in any number of circumstances due to hardware drift,
+ migration, suspend/resume, CPU hotplug, and other unforeseen
+ consequences of running inside of a virtual machine. There is no
+ standard specification for how this instruction operates when issued
+ from userspace programs, but the VMI call here provides a proper
+ interface for the kernel to read this cycle counter.
+
+ Inputs: None
+ Outputs: EAX = low word of TSC cycle counter
+ EDX = high word of TSC cycle counter
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_RDPMC
+
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+ Similar to RDTSC, this call provides the functionality of reading
+ processor performance counters. It also is selectively visible to
+ userspace, and maintaining accurate data for the performance counters
+ is an extremely difficult task due to the side effects introduced by
+ the hypervisor.
+
+ Inputs: ECX = performance counter index
+ Outputs: EAX = low word of counter
+ EDX = high word of counter
+ Clobbers: Standard
+ Segments: Standard
+
+
+ STACK / PRIVILEGE TRANSITION CALLS
+
+ This set of calls encapsulates mechanisms required to transfer between
+ higher privileged kernel tasks and userspace. The stack switching and
+ return mechanisms are also used to return from interrupt handlers into
+ the kernel, which may involve atomic interrupt state and stack
+ transitions.
+
+ VMI_UpdateKernelStack
+
+ VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+ Inform the hypervisor that a new kernel stack pointer has been loaded
+ in the TSS structure. This new kernel stack pointer will be used for
+ entry into the kernel on interrupts from userspace.
+
+ Inputs: EAX = pointer to TSS structure
+ EDX = new kernel stack top
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
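+
+ For instance, a context switch path might publish the incoming
+ task's kernel stack as follows (a sketch; the per-CPU TSS variable
+ and the 'next' task fields are illustrative guest-side names):
+
+     /* During switch_to: make interrupts from userspace land on
+      * the new task's kernel stack. */
+     VMI_UpdateKernelStack(&per_cpu_tss, next->thread.esp0);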
+
+ VMI_IRET
+
+ /* No C prototype provided */
+
+ Perform a near equivalent of the IRET instruction, which atomically
+ switches off the current stack and restores the interrupt mask. This
+ may return to userspace or back to the kernel from an interrupt or
+ exception handler. The VMI_IRET call does not restore IOPL from the
+ stack image, as the native hardware equivalent would. Instead, IOPL
+ must be explicitly restored using a VMI_SetIOPL call. The VMI_IRET
+ call does, however, restore the state of the EFLAGS_VM bit from the
+ stack image in the event that the hypervisor and kernel both support
+ V8086 execution mode. If the hypervisor does not support V8086 mode,
+ this cannot be silently ignored, generating an error that the guest must
+ deal with. Note this call is made using a CALL instruction, just as
+ all other VMI calls, so the EIP of the call site is available to the
+ VMI layer. This allows faults during the sequence to be properly
+ passed back to the guest kernel with the correct EIP.
+
+ Note that returning to userspace with interrupts disabled is an invalid
+ operation in a paravirtualized kernel, and the results of an attempt to
+ do so are undefined.
+
+ Also note that when issuing the VMI_IRET call, the userspace data
+ segments may have already been restored, so only the stack and code
+ segments can be assumed valid.
+
+ There is currently no support for IRET calls from a 16-bit stack
+ segment, which poses a problem for supporting certain userspace
+ applications which make use of high bits of ESP on a 16-bit stack. How
+ to best resolve this is an open question. One possibility is to
+ introduce a new VMI call which can operate on 16-bit segments, since it
+ is desirable to make the common case here as fast as possible.
+
+ Inputs: ST(0) = New EIP
+ ST(1) = New CS
+ ST(2) = New Flags (including interrupt mask)
+ ST(3) = New ESP (for userspace returns)
+ ST(4) = New SS (for userspace returns)
+ ST(5) = New ES (for v8086 returns)
+ ST(6) = New DS (for v8086 returns)
+ ST(7) = New FS (for v8086 returns)
+ ST(8) = New GS (for v8086 returns)
+ Outputs: None (does not return)
+ Clobbers: None (does not return)
+ Segments: CS / SS only
+
+ VMI_SYSEXIT
+
+ /* No C prototype provided */
+
+ For hypervisors and processors which support SYSENTER / SYSEXIT, the
+ VMI_SYSEXIT call is provided as a binary equivalent of the native
+ SYSEXIT instruction. Since interrupts must always be enabled in
+ userspace, the VMI version of this call atomically combines enabling
+ interrupts with the return to userspace.
+
+ Inputs: EDX = New EIP
+ ECX = New ESP
+ Outputs: None (does not return)
+ Clobbers: None (does not return)
+ Segments: CS / SS only
+
+
+ I/O CALLS
+
+ This set of calls incorporates I/O related calls - PIO, setting I/O
+ privilege level, and forcing memory writeback for device coherency.
+
+ VMI_INB
+ VMI_INW
+ VMI_INL
+
+ VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+ Input a byte, word, or doubleword from an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+
+ Inputs: EDX = port number
+ EDX, rather than EAX, is used because the native
+ encoding of the instruction may use this register
+ implicitly.
+ Outputs: EAX = port value
+ Clobbers: Memory only
+ Segments: Standard
+
+ VMI_OUTB
+ VMI_OUTW
+ VMI_OUTL
+
+ VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+ Output a byte, word, or doubleword to an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+
+ Inputs: EAX = port value
+ EDX = port number
+ Outputs: None
+ Clobbers: None
+ Segments: Standard
+
+ VMI_INSB
+ VMI_INSW
+ VMI_INSL
+
+ /* Not expressible as C functions */
+
+ Input a string of bytes, words, or doublewords from an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+ They do not follow a C calling convention, and clobber only the same
+ registers as native instructions.
+
+ Inputs: EDI = destination address
+ EDX = port number
+ ECX = count
+ Outputs: None
+ Clobbers: EDI, ECX, Memory
+ Segments: Standard
+
+ VMI_OUTSB
+ VMI_OUTSW
+ VMI_OUTSL
+
+ /* Not expressible as C functions */
+
+ Output a string of bytes, words, or doublewords to an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+ They do not follow a C calling convention, and clobber only the same
+ registers as native instructions.
+
+ Inputs: ESI = source address
+ EDX = port number
+ ECX = count
+ Outputs: None
+ Clobbers: ESI, ECX
+ Segments: Standard
+
+ VMI_IODelay
+
+ VMICALL void VMI_IODelay(void);
+
+ Delay the processor by the time required to access a bus register.
+ This is easily implemented on native hardware by an access to a bus
+ scratch register, but is typically not useful in a virtual machine.
+ It is paravirtualized to remove the overhead implied by executing the
+ native delay.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
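+
+ For example (illustrative only), a legacy CMOS RTC register could be
+ read with the paravirtual PIO calls; ports 0x70/0x71 follow the
+ standard PC layout, and 0 serves as the dummy argument:
+
+     /* Hypothetical CMOS read using paravirtual PIO. */
+     static VMI_UINT8 cmos_read(VMI_UINT8 index)
+     {
+         VMI_OUTB(index, 0x70);   /* EAX = value, EDX = port */
+         VMI_IODelay();           /* replaces the native bus delay */
+         return VMI_INB(0, 0x71); /* dummy in EAX, port in EDX */
+     }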
+
+ VMI_SetIOPLMask
+
+ VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+ Set the IOPL mask of the processor to allow userspace to access I/O
+ ports. Note the mask is pre-shifted, so an IOPL of 3 would be
+ expressed as (3 << 12). If the guest chooses to use IOPL to allow
+ CPL-3 access to I/O ports, it must explicitly set and restore IOPL
+ using this call; attempting to set the IOPL field with popf or iret
+ may have no effect.
+
+ Inputs: EAX = Mask
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
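+
+ As a minimal sketch, a sys_iopl()-style path granting userspace
+ I/O access might issue the following; the wrapper is hypothetical:
+
+     /* Hypothetical: raise IOPL to the given level (0-3). */
+     static void set_user_iopl(unsigned int level)
+     {
+         VMI_SetIOPLMask(level << 12); /* pre-shifted EFLAGS position */
+     }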
+
+ VMI_WBINVD
+
+ VMICALL void VMI_WBINVD(void);
+
+ Write back and invalidate the data cache. This is used to synchronize
+ I/O memory.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_INVD
+
+ This instruction is deprecated. It is invalid to execute in a virtual
+ machine. It is documented here only because it is still declared in
+ the interface, and dropping it would require a version change.
+
+
+ APIC CALLS
+
+ APIC virtualization is currently quite simple. These calls support the
+ functionality of the hardware APIC in a form that allows for more
+ efficient implementation in a hypervisor, by avoiding trapping access to
+ APIC memory. The calls are kept simple to make the implementation
+ compatible with native hardware. The APIC must be mapped at a page
+ boundary in the processor virtual address space.
+
+ VMI_APICWrite
+
+ VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+
+ Write to a local APIC register. Side effects are the same as native
+ hardware APICs.
+
+ Inputs: EAX = APIC register address
+ EDX = value to write
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_APICRead
+
+ VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+ Read from a local APIC register. Side effects are the same as native
+ hardware APICs.
+
+ Inputs: EAX = APIC register address
+ Outputs: EAX = APIC register value
+ Clobbers: Standard
+ Segments: Standard
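+
+ For instance, an interrupt exit path could acknowledge the local
+ APIC through this accessor; the base pointer below is assumed to
+ hold the page-aligned APIC mapping, and 0xB0 is the standard EOI
+ register offset in the local APIC layout:
+
+     static char *apic_base; /* assumed page-aligned APIC mapping */
+
+     /* Hypothetical EOI via the paravirtual accessor. */
+     static void ack_APIC_irq(void)
+     {
+         VMI_APICWrite(apic_base + 0xB0, 0);
+     }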
+
+
+ TIMER CALLS
+
+ The VMI interfaces define a highly accurate and efficient timer interface
+ that is available when running inside of a hypervisor. This is an
+ optional but highly recommended feature which avoids many of the problems
+ presented by classical timer virtualization. It provides notions of
+ stolen time, counters, and wall clock time which allow the VM to
+ obtain the most accurate time information in a way that is free of
+ races and legacy hardware dependence.
+
+ VMI_GetWallclockTime
+
+ VMI_NANOSECS VMICALL VMI_GetWallclockTime(void);
+
+ VMI_GetWallclockTime returns the current wallclock time as the number
+ of nanoseconds since the epoch. Nanosecond resolution with a 64-bit
+ unsigned type provides over 580 years from epoch until rollover.
+ The wallclock time is relative to the host's wallclock time.
+
+ Inputs: None
+ Outputs: EAX = low word, wallclock time in nanoseconds
+ EDX = high word, wallclock time in nanoseconds
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_WallclockUpdated
+
+ VMI_BOOL VMICALL VMI_WallclockUpdated(void);
+
+ VMI_WallclockUpdated returns TRUE if the wallclock time has changed
+ relative to the real cycle counter since the previous time that
+ VMI_WallclockUpdated was polled. For example, while a VM is suspended,
+ the real cycle counter will halt, but wallclock time will continue to
+ advance. Upon resuming the VM, the first call to VMI_WallclockUpdated
+ will return TRUE.
+
+ Inputs: None
+ Outputs: EAX = 0 for FALSE, 1 for TRUE
+ Clobbers: Standard
+ Segments: Standard
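+
+ For example, a periodic timer handler can poll cheaply and
+ resynchronize only when needed; update_guest_walltime() is a
+ hypothetical helper, not part of this interface:
+
+     extern void update_guest_walltime(VMI_NANOSECS ns); /* assumed */
+
+     /* Hypothetical poll from the guest's timer handler. */
+     static void check_wallclock(void)
+     {
+         if (VMI_WallclockUpdated())
+             update_guest_walltime(VMI_GetWallclockTime());
+     }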
+
+ VMI_GetCycleFrequency
+
+ VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+
+ VMI_GetCycleFrequency returns the number of cycles in one second. This
+ value can be used by the guest to convert between cycles and other time
+ units.
+
+ Inputs: None
+ Outputs: EAX = low word, cycle frequency
+ EDX = high word, cycle frequency
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCycleCounter
+
+ VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+ VMI_GetCycleCounter returns the current value, in cycles units, of the
+ counter corresponding to 'whichCounter' if it is one of
+ VMI_CYCLES_REAL, VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN.
+ VMI_GetCycleCounter returns 0 for any other value of 'whichCounter'.
+
+ Inputs: EAX = counter index, one of
+ #define VMI_CYCLES_REAL 0
+ #define VMI_CYCLES_AVAILABLE 1
+ #define VMI_CYCLES_STOLEN 2
+ Outputs: EAX = low word, cycle counter
+ EDX = high word, cycle counter
+ Clobbers: Standard
+ Segments: Standard
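+
+ As an illustration, a counter can be converted to nanoseconds using
+ the frequency call; the split arithmetic below is only a sketch of
+ one way to avoid overflowing the 64-bit intermediate:
+
+     /* Hypothetical conversion of stolen time to nanoseconds.
+        VMI_WALLCLOCK_HZ is ns per second (see Appendix B). */
+     static VMI_NANOSECS stolen_time_ns(void)
+     {
+         VMI_CYCLES hz = VMI_GetCycleFrequency();
+         VMI_CYCLES c  = VMI_GetCycleCounter(VMI_CYCLES_STOLEN);
+         return (c / hz) * VMI_WALLCLOCK_HZ
+              + (c % hz) * VMI_WALLCLOCK_HZ / hz;
+     }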
+
+ VMI_SetAlarm
+
+ VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+ VMI_CYCLES period);
+
+ VMI_SetAlarm is used to arm the vcpu's alarms. The 'flags' parameter
+ is used to specify which counter's alarm is being set (VMI_CYCLES_REAL
+ or VMI_CYCLES_AVAILABLE), how to deliver the alarm to the vcpu
+ (VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT), and the mode
+ (VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC). If the alarm is set
+ against the VMI_ALARM_STOLEN counter or an undefined counter number,
+ the call is a nop. The 'expiry' parameter indicates the expiry of the
+ alarm, and for periodic alarms, the 'period' parameter indicates the
+ period of the alarm. If the value of 'period' is zero, the alarm is
+ armed as a one-shot alarm regardless of the mode specified by 'flags'.
+ Finally, a call to VMI_SetAlarm for an alarm that is already armed is
+ equivalent to first calling VMI_CancelAlarm and then calling
+ VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is not
+ accessible.
+
+ /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+
+ Inputs: EAX = flags value, cycle counter number or'ed with
+ #define VMI_ALARM_WIRED_IRQ0 0x00000000
+ #define VMI_ALARM_WIRED_LVTT 0x00010000
+ #define VMI_ALARM_IS_ONESHOT 0x00000000
+ #define VMI_ALARM_IS_PERIODIC 0x00000100
+ EDX = low word, alarm expiry
+ ECX = high word, alarm expiry
+ ST(0) = low word, alarm period
+ ST(1) = high word, alarm period
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
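+
+ As an example, a guest could arm a 100 Hz periodic tick delivered as
+ IRQ0 against real time; GUEST_HZ is an illustrative constant:
+
+     #define GUEST_HZ 100
+
+     /* Hypothetical: arm a periodic tick on the real-time counter. */
+     static void start_periodic_tick(void)
+     {
+         VMI_CYCLES period = VMI_GetCycleFrequency() / GUEST_HZ;
+         VMI_CYCLES now    = VMI_GetCycleCounter(VMI_CYCLES_REAL);
+         VMI_SetAlarm(VMI_CYCLES_REAL | VMI_ALARM_WIRED_IRQ0 |
+                      VMI_ALARM_IS_PERIODIC, now + period, period);
+     }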
+
+ VMI_CancelAlarm
+
+ VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+ VMI_CancelAlarm is used to disarm an alarm. The 'flags' parameter
+ indicates which alarm to cancel (VMI_CYCLES_REAL or
+ VMI_CYCLES_AVAILABLE). The return value indicates whether or not the
+ cancel succeeded. A return value of FALSE indicates that the alarm was
+ already disarmed either because a) the alarm was never set or b) it was
+ a one-shot alarm and has already fired (though perhaps not yet
+ delivered to the guest). TRUE indicates that the alarm was armed and
+ either a) the alarm was one-shot and has not yet fired (and will no
+ longer fire until it is rearmed) or b) the alarm was periodic.
+
+ Inputs: EAX = cycle counter number
+ Outputs: EAX = 0 for FALSE, 1 for TRUE
+ Clobbers: Standard
+ Segments: Standard
+
+
+ MMU CALLS
+
+ The MMU plays a large role in paravirtualization due to the large
+ performance opportunities realized by gaining insight into the guest
+ machine's use of page tables. These calls are designed to accommodate the
+ existing MMU functionality in the guest OS while providing the hypervisor
+ with hints that can be used to optimize performance to a large degree.
+
+ VMI_SetLinearMapping
+
+ VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+ VMI_UINT32 pages, VMI_UINT32 ppn);
+
+ /* The number of VMI address translation slots */
+ #define VMI_LINEAR_MAP_SLOTS 4
+
+ Register a virtual to physical translation of an address range to
+ physical pages. This may be used to register single pages or to
+ register large ranges. There is an upper limit on the number of active
+ mappings, which should be sufficient to allow the hypervisor and VMI
+ layer to perform page translation without requiring dynamic storage.
+ Translations are only required to be registered for addresses used to
+ access page table entries through the VMI page table access functions.
+ The guest is free to use the provided linear map slots in a manner that
+ it finds most convenient. Kernels which linearly map a large chunk of
+ physical memory and use page tables in this linear region will only
+ need to register one such region after initialization of the VMI.
+ Hypervisors which do not require linear to physical conversion hints
+ are free to leave these calls as NOPs, which is the default when
+ inlined into the native kernel.
+
+ Inputs: EAX = linear map slot
+ EDX = virtual address start of mapping
+ ECX = number of pages in mapping
+ ST(0) = physical frame number to which pages are mapped
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
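+
+ For instance, a kernel which maps all of lowmem linearly might
+ register that single range once at startup; PAGE_OFFSET and the page
+ count below are illustrative values, not requirements:
+
+     #define PAGE_OFFSET  0xC0000000u
+     #define LOWMEM_PAGES ((896u * 1024 * 1024) >> 12) /* 896 MB */
+
+     /* Hypothetical: register the kernel linear mapping in slot 0,
+        mapping PAGE_OFFSET upward onto physical page 0 upward. */
+     static void register_linear_map(void)
+     {
+         VMI_SetLinearMapping(0, PAGE_OFFSET, LOWMEM_PAGES, 0);
+     }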
+
+ VMI_FlushTLB
+
+ VMICALL void VMI_FlushTLB(int how);
+
+ Flush all non-global mappings in the TLB, optionally flushing global
+ mappings as well. The VMI_FLUSH_TLB flag should always be specified,
+ optionally or'ed with the VMI_FLUSH_GLOBAL flag.
+
+ Inputs: EAX = flush type
+ #define VMI_FLUSH_TLB 0x01
+ #define VMI_FLUSH_GLOBAL 0x02
+ Outputs: None
+ Clobbers: Standard, memory (implied)
+ Segments: Standard
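+
+ A minimal wrapper for a full flush, including global pages, might
+ read (sketch only):
+
+     /* Hypothetical: flush the entire TLB, global mappings included. */
+     static void flush_tlb_all(void)
+     {
+         VMI_FlushTLB(VMI_FLUSH_TLB | VMI_FLUSH_GLOBAL);
+     }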
+
+ VMI_InvalPage
+
+ VMICALL void VMI_InvalPage(VMI_UINT32 va);
+
+ Invalidate the TLB mapping for a single page or large page at the
+ given virtual address.
+
+ Inputs: EAX = virtual address
+ Outputs: None
+ Clobbers: Standard, memory (implied)
+ Segments: Standard
+
+ The remaining documentation here needs updating when the PTE accessors are
+ simplified.
+
+ 70) VMI_SetPte
+
+ void VMI_SetPte(VMI_PTE pte, VMI_PTE *ptep);
+
+ Assigns a new value to a page table / directory entry. It is a
+ requirement that ptep points to a page that has already been
+ registered with the hypervisor as a page of the appropriate type
+ using the VMI_RegisterPageUsage function.
+
+ 71) VMI_SwapPte
+
+ VMI_PTE VMI_SwapPte(VMI_PTE pte, VMI_PTE *ptep);
+
+ Write 'pte' into the page table entry pointed to by 'ptep', and
+ return the old value of the entry. This function acts atomically on
+ the PTE to provide up-to-date A/D bit information in the returned
+ value.
+
+ 72) VMI_TestAndSetPteBit
+
+ VMI_BOOL VMI_TestAndSetPteBit(VMI_INT bit, VMI_PTE *ptep);
+
+ Atomically set a bit in a page table entry. Returns zero if the bit
+ was not set, and non-zero if the bit was set.
+
+ 73) VMI_TestAndClearPteBit
+
+ VMI_BOOL VMI_TestAndClearPteBit(VMI_INT bit, VMI_PTE *ptep);
+
+ Atomically clear a bit in a page table entry. Returns zero if the bit
+ was not set, and non-zero if the bit was set.
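+
+ For example, writeback code might atomically harvest the hardware
+ dirty bit; bit 6 is the dirty bit in the x86 PTE layout:
+
+     /* Hypothetical: test and clear the PTE dirty bit (bit 6). */
+     static VMI_BOOL pte_test_and_clear_dirty(VMI_PTE *ptep)
+     {
+         return VMI_TestAndClearPteBit(6, ptep);
+     }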
+
+ 74) VMI_SetPteLong
+ 75) VMI_SwapPteLong
+ 76) VMI_TestAndSetPteBitLong
+ 77) VMI_TestAndClearPteBitLong
+
+ void VMI_SetPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
+ VMI_PAE_PTE VMI_SwapPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
+ VMI_BOOL VMI_TestAndSetPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+ VMI_BOOL VMI_TestAndClearPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+
+ These functions act identically to the 32-bit PTE update functions,
+ but provide support for PAE mode. The calls are guaranteed to never
+ create a temporarily invalid but present page mapping that could be
+ accidentally prefetched by another processor, and all returned bits
+ are guaranteed to be atomically up to date.
+
+ One special exception: the VMI_SwapPteLong function provides
+ synchronization only against A/D bit updates from other processors,
+ not against other invocations of VMI_SwapPteLong.
+
+ 78) VMI_ClonePageTable
+ VMI_ClonePageDirectory
+
+ #define VMI_MKCLONE(start, count) (((start) << 16) | (count))
+
+ void VMI_ClonePageTable(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+ VMI_UINT32 flags);
+ void VMI_ClonePageDirectory(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+ VMI_UINT32 flags);
+
+ These functions tell the hypervisor to allocate a page shadow
+ at the PT or PD level using a shadow template. Because spare
+ bits are available in 'flags', these calls may eventually be
+ merged into one, with flag bits also indicating whether the
+ shadows are PAE.
+
+ 80) VMI_RegisterPageUsage
+ 81) VMI_ReleasePage
+
+ #define VMI_PAGE_PT 0x01
+ #define VMI_PAGE_PD 0x02
+ #define VMI_PAGE_PDP 0x04
+ #define VMI_PAGE_PML4 0x08
+ #define VMI_PAGE_GDT 0x10
+ #define VMI_PAGE_LDT 0x20
+ #define VMI_PAGE_IDT 0x40
+ #define VMI_PAGE_TSS 0x80
+
+ void VMI_RegisterPageUsage(VMI_UINT32 ppn, int flags);
+ void VMI_ReleasePage(VMI_UINT32 ppn, int flags);
+
+ These are used to register a page with the hypervisor as being of a
+ particular type, for instance, VMI_PAGE_PT says it is a page table
+ page.
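+
+ As a sketch, the lifetime of a page table page might bracket its use
+ with these calls; pte_page_ctor/dtor are hypothetical call sites:
+
+     /* Hypothetical: a freshly allocated page becomes a page table. */
+     static void pte_page_ctor(VMI_UINT32 ppn)
+     {
+         VMI_RegisterPageUsage(ppn, VMI_PAGE_PT);
+     }
+
+     /* Hypothetical: the page leaves page table service. */
+     static void pte_page_dtor(VMI_UINT32 ppn)
+     {
+         VMI_ReleasePage(ppn, VMI_PAGE_PT);
+     }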
+
+ 85) VMI_SetDeferredMode
+
+ void VMI_SetDeferredMode(VMI_UINT32 deferBits);
+
+ Set the lazy state update mode to the specified set of bits. This
+ allows the processor, hypervisor, or VMI layer to lazily update
+ certain CPU and MMU state. When setting this to a more permissive
+ setting, no flush is implied, but when clearing bits in the current
+ defer mask, all pending state will be flushed.
+
+ The 'deferBits' parameter is a mask specifying which classes of
+ state updates may be deferred:
+
+ #define VMI_DEFER_NONE 0x00
+
+ Disallow all asynchronous state updates. This is the default
+ state.
+
+ #define VMI_DEFER_MMU 0x01
+
+ Allow page table updates to be deferred. Note that page faults,
+ invalidations and TLB flushes will implicitly flush all pending
+ updates.
+
+ #define VMI_DEFER_CPU 0x02
+
+ Allow CPU state updates to control registers to be deferred, with
+ the exception of updates that change FPU state. This is useful
+ for combining a reload of the page table base in CR3 with other
+ updates, such as the current kernel stack.
+
+ #define VMI_DEFER_DT 0x04
+
+ Allow descriptor table updates to be delayed. This allows the
+ VMI_UpdateGDT / IDT / LDT calls to be asynchronously queued.
+
+ 86) VMI_FlushDeferredCalls
+
+ void VMI_FlushDeferredCalls(void);
+
+ Flush all asynchronous state updates which may be queued as
+ a result of setting deferred update mode.
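+
+ As a sketch, a run of PTE writes could be batched under deferred MMU
+ mode and flushed explicitly; the loop body is illustrative:
+
+     /* Hypothetical: queue n PTE writes, then flush them at once. */
+     static void set_pte_range(VMI_PTE *ptep, VMI_PTE first, int n)
+     {
+         int i;
+         VMI_SetDeferredMode(VMI_DEFER_MMU);
+         for (i = 0; i < n; i++)
+             VMI_SetPte(first + (i << 12), ptep + i); /* next frame */
+         VMI_FlushDeferredCalls();
+         VMI_SetDeferredMode(VMI_DEFER_NONE); /* clearing bits flushes */
+     }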
+
+
+Appendix B - VMI C prototypes
+
+ Most of the VMI calls are properly callable C functions. Note that for the
+ absolute best performance, assembly calls are preferable in some cases, as
+ they do not imply all of the side effects of a C function call, such as
+ register clobber and memory access. Nevertheless, these wrappers serve as
+ a useful interface definition for higher level languages.
+
+ In some cases, a dummy variable is passed as an unused input to force
+ proper alignment of the remaining register values.
+
+ The call convention for these is defined to be standard GCC convention with
+ register passing. The regparm call interface is documented at:
+
+ http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
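+
+ Concretely, regparm(3) places the first three 32-bit arguments in
+ EAX, EDX and ECX, in that order, and a 64-bit argument consumes a
+ register pair. This is why VMI_RDMSR takes a 64-bit dummy: it
+ occupies EAX:EDX so that 'reg' lands in ECX, as the assembly
+ interface expects. The MSR number below is only an example:
+
+     /* Hypothetical call site: dummy fills EAX:EDX, reg goes to ECX. */
+     VMI_UINT64 efer = VMI_RDMSR(0, 0xC0000080 /* EFER */);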
+
+ Types used by these calls:
+
+ VMI_UINT64 64 bit unsigned integer
+ VMI_UINT32 32 bit unsigned integer
+ VMI_UINT16 16 bit unsigned integer
+ VMI_UINT8 8 bit unsigned integer
+ VMI_INT 32 bit integer
+ VMI_UINT 32 bit unsigned integer
+ VMI_DTR 6 byte compressed descriptor table limit/base
+ VMI_PTE 4 byte page table entry (or page directory)
+ VMI_PAE_PTE 8 byte page table entry (or PDE or PDPE)
+ VMI_SELECTOR 16 bit segment selector
+ VMI_BOOL 32 bit unsigned integer
+ VMI_CYCLES 64 bit unsigned integer
+ VMI_NANOSECS 64 bit unsigned integer
+
+
+ #ifndef VMI_PROTOTYPES_H
+ #define VMI_PROTOTYPES_H
+
+ /* Insert local type definitions here */
+ typedef struct VMI_DTR {
+ VMI_UINT16 limit;
+ VMI_UINT32 offset __attribute__ ((packed));
+ } VMI_DTR;
+
+ typedef struct APState {
+ VMI_UINT32 cr0;
+ VMI_UINT32 cr2;
+ VMI_UINT32 cr3;
+ VMI_UINT32 cr4;
+
+ VMI_UINT64 efer;
+
+ VMI_UINT32 eip;
+ VMI_UINT32 eflags;
+ VMI_UINT32 eax;
+ VMI_UINT32 ebx;
+ VMI_UINT32 ecx;
+ VMI_UINT32 edx;
+ VMI_UINT32 esp;
+ VMI_UINT32 ebp;
+ VMI_UINT32 esi;
+ VMI_UINT32 edi;
+ VMI_UINT16 cs;
+ VMI_UINT16 ss;
+
+ VMI_UINT16 ds;
+ VMI_UINT16 es;
+ VMI_UINT16 fs;
+ VMI_UINT16 gs;
+ VMI_UINT16 ldtr;
+
+ VMI_UINT16 gdtrLimit;
+ VMI_UINT32 gdtrBase;
+ VMI_UINT32 idtrBase;
+ VMI_UINT16 idtrLimit;
+ } APState;
+
+ #define VMICALL __attribute__((regparm(3)))
+
+ /* CORE INTERFACE CALLS */
+ VMICALL void VMI_Init(void);
+
+ /* PROCESSOR STATE CALLS */
+ VMICALL void VMI_DisableInterrupts(void);
+ VMICALL void VMI_EnableInterrupts(void);
+
+ VMICALL VMI_UINT VMI_GetInterruptMask(void);
+ VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+ VMICALL void VMI_Pause(void);
+ VMICALL void VMI_Halt(void);
+ VMICALL void VMI_Shutdown(void);
+ VMICALL void VMI_Reboot(VMI_INT how);
+
+ #define VMI_REBOOT_SOFT 0x0
+ #define VMI_REBOOT_HARD 0x1
+
+ void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+ /* DESCRIPTOR RELATED CALLS */
+ VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+ VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+ VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+ VMICALL void VMI_SetTR(VMI_SELECTOR trSel);
+
+ VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+ VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+ VMICALL VMI_SELECTOR VMI_GetLDT(void);
+ VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+ VMICALL void VMI_WriteGDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+ VMICALL void VMI_WriteLDTEntry(void *ldt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+ VMICALL void VMI_WriteIDTEntry(void *idt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ /* CPU CONTROL CALLS */
+ VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+ VMICALL void VMI_WRMSR_SPLIT(VMI_UINT32 valLo, VMI_UINT32 valHi,
+ VMI_UINT32 reg);
+
+ /* Not truly a proper C function; use dummy to align reg in ECX */
+ VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+ VMICALL void VMI_SetCR0(VMI_UINT val);
+ VMICALL void VMI_SetCR2(VMI_UINT val);
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+ VMICALL void VMI_SetCR4(VMI_UINT val);
+
+ VMICALL VMI_UINT32 VMI_GetCR0(void);
+ VMICALL VMI_UINT32 VMI_GetCR2(void);
+ VMICALL VMI_UINT32 VMI_GetCR3(void);
+ VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+ VMICALL void VMI_CLTS(void);
+
+ VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+ VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+ /* PROCESSOR INFORMATION CALLS */
+
+ VMICALL VMI_UINT64 VMI_RDTSC(void);
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+ /* STACK / PRIVILEGE TRANSITION CALLS */
+ VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+ /* I/O CALLS */
+ /* Native port in EDX - use dummy */
+ VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+ VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+ VMICALL void VMI_IODelay(void);
+ VMICALL void VMI_WBINVD(void);
+ VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+ /* APIC CALLS */
+ VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+ VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+ /* TIMER CALLS */
+ VMICALL VMI_NANOSECS VMI_GetWallclockTime(void);
+ VMICALL VMI_BOOL VMI_WallclockUpdated(void);
+
+ /* Predefined rate of the wallclock. */
+ #define VMI_WALLCLOCK_HZ 1000000000
+
+ VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+ VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+ /* Defined cycle counters */
+ #define VMI_CYCLES_REAL 0
+ #define VMI_CYCLES_AVAILABLE 1
+ #define VMI_CYCLES_STOLEN 2
+
+ VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+ VMI_CYCLES period);
+ VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+ /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+ #define VMI_ALARM_COUNTER_MASK 0x000000ff
+
+ #define VMI_ALARM_WIRED_IRQ0 0x00000000
+ #define VMI_ALARM_WIRED_LVTT 0x00010000
+
+ #define VMI_ALARM_IS_ONESHOT 0x00000000
+ #define VMI_ALARM_IS_PERIODIC 0x00000100
+
+ /* MMU CALLS */
+ VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+ VMI_UINT32 pages, VMI_UINT32 ppn);
+
+ /* The number of VMI address translation slots */
+ #define VMI_LINEAR_MAP_SLOTS 4
+
+ VMICALL void VMI_InvalPage(VMI_UINT32 va);
+ VMICALL void VMI_FlushTLB(int how);
+
+ /* Flags used by VMI_FlushTLB call */
+ #define VMI_FLUSH_TLB 0x01
+ #define VMI_FLUSH_GLOBAL 0x02
+
+ #endif
+
+
+Appendix C - Sensitive x86 instructions in the paravirtual environment
+
+ This is a list of x86 instructions which may operate in a different manner
+ when run inside of a paravirtual environment.
+
+ ARPL - continues to function as normal, but kernel segment registers
+ may be different, so parameters to this instruction may need
+ to be modified. (System)
+
+ IRET - the IRET instruction will be unable to change the IOPL, VM,
+ VIF, VIP, or IF fields. (System)
+
+ the IRET instruction may #GP if the return CS/SS RPLs are
+ below the CPL, or are not equal to each other. (System)
+
+ LAR - the LAR instruction will reveal changes to the DPL field of
+ descriptors in the GDT and LDT tables. (System, User)
+
+ LSL - the LSL instruction will reveal changes to the segment limit
+ of descriptors in the GDT and LDT tables. (System, User)
+
+ LSS - the LSS instruction may #GP if the RPL is not set properly.
+ (System)
+
+ MOV - the mov %seg, %reg instruction may reveal a different RPL
+ on the segment register. (System)
+
+ The mov %reg, %ss instruction may #GP if the RPL is not set
+ to the current CPL. (System)
+
+ POP - the pop %ss instruction may #GP if the RPL is not set to
+ the appropriate CPL. (System)
+
+ POPF - the POPF instruction will be unable to set the hardware
+ interrupt flag. (System)
+
+ PUSH - the push %seg instruction may reveal a different RPL on the
+ segment register. (System)
+
+ PUSHF- the PUSHF instruction will reveal a possibly different IOPL,
+ and the value of the hardware interrupt flag, which is always
+ set. (System, User)
+
+ SGDT - the SGDT instruction will reveal the location and length of
+ the GDT shadow instead of the guest GDT. (System, User)
+
+ SIDT - the SIDT instruction will reveal the location and length of
+ the IDT shadow instead of the guest IDT. (System, User)
+
+ SLDT - the SLDT instruction will reveal the selector used for
+ the shadow LDT rather than the selector loaded by the guest.
+ (System, User).
+
+ STR - the STR instruction will reveal the selector used for the
+ shadow TSS rather than the selector loaded by the guest.
+ (System, User).