RE: [Xen-devel] Full virtualization and I/O

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of 
> Thomas Heinz
> Sent: 20 November 2006 23:39
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: [Xen-devel] Full virtualization and I/O
> 
> Hi
> 
> Full virtualization is about providing multiple virtual ISA level
> environments and mapping them to a single physical one. One particular
> aspect of this mapping are I/O instructions (explicit or mmapped I/O). In
> general, there are two strategies to partition the devices, either in time
> or in space. Partitioning a device in space means that the device (or a
> part of it) is exclusively available to a single VM. Partitioning a device
> in time (or time multiplexing) means that it can be used by multiple VMs
> but only one VM may use it at any point in time.

The Xen approach is to not allow any sharing of devices: a device is
owned by one domain, and no other domain can directly access it.
There is a protocol of so-called frontend/backend drivers. The frontend
is basically a dummy device that forwards a request to another domain
(normally domain 0). The other half of the driver pair picks up this
data and forwards it to some processing task, which then sends the
packet on to the real hardware.
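
As a rough illustration of the frontend/backend split described above,
here is a minimal Python sketch. All names in it (SharedRing, Frontend,
Backend, notify) are made up for this example and are not Xen's actual
interfaces; the direct call to notify() merely stands in for whatever
signalling happens between the two domains.

```python
from collections import deque


class SharedRing:
    """Hypothetical stand-in for a queue shared between two domains."""

    def __init__(self):
        self.requests = deque()
        self.responses = deque()


class Backend:
    """Runs in the domain that owns the real device (normally domain 0)."""

    def __init__(self, ring, real_device):
        self.ring = ring
        self.real_device = real_device  # a list standing in for the hardware

    def notify(self):
        # Drain every pending request and pass it on to the "hardware".
        while self.ring.requests:
            req_id, payload = self.ring.requests.popleft()
            self.real_device.append(payload)
            self.ring.responses.append((req_id, "ok"))


class Frontend:
    """Dummy device in the guest: it only forwards requests."""

    def __init__(self, ring, backend):
        self.ring = ring
        self.backend = backend
        self.next_id = 0

    def send(self, payload):
        req_id = self.next_id
        self.next_id += 1
        self.ring.requests.append((req_id, payload))
        self.backend.notify()  # stands in for signalling the other domain
        return self.ring.responses.popleft()


hardware = []
ring = SharedRing()
frontend = Frontend(ring, Backend(ring, hardware))
reply = frontend.send(b"packet-1")
```

The guest never touches `hardware` itself; only the backend does.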

For fully virtualized mode (hardware-supported virtual machine, such as
AMD-V or Intel VT, aka HVM), there is a different model, where a "device
model" is involved to perform the hardware modelling. In Xen, this uses
a modified version of qemu (called qemu-dm), which has a fairly
complete set of "hardware" in its model. It has, for example, an IDE
controller, several types of network devices, graphics and
mouse/keyboard models. The things you'd usually find in a PC, that is.
The way it works is that the hypervisor intercepts IOIO and memory
mapped IO accesses that match the devices involved (such as the
A0000-BFFFF region for VGA frame buffer memory or the 0x1F0-0x1F7 IO
ports for the IDE controller), and forwards a request to qemu-dm, where
the operation changes the current state of the modelled device. When
necessary, the state change results in, for example, a read request to
the "hard disk" (which may be a real disk, a file on a local disk, or a
file on a network storage device, to give some examples).
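
To make the intercept-and-forward idea a bit more concrete, here is a
small Python sketch. It is not how the hypervisor or qemu-dm is actually
written: IdeModel, Hypervisor and handle_io are hypothetical names, and
the "IDE" model is just eight plain read/write registers, far simpler
than a real controller. Only the port range 0x1F0-0x1F7 is taken from
the text above.

```python
class IdeModel:
    """Toy device model covering the IDE I/O ports 0x1F0-0x1F7."""

    FIRST_PORT, LAST_PORT = 0x1F0, 0x1F7

    def __init__(self):
        self.regs = [0] * 8  # one byte of state per port (an assumption)

    def write(self, port, value):
        self.regs[port - self.FIRST_PORT] = value & 0xFF

    def read(self, port):
        return self.regs[port - self.FIRST_PORT]


class Hypervisor:
    """Forwards intercepted port accesses to the matching device model."""

    def __init__(self):
        self.models = []

    def register(self, model):
        self.models.append(model)

    def handle_io(self, port, is_write, value=0):
        for model in self.models:
            if model.FIRST_PORT <= port <= model.LAST_PORT:
                if is_write:
                    model.write(port, value)
                    return None
                return model.read(port)
        raise LookupError("no device model claims port 0x%X" % port)


hv = Hypervisor()
hv.register(IdeModel())
hv.handle_io(0x1F2, is_write=True, value=4)   # guest "out" instruction
sector_count = hv.handle_io(0x1F2, is_write=False)  # guest "in" instruction
```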

There is also the option of using the frontend drivers as described
above in the fully virtualized model. 

Finally, while I'm on the subject of fully virtualized mode: It is
currently not possible to give a DMA-based device to a fully virtualized
domain. The reason for this is that the guest OS will have been told
that its memory is at 0..256MB (say), while the actual machine physical
address is at 256MB..512MB. The OS is completely unaware of this
"mismatch". So the OS will perform some operation to take a virtual
address of some buffer (say a network packet) and make it into a
"physical address", which will be an address in the range 0..256MB.
If that address is handed to the device for DMA, it will of course (at
least) lead to the wrong data being transmitted, as the actual data is
somewhere in the range 256MB..512MB. The only solution to this is to
have an IOMMU, which can translate the guest's understanding of a
physical address (0..256MB) to a machine physical address
(256MB..512MB).
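
The arithmetic of that example can be sketched in a few lines of Python.
The numbers are the ones used above (guest sees 0..256MB, the memory
really lives at 256MB..512MB); guest_to_machine is a hypothetical helper,
and a real IOMMU would translate through per-page tables rather than a
single fixed offset.

```python
MB = 1024 * 1024
GUEST_SIZE = 256 * MB     # the guest believes its memory is 0..256MB
MACHINE_BASE = 256 * MB   # where that memory really lives in the machine


def guest_to_machine(guest_phys):
    """Translate a guest 'physical' address to a machine physical one."""
    if not 0 <= guest_phys < GUEST_SIZE:
        raise ValueError("address outside guest memory")
    return MACHINE_BASE + guest_phys


guest_buffer = 16 * MB  # address the guest OS would program into the device

dma_without_iommu = guest_buffer                   # device reads the wrong memory
dma_with_iommu = guest_to_machine(guest_buffer)    # device reads the real buffer
```

Without translation the device would fetch from 16MB, which belongs to
someone else; with it, the access lands at 272MB inside the guest's
actual memory.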

> 
> I am trying to understand how I/O virtualization on the ISA level works if
> a device is shared between multiple VM instances. On a very high level, it
> should be as follows. First of all, the VMM has to intercept the VM's I/O
> commands (I/O instructions or load/store to dedicated memory addresses -
> let's ignore interrupts for the moment). This could be done by traps or by
> replacing the resp. instructions by VMM calls to I/O primitives. The VMM
> keeps multiple device model instances (one for each VM using the device)
> in memory. The models somehow reflect the low level I/O API of the device.
> Depending on which I/O command is issued by the VM, either the memory
> model is changed or a number of I/O instructions are executed to make the
> physical device state reflect the one represented in the memory model.

Do you by ISA mean "Instruction Set Architecture" or something else (I
presume it's NOT meaning ISA-bus...)?

Intercepting IOIO instructions or MMIO instructions is not that hard -
in HVM, both processor architectures (AMD-V and Intel VT) have specific
intercepts and bitmaps to indicate which IO instructions should be
intercepted. MMIO requires the page tables to be set up such that the
memory-mapped region is mapped "not present", so that any access to this
region gives a page fault. The page fault is then analyzed to see if
it's for an MMIO address or a "real page fault".
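
A simplified Python sketch of those two checks follows. It assumes one
bit per I/O port in the bitmap and a plain list of MMIO ranges;
InterceptConfig and its methods are invented for illustration and do not
correspond to any actual hypervisor data structure. The port and address
ranges are the IDE and VGA examples mentioned earlier.

```python
class InterceptConfig:
    """Per-guest record of which accesses should trap to the hypervisor."""

    def __init__(self):
        self.io_bitmap = bytearray(65536 // 8)  # one bit per I/O port
        self.mmio_ranges = []                   # (start, end), inclusive

    def intercept_port(self, port):
        self.io_bitmap[port // 8] |= 1 << (port % 8)

    def port_is_intercepted(self, port):
        return bool(self.io_bitmap[port // 8] & (1 << (port % 8)))

    def add_mmio_range(self, start, end):
        # In a real system the pages covering this range would also be
        # mapped "not present" so that any access faults.
        self.mmio_ranges.append((start, end))

    def classify_fault(self, addr):
        """Decide what a page fault at `addr` means."""
        for start, end in self.mmio_ranges:
            if start <= addr <= end:
                return "mmio"
        return "real page fault"


cfg = InterceptConfig()
for port in range(0x1F0, 0x1F8):        # IDE controller ports
    cfg.intercept_port(port)
cfg.add_mmio_range(0xA0000, 0xBFFFF)    # VGA frame buffer
```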

For para-virtualization, the model is similar, but the exact model of
how to intercept the IOIO or MMIO instruction is slightly different -
but in essence it's the same principle. Let me know if you really need
to know how Xen goes about doing this, as it's quite complicated (more
so than the HVM version, for sure). 


> 
> This approach brings up a number of questions. It would be great if some of
> the virtualization experts here could shed some light on them (even though
> they are not immediately related to Xen, I know):
> 
> - How do these device memory models look like? Is there a common
>   (automata) theory behind or are they done ad hoc?

Not sure what you're asking for here. The devices are either modeled
after a REAL device (qemu-dm), and as such resemble as closely as
possible the REAL hardware device being emulated, or, in the
frontend/backend driver case, follow an "idealized model": the request
contains just the basic data that the OS normally provides to the
driver, and it's placed in a queue with a message-signaling system to
tell the other side that there is something in the queue.

> - What kind of strategies/algorithms are used in the merge phase, i.e. the
>   phase where the virtual memory model and the physical one are
>   synchronized? What kind of problems can occur in this phase?

The Xen approach is to avoid this by giving each device to only one
domain.

> - Are specific usage patterns used in real world implementations (e.g.
>   VMWare) to simplify the virtualization (model or merge phase)?

This is probably the wrong list to ask detailed questions about how
VMWare works... ;-)

> - Do you have any interesting pointers to literature dealing with full I/O
>   virtualization? In particular, how does VMWare's full virtualization
>   works with respect to I/O?

Again, wrong list for VMWare questions. 

> - Is every device time partitionable? If not, which requirements does it
>   have to meet to be time partitionable?

Certainly not - I would say that almost all devices are NOT time
partitionable, as the state in the device is dependent on the current
usage. The more complex the device is, the more likely it is to have
difficulties, but even such a simple device as a serial port would
struggle to work in a time-shared fashion. Serial ports are generally
used for multiple transactions that make up a whole "bigger picture
transaction". For example, a web server connected via a serial modem
would send a packet of several hundred bytes to the serial port driver,
which is then portioned out as and when the serial port is ready to send
another few bytes. If you switch from one guest to another during this
process, and the second guest also has something to send on the serial
port, you'd end up with a very scrambled message from the first guest
and quite likely the second guest's message completely lost!
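
The scrambling is easy to reproduce in a toy model. The Python sketch
below is purely illustrative (SerialPort and time_share are hypothetical
names): it hands the port to each guest for a few bytes at a time and
records what ends up on the wire. It is the optimistic case, since no
bytes are dropped; a real port could lose data as well.

```python
class SerialPort:
    """Toy serial port: every byte sent is appended to the wire."""

    def __init__(self):
        self.wire = bytearray()

    def send_byte(self, byte):
        self.wire.append(byte)


def time_share(port, messages, chunk):
    """Round-robin between guests, `chunk` bytes per time slice."""
    offsets = [0] * len(messages)
    while any(off < len(msg) for off, msg in zip(offsets, messages)):
        for i, msg in enumerate(messages):
            for byte in msg[offsets[i]:offsets[i] + chunk]:
                port.send_byte(byte)
            offsets[i] += chunk


port = SerialPort()
time_share(port, [b"HELLO WORLD", b"goodbye"], chunk=4)
scrambled = bytes(port.wire)
```

Neither guest's message arrives intact: the receiver sees the two byte
streams interleaved in 4-byte pieces.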

There are some devices that are specifically built to manage multiple
hosts, but other than that, any sharing of a device requires some
software to gather up "a full transaction" and then send that to the
actual hardware, often also waiting for the transaction to complete (for
example the interrupt signal to say that the hard disk write is
complete).
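
A minimal sketch of such transaction-level sharing, in Python, might
look like the following. TransactionMux is a hypothetical name, the
"hardware" is just a byte buffer, and appending to `completed` stands in
for waiting on the completion interrupt.

```python
from collections import deque


class TransactionMux:
    """Gathers whole transactions and sends them to the device one by one."""

    def __init__(self):
        self.pending = deque()
        self.wire = bytearray()  # stands in for the actual hardware

    def submit(self, guest, data):
        # A guest hands over a complete transaction, not single bytes.
        self.pending.append((guest, bytes(data)))

    def run(self):
        completed = []
        while self.pending:
            guest, data = self.pending.popleft()
            self.wire.extend(data)   # whole transaction goes out uninterrupted
            completed.append(guest)  # stands in for the completion interrupt
        return completed


mux = TransactionMux()
mux.submit("guest-a", b"HELLO WORLD")
mux.submit("guest-b", b"goodbye")
done = mux.run()
```

Because the device only ever sees one complete transaction at a time,
each guest's data stays contiguous on the wire.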


>   -> I don't think every device is. What about a device which supports
>      different modes of operation. If two VMs drive the virtual device in
>      different modes, it may not be possible to constantly switch between
>      them. Ok, this is pretty artificial.

A particular problem is devices where you can't necessarily read back
the last mode setting, which may well be the case in many different
devices. You can't, for example, read back all the registers on an IDE
device, because a read of a particular address may give the status
rather than the command last sent, or some such.
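
To show why this makes saving and restoring device state awkward, here
is a toy Python register where writes and reads at the same address hit
different things. StatusCommandRegister is a made-up class, and the
status and command values are arbitrary numbers chosen for illustration.

```python
class StatusCommandRegister:
    """One address: writes latch a command, reads return the status."""

    def __init__(self):
        self._command = None   # not visible to software once written
        self._status = 0x50    # arbitrary "ready" value, illustration only

    def write(self, value):
        self._command = value

    def read(self):
        return self._status


reg = StatusCommandRegister()
reg.write(0xEC)         # some command byte
readback = reg.read()   # yields the status, not 0xEC
```

Software that wants to switch the device between two guests cannot
recover the last command by reading the register; it would have to have
recorded every write itself.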

--
Mats
> 
> Thanks a lot for your help!
> 
> 
> Best wishes
> 
> Thomas
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel