RE: [Xen-devel] Full virtualization and I/O

 

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of 
> Thomas Heinz
> Sent: 23 November 2006 16:23
> To: Petersson, Mats
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] Full virtualization and I/O
> 
> Hi Mats
> 
> Thanks a lot for your detailed reply!
> 
> You wrote:
> > For fully virtualized mode (hardware supported virtual machine, such as
> > AMD-V or Intel VT, aka HVM), there is a different model, where a "device
> > model" is involved to perform the hardware modelling. In Xen, this is
> > using a modified version of qemu (called qemu-dm), which has a fairly
> > complete set of "hardware" in its model. It's got for example an IDE
> > controller, several types of network devices, graphics and
> > mouse/keyboard models. The things you'd usually find in a PC, that is.
> > The way it works is that the hypervisor intercepts IOIO and memory
> > mapped IO regions that match the devices involved (such as the
> > A0000-BFFFF region for VGA frame buffer memory or the 0x1F0-0x1F7 IO
> > ports for the IDE controller), and forwards a request from the
> > hypervisor to qemu-dm, where the operation changes the current state,
> > and when it's necessary, the state-change will result in for example a
> > read-request to the "hard-disk" (which may be a real disk, a file on a
> > local disk, or a file on a network storage device, to give some
> > examples).
> 
> This is very interesting. So qemu models the low level device 
> interface 
> (I/O interface) in software and translates I/O actions to 
> either model 
> changes or to library or system calls (since QEMU runs as 
> normal process).
> 
> Is there any documentation about this or is the source the doc ;)

I haven't looked for any documentation - for the work I've done using
QEMU, I've just used the source as docs. It's a fairly large project, so
there may be some docs somewhere.... 


> 
> > Do you by ISA mean "Instruction Set Architecture" or 
> something else (I
> > presume it's NOT meaning ISA-bus...)?
> 
> Yes, I mean instruction set architecture.
> 
> > Intercepting IOIO instructions or MMIO instructions is not 
> that hard -
> > in HVM the two processor architectures have specific intercepts and
> > bitmaps to indicate which IO instructions should be 
> intercepted. MMIO
> > will require the page-tables to be set up such that the 
> memory mapped
> > region is mapped "not present" so that any operation to this region
> > gives a page-fault, and then the page-fault is analyzed to 
> see if it's
> > for a MMIO address or for a "real page fault".
> >
> > For para-virtualization, the model is similar, but the 
> exact model of
> > how to intercept the IOIO or MMIO instruction is slightly 
> different -
> > but in essence it's the same principle. Let me know if you 
> really need
> > to know how Xen goes about doing this, as it's quite 
> complicated (more
> > so than the HVM version, for sure).
> 
> Although it is interesting to see how interception works in 
> detail, I am 
> currently more interested in how device state is modelled and 
> translated 
> into system/library calls or sequences of I/O instructions. 
> So, in fact 
> the operation after the interception has taken place.

Ok, so if we take as an example an IDE block read, it consists of several
IO instructions. Assuming void outb(uint16 port_no, uint8 value) is a
function that writes one byte to an IO port:

outb(0x1f2, sector_count);
outb(0x1f3, sector_number);
outb(0x1f4, cylinder_lsb);
outb(0x1f5, cylinder_msb);
outb(0x1f6, drive_head);
outb(0x1f7, command);

[In LBA mode, sector_number and the two cylinder numbers (and I think
the head part of drive_head) convert into a "large" sector number,
rather than cylinder/head/sector combination]. 
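
As a rough sketch of that conversion, assuming the usual ATA LBA28 bit
layout (the function names here are made up for illustration and are not
taken from QEMU):

```c
#include <stdint.h>

/* Combine the IDE task-file register values into a 28-bit LBA sector number.
 * Assumed layout (conventional ATA LBA28):
 *   0x1f3 -> LBA bits 0-7,   0x1f4 -> LBA bits 8-15,
 *   0x1f5 -> LBA bits 16-23, low 4 bits of 0x1f6 -> LBA bits 24-27. */
uint32_t lba28_from_regs(uint8_t sector_number, uint8_t cylinder_lsb,
                         uint8_t cylinder_msb, uint8_t drive_head)
{
    return (uint32_t)sector_number
         | ((uint32_t)cylinder_lsb << 8)
         | ((uint32_t)cylinder_msb << 16)
         | ((uint32_t)(drive_head & 0x0f) << 24);
}

/* Bit 6 of the drive/head register selects LBA (1) or CHS (0) addressing. */
int uses_lba(uint8_t drive_head)
{
    return (drive_head & 0x40) != 0;
}
```

For example, register values 0x78, 0x56, 0x34 and a drive_head of 0xE2
would combine into sector number 0x02345678.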

In QEMU, the initial 5 writes will just change the internal state of the
IDE controller (i.e. the number of sectors, sector/cylinder numbers,
etc, are just stored in some per-controller data structure). Note that
disk0 and disk1 on one controller share the same register set - the drive
bit in drive_head selects drive 0 or drive 1. 

The sixth out (in our sequence; it's based on the address, not the
number of writes) will tell QEMU that the sequence is a complete
transaction, and it will go ahead and perform the read of the storage
that corresponds to the IDE device (such as a file or partition). 

The data read is stored in a "per device" buffer. When the operation is
complete on the device side, the guest will be informed of this via a
virtual interrupt. This will, assuming normal behaviour, then trigger a
512-byte read (in the form of 16-bit "in" or "ins" instructions with a
port address of 0x1f0), where the data is read by the guest into whatever
memory it wanted to use. 

A write operation, on the other hand, is of course complete only once
the 512-byte sector has been written using the "out" or "outs"
instruction to port 0x1f0. 
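
To make the read sequence a bit more concrete, here is a minimal sketch of
such a device model. It is not QEMU's actual code: the struct, the field
names and the function names are all made up for illustration, the
"disk" is just a block of memory, and it only handles a single-sector
LBA read (command 0x20, READ SECTORS) with no status register or error
handling.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE      512
#define CMD_READ_SECTORS 0x20   /* ATA READ SECTORS command code */

/* Hypothetical per-controller state; all names are illustrative. */
struct ide_model {
    uint8_t sector_count, sector_number, cylinder_lsb, cylinder_msb, drive_head;
    uint8_t buffer[SECTOR_SIZE];  /* the "per device" data buffer */
    unsigned buf_pos;             /* next byte to hand out at port 0x1f0 */
    int irq_pending;              /* stands in for the virtual interrupt */
    const uint8_t *disk;          /* backing storage, plain memory here */
    uint32_t disk_sectors;        /* size of the backing storage in sectors */
};

/* Called for every intercepted byte-wide "out" to the IDE port range. */
void ide_outb(struct ide_model *s, uint16_t port, uint8_t value)
{
    switch (port) {
    /* The first five writes only update the stored state. */
    case 0x1f2: s->sector_count  = value; break;
    case 0x1f3: s->sector_number = value; break;
    case 0x1f4: s->cylinder_lsb  = value; break;
    case 0x1f5: s->cylinder_msb  = value; break;
    case 0x1f6: s->drive_head    = value; break;
    /* The write to the command port completes the transaction. */
    case 0x1f7:
        if (value == CMD_READ_SECTORS) {
            uint32_t lba = (uint32_t)s->sector_number
                         | ((uint32_t)s->cylinder_lsb << 8)
                         | ((uint32_t)s->cylinder_msb << 16)
                         | ((uint32_t)(s->drive_head & 0x0f) << 24);
            if (lba < s->disk_sectors) {
                memcpy(s->buffer, s->disk + (size_t)lba * SECTOR_SIZE,
                       SECTOR_SIZE);
                s->buf_pos = 0;
                s->irq_pending = 1;   /* tell the guest the data is ready */
            }
        }
        break;
    default:
        break;
    }
}

/* Called for every intercepted 16-bit "in" from the data port 0x1f0. */
uint16_t ide_inw(struct ide_model *s, uint16_t port)
{
    if (port != 0x1f0 || s->buf_pos + 2 > SECTOR_SIZE)
        return 0xffff;
    uint16_t word = (uint16_t)(s->buffer[s->buf_pos]
                               | (s->buffer[s->buf_pos + 1] << 8));
    s->buf_pos += 2;
    return word;
}
```

The guest side would then be the six outb() calls shown earlier, followed
(after the interrupt) by 256 calls that end up in ide_inw().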

> 
> > Not sure what you're asking for here. Since the devices are either
> > modeled after a REAL device (qemu-dm) and as such will resemble as
> > closely as possible the REAL hardware device that it's 
> emulating, or in
> > the frontend/backend driver, there is an "idealized model", 
> such that
> > the request contains just the basic data that the OS 
> provides normally
> > to the driver, and it's placed in a queue with a message-signaling
> > system to tell the other side that it's got something in the queue.
> 
> I am basically asking about general/theoretical concepts 
> behind device 
> modelling as e.g. done by qemu. I think it's a good idea to 
> understand how 
> qemu actually does this.
> 
> > Certainly not - I would say that almost all devices are NOT time
> > partitionable, as the state in the device is dependent on the current
> > usage. The more complex the device is, the more likely it is to have
> > difficulties, but even such a simple device as a serial port would
> > struggle to work in a time-shared fashion (not to mention that serial
> > ports generally are used for multiple transactions to make a whole
> > "bigger picture transaction", so for example a web-server connected via
> > a serial modem would send a packet of several hundred bytes to the
> > serial port driver, which is then portioned out as and when the serial
> > port is ready to send another few bytes. If you switch from one guest to
> > another during this process, and the second guest also has something to
> > send on the serial port, you'd end up with a very scrambled message from
> > the first guest and quite likely the second guest's message completely
> > lost!).
> 
> Very nice example. Clearly, high level driver interfaces (e.g. 
> send/receive, read/write) can be designed in a way that 
> time-sharing is 
> possible, e.g. using message/transaction queues. On the I/O 
> level, it is 
> likely to be harder to reconstruct the "full transaction". It 
> might also 
> be necessary to make assumptions about the actual guest, i.e. 
> the way the 
> device is being used.

Yes, this is essentially how the frontend/backend drivers work. They
send a complete "high level" message to, for example, send an Ethernet
packet or write a sector to the disk. As the message is "complete" (not
dependent on other messages), it's entirely possible (and in fact I
believe that's how Xen works) to use a single back-end driver (per
device type) for multiple front-end drivers. 
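
A rough sketch of the idea, with a made-up request layout and a plain
in-memory ring (this is NOT the real Xen ring or block protocol, just an
illustration of "each message is complete on its own"):

```c
#include <stdint.h>

#define RING_SIZE 8   /* number of slots in the queue */

/* Hypothetical self-contained block request: everything the back-end
 * needs is in the message, so requests from different guests can be
 * interleaved freely. */
struct blk_request {
    uint16_t guest_id;   /* which front-end sent it */
    uint8_t  is_write;   /* 0 = read, 1 = write */
    uint64_t sector;     /* sector number on that guest's virtual disk */
    uint64_t id;         /* echoed back in the response */
};

struct ring {
    struct blk_request slots[RING_SIZE];
    unsigned prod;       /* total requests put so far */
    unsigned cons;       /* total requests taken so far */
};

/* Front-end side: queue one request. Returns -1 if the ring is full. */
int ring_put(struct ring *r, struct blk_request req)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;
    r->slots[r->prod % RING_SIZE] = req;
    r->prod++;
    return 0;
}

/* Back-end side: take the oldest request. Returns -1 if the ring is empty. */
int ring_get(struct ring *r, struct blk_request *out)
{
    if (r->prod == r->cons)
        return -1;
    *out = r->slots[r->cons % RING_SIZE];
    r->cons++;
    return 0;
}
```

A real implementation would additionally need shared memory between the
domains, memory barriers and a notification mechanism; none of that is
shown here.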

Of course, if we're talking about disk access, there is another
complication, which has nothing to do with the actual physical
interface: the meta-data that is the "filesystem" will also need to be
guaranteed to be "correct". Most filesystems have a whole lot of
different data structures (such as list of free blocks, directory
structures, file-name-to-directory-entry binary tree, etc). If you have
two guest operating systems writing to the same "disk", the filesystem
will most certainly get corrupted... For example, imagine that both
systems are creating a new file, and pick the same block from the free
block list... Or deleting files at the same time and putting two
different free blocks into the same free block list entry... 

So, even if you could share the device-interface, the consistency of the
actual device would not be good if two guests DID share the
disk-interface to the same physical instance of a "disk" (whether it's
ACTUALLY a real disk or a file-based disk-image that "pretends" to be a
disk). 
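
The free-block collision can be shown with a toy example. The
"filesystem" here is entirely hypothetical (one byte per block, 0 = free,
1 = in use), and each guest works from its own cached copy of the
allocation map, as a real OS typically would:

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 16

/* Each guest's private, cached view of the on-disk allocation map. */
struct guest_fs {
    uint8_t used[NUM_BLOCKS];
};

/* "Mount": read the allocation map from the shared disk into the cache. */
void guest_mount(struct guest_fs *g, const uint8_t *on_disk_map)
{
    memcpy(g->used, on_disk_map, NUM_BLOCKS);
}

/* Allocate the first free block according to this guest's cached view.
 * Returns the block number, or -1 if no block is free. */
int guest_alloc_block(struct guest_fs *g)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (!g->used[i]) {
            g->used[i] = 1;
            return i;
        }
    }
    return -1;
}
```

If two guests mount the same map and each then calls
guest_alloc_block(), both get the same block number back, and each
believes it owns that block.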

> 
> > A particular problem is devices where you can't necessarily read back
> > the last mode-setting, which may well be the case in many different
> > devices. You can't, for example, read back all the registers on an IDE
> > device, because the read of a particular address may give the status
> > rather than the current command sent, or some such.
> 
> This could be stored in memory when you have a virtual 
> (in-memory) device 
> model.

Sure, that's how it works in QEMU - but that requires that you intercept
the actual operation and store the individual steps of a full
transaction.
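
As a small sketch of that (illustrative names only, not QEMU's code):
port 0x1f7 gives the status on a read but takes a command on a write, so
the model keeps its own copy of the last command written.

```c
#include <stdint.h>

#define IDE_PORT_CMD_STATUS 0x1f7  /* write: command, read: status */
#define IDE_STATUS_DRDY     0x40   /* "drive ready" bit in the status byte */

/* Hypothetical shadow state kept by the device model. */
struct ide_shadow {
    uint8_t last_command;  /* remembered copy of the last write to 0x1f7 */
    uint8_t status;        /* what a read of 0x1f7 returns */
};

/* Intercepted write: remember the value, since the guest (or anyone
 * else) cannot read it back from the port later. */
void shadow_outb(struct ide_shadow *s, uint16_t port, uint8_t value)
{
    if (port == IDE_PORT_CMD_STATUS)
        s->last_command = value;
}

/* Intercepted read: the same port address returns the status, not the
 * command that was written. */
uint8_t shadow_inb(const struct ide_shadow *s, uint16_t port)
{
    if (port == IDE_PORT_CMD_STATUS)
        return s->status;
    return 0xff;   /* ports this sketch does not model */
}
```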

--
Mats
> 
> 
> Best wishes
> 
> Thomas
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
> 
> 
> 


