RE: [Xen-devel] Questioning the Xen Design of the VMM

> -----Original Message-----
> From: Al Boldi [mailto:a1426z@xxxxxxxxx] 
> Sent: 09 August 2006 13:53
> To: Petersson, Mats
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Petersson, Mats wrote:
> > > > Al Boldi wrote:
> > > > > I may be missing something, but why should the Xen-design
> > > > > require the guest to be patched?
> >
> > The main reason to use a para-virtual kernel is that it performs better
> > than the fully virtualized version.
> >
> > > So HVM solves the problem, but why can't this layer be implemented in
> > > software?
> >
> > It CAN, and has been done.
> 
> You mean full virtualization using binary translation in software?

Yes, exactly - or other types of "full virtualization" using software.
I haven't made a complete inventory of "different technologies used for
virtualization on x86", so I can't really say - my job with AMD and Xen
is to implement in Xen the parts that support AMD virtualization,
not to understand every VM architecture available in the world...
> 
> My understanding was, that HVM implies full virtualization without the need
> for binary translation in software.

Yes, that's generally correct. In detail, there are some variants
of execution where this breaks down, but those are obscure corner cases
rather than the normal behaviour. In particular, Intel's VT doesn't
support running real-mode inside a virtual machine, so if the guest is
running in real-mode, it requires some form of "emulation" (actually, the
current solution uses the VM86 mode of the processor, and then only
has to emulate the opcodes that fault when run in VM86 mode). There
are some things that we (AMD) didn't get perfectly right either, and
that could be improved...
> 
> > It is however, a little bit difficult to
> > cover some of the "strange" corner cases, as the x86 processor wasn't
> > really designed to handle virtualization natively [until these
> > extensions were added].
> 
> You mean AMDV/IntelVT extensions?

Yes. 
> 
> If so, then these extensions don't actively participate in the act of
> virtualization, but rather fix some x86-arch shortcomings, that make it
> easier for software (i.e. Xen) to virtualize, thus circumventing the need to
> do binary translation.  Is this a correct reading?

Not sure what your exact meaning is here.

What do you mean by "actively participate in the act of virtualization"?
Please clarify, and give an example of an architecture where the hardware is
ACTIVELY taking part in the virtualization - do you mean a hardware
implementation of a hypervisor? [Again, I haven't spent an awful lot
of time trying to understand how/what can and can't be done in other
architectures - as far as I understand it, both AMD's and Intel's
virtualization technologies are fairly close "copies" of IBM's original
implementation on the 360 series machines, so I expect that what
can/can't be done in that is what can/can't be done in the x86 world.]

I do agree that it removes the need for binary translation and
emulation, and makes the software that manages the VMs easier to
write. It also allows more selective intercepts than, for example, ring
compression, where all protected instructions fault whether or not the
hypervisor actually needs to intercept them. For example, it's completely
useless for the hypervisor to know when the guest reads or writes CR2, but
CR2 is a protected register, so the access is going to be intercepted when
the kernel is ring-compressed. The result is fewer intercepts. It's also
easier to determine the actual intercept reason on a virtualization-enhanced
processor, since it gives an "exitcode" to indicate the reason for the
"exit" back to the hypervisor.
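
To make the "selective intercepts plus exitcode" point concrete, here is a
rough sketch in C. It is not Xen code and not the real SVM or VT data layout:
the enum values, the one-bit-per-event mask and the handler strings are all
made up for illustration. It only shows the shape of the idea: the hypervisor
chooses which events cause an exit, and then dispatches on the reported code.

```c
#include <stdint.h>

/* Hypothetical exit codes for illustration; real hardware defines its own. */
enum exit_code {
    EXIT_CR0_WRITE,
    EXIT_CR2_ACCESS,
    EXIT_CLI,
    EXIT_IO_PORT,
    EXIT_CODE_COUNT
};

/* One bit per exit code: a set bit means "exit to the hypervisor". */
typedef uint32_t intercept_mask_t;

static intercept_mask_t intercept_enable(intercept_mask_t m, enum exit_code c)
{
    return m | (UINT32_C(1) << c);
}

static int is_intercepted(intercept_mask_t m, enum exit_code c)
{
    return (int)((m >> c) & 1u);
}

/* A mask that leaves out CR2, which the hypervisor has no need to see. */
static intercept_mask_t default_intercepts(void)
{
    intercept_mask_t m = 0;
    m = intercept_enable(m, EXIT_CR0_WRITE);
    m = intercept_enable(m, EXIT_CLI);
    m = intercept_enable(m, EXIT_IO_PORT);
    return m;
}

/* Dispatch on the reported exit code instead of decoding a fault. */
static const char *handle_exit(enum exit_code code)
{
    switch (code) {
    case EXIT_CR0_WRITE: return "emulate CR0 write";
    case EXIT_CLI:       return "clear virtual interrupt flag";
    case EXIT_IO_PORT:   return "emulate I/O port access";
    default:             return "unexpected exit";
    }
}
```

With ring compression there is no equivalent of default_intercepts(): every
protected instruction faults, and the handler has to work out for itself what
the guest was trying to do.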

> 
> > This is why you end up with binary translation
> > in VMWare for example. For example, let's say that we use the method of
> > "ring compression" (which is when the guest-OS is moved from Ring 0
> > [full privileges] to Ring 1 [less than full privileges]), and the
> > hypervisor wants to have full control of interrupt flags:
> >
> > some_function:
> >     ...
> >     pushf                   // Save interrupt flag.
> >     cli                     // Disable interrupts
> >     ...
> >     ...
> >     ...
> >     popf                    // Restore interrupt flag.
> >     ...
> >
> > In Ring 0, all this works just fine - but of course, we don't know that
> > the guest-OS tried to disable interrupts, so we have to change
> > something. In Ring 1, the guest can't disable interrupts, so the CLI
> > instruction can be intercepted. Great. But pushf/popf is a valid
> > instruction in all four rings - it just doesn't change the interrupt
> > enable flag in the flags register if you're not allowed to use the
> > CLI/STI instructions! So, that means that interrupts are disabled
> > forever after [until an STI instruction gets found by chance, at least].
> >
> >
> > And if the next bit of code is:
> >
> >     mov     someaddress, eax                // someaddress is updated by an interrupt!
> > $1:
> >     cmp     someaddress, eax                // Check it...
> >     jz      $1
> >
> > Then we'd very likely never get out of there, since the actual interrupt
> > causing someaddress to change is believed by the VMM to be disabled.
> >
> > There is no real way to make popf trap [other than supplying it with
> > invalid arguments in virtual 8086 mode, which isn't really a practical
> > thing to do here!]
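
The pushf/cli/popf problem quoted above can be modelled in a few lines of C.
This is only a toy model under simplifying assumptions: the struct, the
function names and the "virtual_if" field (the VMM's belief about the guest's
interrupt flag) are invented for illustration, and only the IF bit is
modelled. The one hardware rule it encodes is that CLI faults when CPL > IOPL,
while popf in the same situation silently leaves IF alone.

```c
#include <stdint.h>

#define FLAG_IF UINT32_C(0x200)   /* interrupt enable flag, bit 9 of EFLAGS */

struct guest {
    uint32_t eflags;     /* the real flags register                    */
    unsigned cpl;        /* ring the guest kernel runs in              */
    unsigned iopl;       /* I/O privilege level                        */
    int      virtual_if; /* hypothetical: what the VMM believes IF is  */
};

static uint32_t guest_pushf(const struct guest *g)
{
    return g->eflags;                /* valid in every ring, never traps */
}

static void guest_cli(struct guest *g)
{
    if (g->cpl > g->iopl)
        g->virtual_if = 0;           /* faults; the VMM records the intent */
    else
        g->eflags &= ~FLAG_IF;       /* native behaviour */
}

static void guest_popf(struct guest *g, uint32_t saved)
{
    if (g->cpl <= g->iopl)
        g->eflags = (g->eflags & ~FLAG_IF) | (saved & FLAG_IF);
    /* else: IF is silently left unchanged, no trap, the VMM sees nothing */
}

/* Run pushf; cli; popf and report whether interrupts end up enabled. */
static int interrupts_enabled_after_sequence(unsigned cpl)
{
    struct guest g = { FLAG_IF, cpl, 0, 1 };
    uint32_t saved = guest_pushf(&g);

    guest_cli(&g);
    guest_popf(&g, saved);

    if (g.cpl <= g.iopl)
        return (g.eflags & FLAG_IF) != 0;
    return g.virtual_if;
}
```

In this model, interrupts_enabled_after_sequence(0) reports interrupts
enabled again, while interrupts_enabled_after_sequence(1) reports them still
disabled, which is the "disabled forever" situation described above.
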
> >
> > Another problem is "hidden bits" in registers.
> >
> > Let's say this:
> >
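> >     mov     cr0, eax                // read CR0 (source, destination order)
> >     mov     eax, ecx                // keep the original CR0 value in ecx
> >     or      $1, eax                 // set bit 0 (protected mode enable)
> >     mov     eax, cr0                // enter protected mode
> >     mov     $0x10, eax
> >     mov     eax, fs                 // load FS: the large limit gets cached
> >     mov     ecx, cr0                // restore CR0, FS keeps the cached limit
> >
> >     mov     $0xF000000, eax         // start address, far above 64KB
> >     mov     $10000, ecx             // loop count
> > $1:
> >     mov     $0, fs:eax              // store through FS using the hidden limit
> >     add     $4, eax
> >     dec     ecx
> >     jnz     $1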
> >
> > Let's now say that we have an interrupt that the hypervisor would handle
> > in the loop in the above code. The hypervisor itself uses FS for some
> > special purpose, and thus needs to save/restore the FS register. When it
> > returns, the system will crash (GP fault) because the FS register limit
> > is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS
> > was set to 0xFFFFFFFF before we took the interrupt... Incorrect
> > behaviour like this is terribly difficult to deal with, and there really
> > isn't any good way to solve these issues [other than not allowing the
> > code to run when it does "funny" things like this - or to perform the
> > necessary code in "translation mode" - i.e. emulate each instruction ->
> > slow(ish)].
> 
> Or introduce AMDV/IntelVT extensions?
> 
> > > I'm sure there can't be a performance issue, as this
> > > virtualization doesn't
> > > occur on the physical resource level, but is (should be)
> > > rather implemented
> > > as some sort of a multiplexed routing algorithm, I think :)
> >
> > I'm not entirely sure what this statement is trying to say, but as I
> > understand the situation, performance is entirely the reason why the Xen
> > paravirtual model was implemented - all other VMM's are slower [although
> > it's often hard to prove that, since for example Vmware have the rule
> > that they have to give permission before publishing benchmarks of their
> > product, and of course that permission would only be given in cases
> > where there is some benefit to them].
> >
> > One of the obvious reasons for para-virtual being better than full
> > virtualization is that it can be used in a "batched" mode. Let's say we
> > have some code that does this:
> >
> > ...
> >     p = malloc(2000 * 4096);
> > ...
> >
> > Let's then say that the guts of malloc ends up in something like this:
> >
> > map_pages_to_user(...)
> > {
> >     for(v = random_virtual_address, p = start_page; p < end_page; p++, v+=4096)
> >             map_one_page_to_user(p, v);
> > }
> >
> > In full virtualization, we have no way to understand that someone is
> > mapping 2000 pages to the same user-process in one guest, we'd just see
> > writes to the page-table one page at a time.
> >
> > In the para-virtual case, we could do something like:
> > map_pages_to_user(...)
> > {
> >     hypervisor_map_pages_to_user(current_process, start_page, end_page,
> >                                  random_virtual_address);
> > }
> >
> > Now, the hypervisor knows "the full story" and can map all those pages
> > in one go - much quicker, I would say. There's still more work than in
> > the native case, but it's much closer to the native case.
> 
> Sure, but wouldn't this be for the price of losing guest-OS transparency?

Life is full of compromises between one ideal solution and another. In
an ideal world, virtualization wouldn't cost anything, but it does.

Losing guest-OS transparency when the guest-OS is open-source isn't
really a big issue, in my opinion. However, if you haven't got
source-code readily available, it becomes a big issue - since without
source code, it gets much harder to make the necessary modifications
(probably to the extent that it's actually IMPOSSIBLE to make them in a
sane and reliable manner).

There is no doubt that para-virtualization is one viable solution to the
virtualization problem, but it's not the ONLY solution. Each user has a
choice: Recompile and get performance, or run unmodified code at lower
performance. 
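
As a back-of-the-envelope illustration of what the recompile buys in the
2000-page mapping example quoted earlier, the sketch below simply counts
guest-to-hypervisor transitions. It is a counting model only, not a
measurement and not Xen's actual interface: the struct and both function
names are hypothetical, and it assumes one transition per intercepted
page-table write versus one transition for a whole batched request.

```c
#include <stddef.h>

/* Hypothetical bookkeeping: how often we entered the hypervisor. */
struct hv_stats {
    size_t transitions;   /* guest -> hypervisor round trips */
    size_t pages_mapped;  /* pages actually mapped           */
};

/* Unmodified guest: every page-table write is seen in isolation. */
static void map_pages_one_by_one(struct hv_stats *s, size_t npages)
{
    for (size_t i = 0; i < npages; i++) {
        s->transitions++;
        s->pages_mapped++;
    }
}

/* Para-virtual guest: one request describes the whole range. */
static void map_pages_batched(struct hv_stats *s, size_t npages)
{
    s->transitions++;
    s->pages_mapped += npages;
}
```

Under those assumptions, mapping 2000 pages costs 2000 transitions in the
first case and a single one in the second, with the same number of pages
mapped either way. The hypervisor still has to do the mapping work itself,
so this says nothing about the absolute speed-up.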

--
Mats
> 
> 
> Thanks!
> 
> --
> Al
> 
> 
> 
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel