To: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
From: "Stephen C. Tweedie" <sct@xxxxxxxxxx>
Date: Thu, 09 Nov 2006 03:49:02 +0000
Cc: Wilfred Yu <wilfred.yu@xxxxxxxxx>, Xiaohui Xin <xiaohui.xin@xxxxxxxxx>, "Li, Xin B" <xin.b.li@xxxxxxxxx>, Susie Li <susie.li@xxxxxxxxx>, Steven Rostedt <srostedt@xxxxxxxxxx>, Herbert Xu <herbert.xu@xxxxxxxxxx>
Hi,

We have been debugging a nasty VMX issue recently: certain kernel
builds would fail to boot under VMX when loaded from syslinux, whereas
others booted fine, and the same kernels booted OK under grub.  In
practice, this means that we have only ever seen this on bootable ISO
images.

The underlying problem turned out to be vmxassist mishandling the
setting up of segment registers on entry into VM86_PROTECTED mode.  The
bug occurs when the guest is entering protected mode and reaches the far
jump to set up the 32-bit segment registers for the virtual VMENTER
we're about to perform.  vmxassist, in protected_mode(), looks at the
segment registers, which might be 16-bit segments or might be 32-bit
segment selectors at this stage (depending on whether the OS has
reloaded them since entering protected mode); and it tries to load the
segments from the GDT.  Unconditionally.  Even if the segment register
still actually contains a real-mode 16-bit segment.  Whoops.
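
(Purely as an illustration, not code from the patch: the following sketch
shows how the very same stale 16-bit value means two different things on
either side of the mode switch -- in real mode it is a paragraph base, but
a protected-mode descriptor load would misread it as a GDT selector.)

/* Illustration only, not vmxassist code. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint16_t seg = 0x3000;  /* e.g. the SS value syslinux leaves behind */

        uint32_t real_mode_base = (uint32_t)seg << 4;  /* 0x30000: segment * 16 */
        uint32_t gdt_offset     = seg & ~7u;           /* 0x3000: descriptor byte
                                                          offset if misread as a
                                                          protected-mode selector */

        printf("real-mode base 0x%x, GDT descriptor offset 0x%x\n",
               real_mode_base, gdt_offset);
        return 0;
}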


Now, enter the *second* bug, this time in the main Linux kernel itself.
On x86_64, the kernel boot sequence sets up a bogus GDT:
arch/x86_64/boot/setup.S has

gdt_48:
        .word   0x8000                          # gdt limit=2048,
                                                #  256 GDT entries

        .word   0, 0                            # gdt base (filled in later)

which shows that we think we have a 2048-byte GDT, with 256 entries
(that all adds up fine); but 2048 is 0x800, not 0x8000.  So when we do
an lgdt to set this GDT up, we are making it 16 times too big.
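
(For reference, the arithmetic as a small stand-alone illustration; this is
not kernel or patch code:)

/* 256 GDT entries of 8 bytes each occupy 0x800 bytes, so the limit word
 * handed to lgdt should be 0x800 (strictly 0x800 - 1), not 0x8000. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned entries = 256;
        unsigned bytes   = entries * 8;         /* 2048 == 0x800 */

        assert(bytes == 0x800);
        printf("intended limit 0x%x, bogus limit 0x%x (%u times too big)\n",
               bytes, 0x8000, 0x8000 / bytes);
        return 0;
}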


Unfortunately, when we enter this code from syslinux, SS has a 16-bit
value still, 0x3000.  That should be way off the end of the GDT and
hence illegal as a descriptor even if we mistakenly tried to load it,
but because the GDT has been loaded with the wrong size, vmxassist
thinks that 0x3000 *IS* a valid segment descriptor, and loads it into
the VMCS for the guest's protected mode VMENTER.

And so, if, by chance, the 8 bytes at (GDT+0x3000) in the kernel image
pass the VMENTER instruction's simple sanity tests, we ignore the
problem and shortly afterwards the kernel will load up a valid SS; but
if we fail those sanity tests then the VMENTER fails with "bad guest
state".  It's just luck whether a given vmlinuz works or not.


The reason that some kernels show this problem under Xen and others do
not is the 0x8000 GDT-size kernel bug.  But the blame lies
squarely with Xen, because the kernel has never loaded any segments from
the undefined GDT area above 0x800, and yet vmxassist tried to set up a
VMCS segment from it anyway.

So, while we would still like to fix the kernel GDT, the *real* problem
here is in vmxassist's mishandling of segments during the move to
protected mode.


Now, vmxassist already has code to detect 16-bit segments that survived
unmodified from a transition into and out of protected mode, and to save
and restore those appropriately.  It does this using "saved_rm_regs",
which is cleared on entry to protected mode and then set to the old
segment value if we fail to set a given 32-bit segment correctly.

The fix is to save the 16-bit segments *always*, on entry to protected
mode when %CR0(PE) is first set; and to clear the saved 16-bit segment
and set the 32-bit variant in oldctx whenever a 32-bit segment
descriptor is set during the transition to 32-bit CS.  Then, when we
finally do the VMENTER, we will set up the VMCS from only the 32-bit
segments, clearing the VMCS entries for segments that have not been
assigned valid 32-bit segments yet.

Tested on various RHEL-5 boot.isos, including ones which worked before
and ones which triggered the bug; all now boot correctly.

Signed-off-by: Stephen Tweedie <sct@xxxxxxxxxx>

diff -r 52ae8dd4bc75 tools/firmware/vmxassist/vm86.c
--- a/tools/firmware/vmxassist/vm86.c   Tue Oct 17 22:09:52 2006 +0100
+++ b/tools/firmware/vmxassist/vm86.c   Thu Nov 09 01:24:57 2006 +0000
@@ -847,6 +847,18 @@ load_seg(unsigned long sel, uint32_t *ba
 }
 
 /*
+ * Emulate a protected mode segment load, falling back to clearing it if
+ * the descriptor was invalid.
+ */
+static void
+load_or_clear_seg(unsigned long sel, uint32_t *base, uint32_t *limit, union 
vmcs_arbytes *arbytes)
+{
+       if (!load_seg(sel, base, limit, arbytes))
+               load_seg(0, base, limit, arbytes);          
+}
+
+
+/*
  * Transition to protected mode
  */
 static void
@@ -857,8 +869,6 @@ protected_mode(struct regs *regs)
        oldctx.eip = regs->eip;
        oldctx.esp = regs->uesp;
        oldctx.eflags = regs->eflags;
-
-       memset(&saved_rm_regs, 0, sizeof(struct regs));
 
        /* reload all segment registers */
        if (!load_seg(regs->cs, &oldctx.cs_base,
@@ -866,55 +876,16 @@ protected_mode(struct regs *regs)
                panic("Invalid %%cs=0x%x for protected mode\n", regs->cs);
        oldctx.cs_sel = regs->cs;
 
-       if (load_seg(regs->ves, &oldctx.es_base,
-                               &oldctx.es_limit, &oldctx.es_arbytes))
-               oldctx.es_sel = regs->ves;
-       else {
-               load_seg(0, &oldctx.es_base,
-                           &oldctx.es_limit, &oldctx.es_arbytes);
-               oldctx.es_sel = 0;
-               saved_rm_regs.ves = regs->ves;
-       }
-
-       if (load_seg(regs->uss, &oldctx.ss_base,
-                               &oldctx.ss_limit, &oldctx.ss_arbytes))
-               oldctx.ss_sel = regs->uss;
-       else {
-               load_seg(0, &oldctx.ss_base,
-                           &oldctx.ss_limit, &oldctx.ss_arbytes);
-               oldctx.ss_sel = 0;
-               saved_rm_regs.uss = regs->uss;
-       }
-
-       if (load_seg(regs->vds, &oldctx.ds_base,
-                               &oldctx.ds_limit, &oldctx.ds_arbytes))
-               oldctx.ds_sel = regs->vds;
-       else {
-               load_seg(0, &oldctx.ds_base,
-                           &oldctx.ds_limit, &oldctx.ds_arbytes);
-               oldctx.ds_sel = 0;
-               saved_rm_regs.vds = regs->vds;
-       }
-
-       if (load_seg(regs->vfs, &oldctx.fs_base,
-                               &oldctx.fs_limit, &oldctx.fs_arbytes))
-               oldctx.fs_sel = regs->vfs;
-       else {
-               load_seg(0, &oldctx.fs_base,
-                           &oldctx.fs_limit, &oldctx.fs_arbytes);
-               oldctx.fs_sel = 0;
-               saved_rm_regs.vfs = regs->vfs;
-       }
-
-       if (load_seg(regs->vgs, &oldctx.gs_base,
-                               &oldctx.gs_limit, &oldctx.gs_arbytes))
-               oldctx.gs_sel = regs->vgs;
-       else {
-               load_seg(0, &oldctx.gs_base,
-                           &oldctx.gs_limit, &oldctx.gs_arbytes);
-               oldctx.gs_sel = 0;
-               saved_rm_regs.vgs = regs->vgs;
-       }
+       load_or_clear_seg(oldctx.es_sel, &oldctx.es_base,
+                         &oldctx.es_limit, &oldctx.es_arbytes);
+       load_or_clear_seg(oldctx.ss_sel, &oldctx.ss_base,
+                         &oldctx.ss_limit, &oldctx.ss_arbytes);
+       load_or_clear_seg(oldctx.ds_sel, &oldctx.ds_base,
+                         &oldctx.ds_limit, &oldctx.ds_arbytes);
+       load_or_clear_seg(oldctx.fs_sel, &oldctx.fs_base,
+                         &oldctx.fs_limit, &oldctx.fs_arbytes);
+       load_or_clear_seg(oldctx.gs_sel, &oldctx.gs_base,
+                         &oldctx.gs_limit, &oldctx.gs_arbytes);
 
        /* initialize jump environment to warp back to protected mode */
        regs->cs = CODE_SELECTOR;
@@ -1002,6 +973,16 @@ set_mode(struct regs *regs, enum vm86_mo
        case VM86_REAL_TO_PROTECTED:
                if (mode == VM86_REAL) {
                        regs->eflags |= EFLAGS_TF;
+                       saved_rm_regs.vds = regs->vds;
+                       saved_rm_regs.ves = regs->ves;
+                       saved_rm_regs.vfs = regs->vfs;
+                       saved_rm_regs.vgs = regs->vgs;
+                       saved_rm_regs.uss = regs->uss;
+                       oldctx.ds_sel = 0;
+                       oldctx.es_sel = 0;
+                       oldctx.fs_sel = 0;
+                       oldctx.gs_sel = 0;
+                       oldctx.ss_sel = 0;
                        break;
                } else if (mode == VM86_REAL_TO_PROTECTED) {
                        break;
@@ -1262,6 +1243,10 @@ opcode(struct regs *regs)
                        else
                                regs->ves = pop16(regs);
                        TRACE((regs, regs->eip - eip, "pop %%es"));
+                       if (mode == VM86_REAL_TO_PROTECTED) {
+                               saved_rm_regs.ves = 0;
+                               oldctx.es_sel = regs->ves;
+                       }
                        return OPC_EMULATED;
 
                case 0x0F: /* two byte opcode */