WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

xen-devel

RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764

To: "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx>, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx>, "Keir Fraser" <keir.fraser@xxxxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
From: "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx>
Date: Wed, 19 May 2010 14:26:54 -0500
Cc:
Delivery-date: Wed, 19 May 2010 12:32:00 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <C8195105.14964%keir.fraser@xxxxxxxxxxxxx> <EACA7CA90354A849B1315959042A052C26F377@xxxxxxxxxxxxxxxxxxxxx> <EACA7CA90354A849B1315959042A052C26F37A@xxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: Acrr2MGaf7yfnFaQTZycCDfaoiD3fwGKD9vEATZRItYAEo2wXAAOFwx+AARLdsgABs3OZA==
Thread-topic: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764

I got a working solution.  The problem occurs because an HVM domain gets created without EPT support, which causes bits in the global variables vmx_vmexit_control and vmx_vmentry_control to be cleared.  When the comparison is done on resume, the BUG_ON fires because of the mismatch.

If you guys find it acceptable, I can port it to xen-unstable for integration into that tree.

Roger R. Cruz


diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
@@ -46,7 +46,9 @@
 u32 vmx_cpu_based_exec_control __read_mostly;
 u32 vmx_secondary_exec_control __read_mostly;
 u32 vmx_vmexit_control __read_mostly;
+u32 vmx_vmexit_control_must_clear __read_mostly;
 u32 vmx_vmentry_control __read_mostly;
+u32 vmx_vmentry_control_must_clear __read_mostly;
 bool_t cpu_has_vmx_ins_outs_instr_info __read_mostly;

 static DEFINE_PER_CPU(struct vmcs_struct *, host_vmcs);
@@ -187,7 +189,9 @@
         vmx_cpu_based_exec_control = _vmx_cpu_based_exec_control;
         vmx_secondary_exec_control = _vmx_secondary_exec_control;
         vmx_vmexit_control         = _vmx_vmexit_control;
+        vmx_vmexit_control_must_clear = 0;
         vmx_vmentry_control        = _vmx_vmentry_control;
+        vmx_vmentry_control_must_clear = 0;
         cpu_has_vmx_ins_outs_instr_info = !!(vmx_basic_msr_high & (1U<<22));
         vmx_display_features();
     }
@@ -198,7 +202,9 @@
         BUG_ON(vmx_pin_based_exec_control != _vmx_pin_based_exec_control);
         BUG_ON(vmx_cpu_based_exec_control != _vmx_cpu_based_exec_control);
         BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
+        _vmx_vmexit_control &= ~vmx_vmexit_control_must_clear;
         BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
+        _vmx_vmentry_control &= ~vmx_vmentry_control_must_clear;
         BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
         BUG_ON(cpu_has_vmx_ins_outs_instr_info !=
                !!(vmx_basic_msr_high & (1U<<22)));
@@ -533,9 +539,11 @@
     else
     {
         v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
-        vmx_vmexit_control &= ~(VM_EXIT_SAVE_GUEST_PAT |
-                                VM_EXIT_LOAD_HOST_PAT);
-        vmx_vmentry_control &= ~VM_ENTRY_LOAD_GUEST_PAT;
+        vmx_vmexit_control_must_clear |= (VM_EXIT_SAVE_GUEST_PAT |
+                                         VM_EXIT_LOAD_HOST_PAT);
+        vmx_vmexit_control &= ~vmx_vmexit_control_must_clear;
+        vmx_vmentry_control_must_clear |= VM_ENTRY_LOAD_GUEST_PAT;
+        vmx_vmentry_control &= ~vmx_vmentry_control_must_clear;
     }

     /* Do not enable Monitor Trap Flag unless start single step debug */
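The effect of the patch can be modelled outside Xen. The following is a minimal sketch, not hypervisor code: `struct ctl_state`, `clear_pat_bits()` and `recheck_ok()` are hypothetical names, the two macro values are assumed to match the VM-exit control bit layout, and 0xfefff is the vmexit value reported in the quoted message below.

```c
#include <stdint.h>

/* Assumed bit values; check them against vmcs.h in your tree. */
#define VM_EXIT_SAVE_GUEST_PAT 0x00040000u
#define VM_EXIT_LOAD_HOST_PAT  0x00080000u

/* Hypothetical stand-in for the pair of globals touched by the patch. */
struct ctl_state {
    uint32_t saved;       /* plays the role of vmx_vmexit_control */
    uint32_t must_clear;  /* plays the role of vmx_vmexit_control_must_clear */
};

/* Mimics the non-EPT branch of construct_vmcs() after the patch:
 * remember which bits were dropped, then drop them from the global. */
static void clear_pat_bits(struct ctl_state *s)
{
    s->must_clear |= VM_EXIT_SAVE_GUEST_PAT | VM_EXIT_LOAD_HOST_PAT;
    s->saved &= ~s->must_clear;
}

/* Mimics the re-check on CPU bring-up: mask the freshly computed value
 * with the remembered bits before comparing.  Returns 1 when the values
 * agree, i.e. when the BUG_ON would not fire. */
static int recheck_ok(const struct ctl_state *s, uint32_t fresh)
{
    fresh &= ~s->must_clear;
    return s->saved == fresh;
}
```

Starting from `{ 0xfefff, 0 }`, calling `clear_pat_bits()` leaves `saved` at 0x3efff, and `recheck_ok(&s, 0xfefff)` still succeeds because the same bits are masked out of the fresh value. Without the mask, the comparison of 0x3efff against 0xfefff is what trips the BUG_ON.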


-----Original Message-----
From: Roger Cruz
Sent: Wed 5/19/2010 12:36 PM
To: Roger Cruz; Keir Fraser; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764

Keir,

Following your recommendation to store the variables that are being checked on the BUG_ON, here is what I found to be different.

Upon platform boot, these are my base values.

(XEN)     vmx_vmexit_control = 0xfefff
(XEN)     vmx_vmentry_control = 0x51ff

(XEN)     _vmx_vmexit_control = 0xfefff
(XEN)     _vmx_vmentry_control = 0x51ff

When the sleep code is entered at "int acpi_enter_sleep(struct xenpf_enter_acpi_sleep *sleep)" in power.c, I print out the values as well.

(XEN) *** ACPI Enter Sleep has been called
(XEN)     vmx_vmexit_control = 0x3efff
(XEN)     vmx_vmentry_control = 0x11ff
(XEN)
(XEN)     _vmx_vmexit_control = 0xfefff
(XEN)     _vmx_vmentry_control = 0x51ff

At the time the hvm_cpu_up returns (hvm_cpu_up is where the BUG_ON code is invoked), I also print the values.

(XEN)     vmx_vmexit_control = 0x3efff
(XEN)     vmx_vmentry_control = 0x11ff

(XEN)     _vmx_vmexit_control = 0xfefff
(XEN)     _vmx_vmentry_control = 0x51ff

As one can see here, even before entering the sleep code, the "saved" vmx_vmexit_control and vmx_vmentry_control variables, against which we compare upon wakeup, already differ from the freshly computed values in a few bits.  The only place I found in the code that twiddles these bits is in vmcs.c, in "static int construct_vmcs(struct vcpu *v)":

    if ( paging_mode_hap(d) )
    {
        v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
                                          CPU_BASED_CR3_LOAD_EXITING |
                                          CPU_BASED_CR3_STORE_EXITING);
    }
    else
    {
        v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
        vmx_vmexit_control &= ~(VM_EXIT_SAVE_GUEST_PAT |
                                VM_EXIT_LOAD_HOST_PAT);
        vmx_vmentry_control &= ~VM_ENTRY_LOAD_GUEST_PAT;
    }
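As a cross-check, XOR-ing the saved and freshly computed values from the logs above isolates exactly the bits that changed. The sketch below assumes the usual bit positions for the PAT-related controls (bits 18 and 19 of the VM-exit controls, bit 14 of the VM-entry controls); `differing_bits()` is a hypothetical helper, not a Xen function.

```c
#include <stdint.h>

/* Assumed bit values; check them against vmcs.h in your tree. */
#define VM_EXIT_SAVE_GUEST_PAT   0x00040000u
#define VM_EXIT_LOAD_HOST_PAT    0x00080000u
#define VM_ENTRY_LOAD_GUEST_PAT  0x00004000u

/* Hypothetical helper: returns the bits that differ between the saved
 * global and the value recomputed from the MSRs. */
static uint32_t differing_bits(uint32_t saved, uint32_t fresh)
{
    return saved ^ fresh;
}
```

With the logged values, `differing_bits(0x3efff, 0xfefff)` gives 0xc0000, which is `VM_EXIT_SAVE_GUEST_PAT | VM_EXIT_LOAD_HOST_PAT`, and `differing_bits(0x11ff, 0x51ff)` gives 0x4000, which is `VM_ENTRY_LOAD_GUEST_PAT`. That is consistent with the else branch of construct_vmcs() being the only writer.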

Roger R. Cruz


-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Roger Cruz
Sent: Wed 5/19/2010 10:30 AM
To: Keir Fraser; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764

Keir and Jan,

Thank you for responding to my message.  Here is some additional info that may be of interest.

1) The problem has been reproduced on Dell Inspiron 1564 and 1764 models.  They do not have a serial port, so tracing of any sort has been impossible.  The system reboots when resuming from sleep, so any in-memory state is also lost.  If you have any suggestions on other tracing mechanisms, I'm all ears.  It has been very time-consuming to do it my way (see below).

2) The way I narrowed down the problem to these lines of code was by inserting a "while(1);" loop at different points in the code.  When it didn't reboot, I knew it had gotten to my while loop.  I just kept moving the while loop until I found the lines I highlighted in my previous msg.  Below is what my debug code looks like:

        //       if (sleeploop) while(1);  // it did not reboot up to this point
        BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
        //       if (sleeploop) while(1);  // did not reboot up to this point.
        BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
        //        if (sleeploop) while(1);  // Rebooted before here.
        BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);

3) You can see above that the vmx_vmexit_control check was the point at which the crash/reboot was being triggered.  However, if I commented out just that line, I would still see a reboot.  Only when I commented the whole block out did it finally work.  Is something overwriting the location of these variables, such that commenting out a line of code moved the data segment and caused a different variable to be overwritten?  I need to be able to explain this behavior, so I will be working towards that today.

4) My initial thought was that the BIOS was overwriting some of these locations, so I performed an experiment that I believe rules out the BIOS.  I commented out the code in power.c that puts the CPU into the sleep mode.  This had the effect of going through most of the sleep and wakeup code in power.c (though it does not go through all of the wakeup.S initialization).  When I did this, it still failed to resume from sleep as long as an HVM domain was present.  Here is the diff on power.c:

diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/acpi/power.c
--- a/xen-3.4.2/xen/arch/x86/acpi/power.c
+++ b/xen-3.4.2/xen/arch/x86/acpi/power.c
@@ -208,9 +208,11 @@
     switch ( state )
     {
     case ACPI_STATE_S3:
+#if 0
         do_suspend_lowlevel();
         system_reset_counter++;
         error = tboot_s3_resume();
+#endif       
         break;
     case ACPI_STATE_S5:
         acpi_enter_sleep_state(ACPI_STATE_S5);

5) The problem occurs even when Xen is run in uni-processor mode.  I achieved this by adding "nosmp=1 maxcpus=1" to the grub command line that boots xen.  I confirmed that Xen only reported one physical CPU, namely CPU0.  This should have avoided any issues with waking up other non-boot processors.

6) Finally, I narrowed down the type and condition of the domain that exhibits the problem by using Python to create a domain whose definition I control.  If I set "flags" to 0, the problem does not show up.  If I set it to "1" (hvm) and do NOT execute the "xc.domain_max_vcpus" call, the problem does not show up.  However, once I add one VCPU to this domain, the problem occurs.

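#! /usr/bin/python
# Minimal reproducer: create an empty HVM domain and give it one VCPU.
import sys
sys.path.append('/usr/lib/python2.6/site-packages')
import xen.lowlevel.xc
from xen.xend import uuid

xc = xen.lowlevel.xc.xc()
# flags=1 requests an HVM domain; with flags=0 the problem does not show up.
domid = xc.domain_create(domid=0, ssidref=0,
                         handle=uuid.fromString("bad0beef-dead-beef-dead-beefdeadbeef"),
                         flags=1)

print domid
# Without this call the problem does not show up; adding one VCPU triggers it.
xc.domain_max_vcpus(domid, 1)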


Roger R. Cruz



-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
Sent: Wed 5/19/2010 3:25 AM
To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764

On 18/05/2010 23:34, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:

> A little more info.  I am now able to wake up the Dell Inspiron 1764 after I
> put it to sleep.  I found that the code commented out below would cause the
> problems in my system.  I have yet to understand why these variables don't end
> up with the expected values.  If anyone has any thoughts that they would like
> to share on how this code works and why it is comparing to stored variables, I
> would very much like to hear them.

The BUG_ONs are to detect VMX versioning inconsistencies between processors.
The weird thing here is that you presumably brought all CPUs online during
initial system boot with no problem. So somehow something has changed only
after resume from S3. I think you will need to add tracing to discover which
BUG_ON is failing, and why.

Incidentally, in my CPU hotplug cleanup I will be making it so that CPUs
that fail the checks will fail to come online, rather than crash the system.
Which is a bit of an improvement, but obviously something is buggy
underlying this (possibly in BIOS code).
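The "fail to come online rather than crash" behaviour described above could look roughly like the sketch below. This is an illustration only, not the actual hotplug cleanup: `struct vmx_caps` and `vmx_caps_check()` are hypothetical names, and only two of the checked fields are modelled.

```c
#include <stdint.h>

/* Hypothetical container for the control words recorded at first boot
 * and recomputed on each CPU that comes up. */
struct vmx_caps {
    uint32_t vmexit_control;
    uint32_t vmentry_control;
};

/* Returns 0 when the CPU's capabilities match the recorded ones and -1
 * otherwise.  A caller would refuse to bring the CPU online on -1
 * instead of hitting BUG_ON and taking the whole system down. */
static int vmx_caps_check(const struct vmx_caps *boot,
                          const struct vmx_caps *cpu)
{
    if (boot->vmexit_control != cpu->vmexit_control)
        return -1;
    if (boot->vmentry_control != cpu->vmentry_control)
        return -1;
    return 0;
}
```

The comparison itself is unchanged; only the reaction to a mismatch differs, so an inconsistency is still detected but is no longer fatal.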

 -- Keir

> Thank you
> Roger R. Cruz
>
>
> diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
> --- a/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>
> @@ -191,19 +192,25 @@
>          cpu_has_vmx_ins_outs_instr_info = !!(vmx_basic_msr_high & (1U<<22));
>          vmx_display_features();
>      }
> +#if 0
>      else
>      {
>          /* Globals are already initialised: re-check them. */
>          BUG_ON(vmcs_revision_id != vmx_basic_msr_low);
>          BUG_ON(vmx_pin_based_exec_control != _vmx_pin_based_exec_control);
>          BUG_ON(vmx_cpu_based_exec_control != _vmx_cpu_based_exec_control);
>          BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
>          BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
>          BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
>          BUG_ON(cpu_has_vmx_ins_outs_instr_info !=
>                 !!(vmx_basic_msr_high & (1U<<22)));
>      }
>
> +#endif
>      /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
>      BUG_ON((vmx_basic_msr_high & 0x1fff) > PAGE_SIZE);
>
>
> -----Original Message-----
> From: Roger Cruz
> Sent: Wed 5/12/2010 2:38 PM
> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
>
>
> We have made some progress in getting the inspiron laptops to work under Xen.
> We tried xenunstable and xen-4.0.0 and discovered that xenunstable can resume
> whereas xen-4.0.0 cannot.  Through trial and error, we have been able to
> narrow down the actual changes that allowed it to work.  It looks like moving
> the trampoline code down from its 0x8c000 location allowed it to resume.
>
> So we took the change below and applied it to our 3.4.2 tree.  However, we
> still have a problem in our 3.4.2 tree with this patch applied.  If an HVM
> guest is running, the resume will fail with the exact same behavior as before.
> Due to our environment setup, we have not been able to test xenunstable with
> an HVM guest, so we can't say if this problem is fixed in xenunstable or not.
> Can someone familiar with these changes provide a clue as to what is going on?
> How does having an HVM guest running affect the resume functionality?  Running
> PV linux guests does not affect resume, only HVM guests do.
>
>
> --- old/xen-3.4.2/xen/include/asm-x86/config.h  2010-05-12 11:44:35.243564976 -0400
> +++ new/xen-3.4.2/xen/include/asm-x86/config.h  2010-05-12 11:44:35.026578602 -0400
> @@ -96,7 +96,7 @@
>  /* Primary stack is restricted to 8kB by guard pages. */
>  #define PRIMARY_STACK_SIZE 8192
>
> -#define BOOT_TRAMPOLINE 0x8c000
> +#define BOOT_TRAMPOLINE 0x7c000
>  #define bootsym_phys(sym)                                 \
>      (((unsigned long)&(sym)-(unsigned long)&trampoline_start)+BOOT_TRAMPOLINE)
>  #define bootsym(sym)                                      \
>
>
>
> -------
>
> Hello fellow Xen developers,
>
> I'm about to start debugging why Dell Inspirons running Xen 3.4.2 fail to
> resume after a suspend operation.  A colleague has also found that the problem
> exists on bare-metal Linux
> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/571422) and an upstream
> patch has been created
> (http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commitdiff;h=29c60ccc1a408371885d79d8f8c081fbcb9b10be).
>
> I would like to find out if anyone in the Xen community has encountered this
> problem and if a fix is in the works.  Otherwise, I will attempt to provide a
> similar solution to Linux's patch.
>
> thanks
> Roger
>
>
>







_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel