WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

xen-devel

Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764

To: Roger Cruz <roger.cruz@xxxxxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
From: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Date: Wed, 19 May 2010 15:50:07 +0100

On 19/05/2010 15:30, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:

> 2) The way I narrowed down the problem to these lines of code was by inserting a
> "while(1);" loop at different points in the code.  When it didn't reboot, I
> knew it had gotten to my while loop.  I just kept moving the while loop until
> I found the lines I highlighted in my previous msg.  Below is what my debug
> code looks like:

Your system seems to hobble along just fine if you remove the BUG_ON()s, so
why not convert them into printk() warnings? Or if it's too early for
printk, stash some info in memory and printk() it at the very end of S3
resume.
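
A minimal sketch of the second option, under stated assumptions: the
buffer, the struct and both function names are hypothetical and not part
of Xen, and printf() stands in for printk() so the example builds as an
ordinary C program. Each check stashes a mismatch in a fixed-size array
instead of crashing, and the array is printed once output is safe.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-size log; filling it needs no console or allocator. */
#define MAX_MISMATCHES 8

struct mismatch {
    const char *name;   /* which global disagreed */
    uint32_t saved;     /* value recorded at first boot */
    uint32_t fresh;     /* value recomputed on resume */
};

static struct mismatch mismatch_log[MAX_MISMATCHES];
static unsigned int mismatch_count;

/* Use in place of BUG_ON(saved != fresh): remember the mismatch and carry
 * on. Returns 1 if the values differ, 0 otherwise. Entries beyond
 * MAX_MISMATCHES are counted as failures but not stored. */
static int check_and_stash(const char *name, uint32_t saved, uint32_t fresh)
{
    if (saved == fresh)
        return 0;
    if (mismatch_count < MAX_MISMATCHES) {
        mismatch_log[mismatch_count].name = name;
        mismatch_log[mismatch_count].saved = saved;
        mismatch_log[mismatch_count].fresh = fresh;
        mismatch_count++;
    }
    return 1;
}

/* Call once printing is safe, e.g. at the very end of S3 resume. Prints
 * every stashed entry, clears the log, and returns how many were printed. */
static unsigned int flush_mismatches(void)
{
    unsigned int i, n = mismatch_count;

    assert(n <= MAX_MISMATCHES);
    for (i = 0; i < n; i++)
        printf("VMX check %s: saved %#x, now %#x\n",
               mismatch_log[i].name,
               (unsigned int)mismatch_log[i].saved,
               (unsigned int)mismatch_log[i].fresh);
    mismatch_count = 0;
    return n;
}
```

In your tree, a line such as BUG_ON(vmx_vmexit_control !=
_vmx_vmexit_control) would become check_and_stash("vmx_vmexit_control",
vmx_vmexit_control, _vmx_vmexit_control), with one flush_mismatches()
call at the end of resume.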

> 3) You can see above that the vmx_vmexit_control check was the point at which
> the crash/reboot was being triggered.  However, if I commented out just that
> line, I would still see a reboot.  Only when I commented the whole block out
> did it finally work.   Is something overwriting the location of these
> variables such that when I commented out a line of code, it moved the data
> segment causing a different variable to be overwritten?    I need to be able
> to explain this behavior.  So I will be working towards that today.

I would assume that more than one of the BUG_ON()s is triggering. So if you
just comment out the first offending one that you find, you instead fall
foul of a second one.
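
One way to test that assumption is to evaluate every check in a single
pass and report all of the failures together, rather than commenting
lines out one at a time. The sketch below is illustrative only: struct
vmx_check and run_all_checks() are hypothetical names, and printf()
again stands in for printk().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One saved-versus-recomputed pair, mirroring a single BUG_ON() line. */
struct vmx_check {
    const char *name;   /* label printed when the pair differs */
    uint32_t saved;     /* global initialised at first boot */
    uint32_t fresh;     /* value recomputed on this pass */
};

/* Evaluate every entry without stopping at the first failure. Returns a
 * bitmask with bit i set when entry i differs, so the table may hold at
 * most 32 entries. */
static unsigned int run_all_checks(const struct vmx_check *checks, size_t n)
{
    unsigned int failed = 0;
    size_t i;

    assert(n <= 32);
    for (i = 0; i < n; i++) {
        if (checks[i].saved != checks[i].fresh) {
            printf("mismatch in %s: saved %#x, now %#x\n",
                   checks[i].name,
                   (unsigned int)checks[i].saved,
                   (unsigned int)checks[i].fresh);
            failed |= 1u << i;
        }
    }
    return failed;
}
```

A result with more than one bit set would confirm that several checks
fail at once, which would explain why removing only the
vmx_vmexit_control line still rebooted.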

> 4) My initial thoughts were that the BIOS was overwriting some of these
> locations, so I performed an experiment that I believe rules out the BIOS.  I
> commented out the code in power.c that puts the CPU into the sleep mode.  This
> had the effect of going through most of the sleep and wakeup code in power.c
> (it does not go through all the wakeup.S initialization as well).  When I did
> this, it still failed to resume from sleep as long as an HVM domain was
> present.  Here is the diff on power.c

Yep, that patch should do the expected thing and do everything except the
actual BIOS S3 transition.

Well, overall this does sound like a memory corruption issue, not a BIOS or
platform issue. You need to printk out the contents of variables
contributing to your failing BUG_ON()s and see what's written there, I
think.

 -- Keir

> 5) The problem occurs even when Xen is run in uni-processor mode.  I achieved
> this by adding "nosmp=1 maxcpus=1" to the grub command line that boots xen.  I
> confirmed that Xen only reported one physical CPU, namely CPU0.  This should
> have avoided any issues with waking up other non-boot processors.
> 
> 6) Finally, I narrowed down the type and condition of domain that exhibits
> the problem by using Python to create a domain whose definition I control.
> If I set "flags" to 0, the problem does not show up.  If I set it to "1"
> (hvm) and do NOT execute the "xc.domain_max_vcpus" call, the problem does
> not show up.  However, once I add one VCPU to this domain, the problem
> occurs.
> 
> #! /usr/bin/python
> import sys
> sys.path.append('/usr/lib/python2.6/site-packages')
> import xen.lowlevel.xc
> from xen.xend import uuid
> xc = xen.lowlevel.xc.xc()
> domid = xc.domain_create(domid=0, ssidref=0,
>     handle=uuid.fromString("bad0beef-dead-beef-dead-beefdeadbeef"), flags=1)
> 
> print domid
> xc.domain_max_vcpus(domid, 1)
> 
> 
> Roger R. Cruz
> 
> 
> 
> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
> Sent: Wed 5/19/2010 3:25 AM
> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
> 
> On 18/05/2010 23:34, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:
> 
>> A little more info.  I am now able to wake up the Dell Inspiron 1764 after I
>> put it to sleep.  I found that the code commented out below would cause the
>> problems in my system.  I have yet to understand why these variables don't
>> end up with the expected values.  If anyone has any thoughts that they
>> would like to share on how this code works and why it is comparing to
>> stored variables, I would very much like to hear them.
> 
> The BUG_ONs are to detect VMX versioning inconsistencies between processors.
> The weird thing here is that you presumably brought all CPUs online during
> initial system boot with no problem. So somehow something has changed only
> after resume from S3. I think you will need to add tracing to discover which
> BUG_ON is failing, and why.
> 
> Incidentally, in my CPU hotplug cleanup I will be making it so that CPUs
> that fail the checks will fail to come online, rather than crash the system.
> Which is a bit of an improvement, but obviously something is buggy
> underlying this (possibly in BIOS code).
> 
>  -- Keir
> 
>> Thank you
>> Roger R. Cruz
>> 
>> 
>> diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> --- a/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> +++ b/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> 
>> @@ -191,19 +192,25 @@
>>          cpu_has_vmx_ins_outs_instr_info = !!(vmx_basic_msr_high & (1U<<22));
>>          vmx_display_features();
>>      }
>> +#if 0
>>      else
>>      {
>>          /* Globals are already initialised: re-check them. */
>>          BUG_ON(vmcs_revision_id != vmx_basic_msr_low);
>>          BUG_ON(vmx_pin_based_exec_control != _vmx_pin_based_exec_control);
>>          BUG_ON(vmx_cpu_based_exec_control != _vmx_cpu_based_exec_control);
>>          BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
>>          BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
>>          BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
>>          BUG_ON(cpu_has_vmx_ins_outs_instr_info !=
>>                 !!(vmx_basic_msr_high & (1U<<22)));
>>      }
>> 
>> +#endif
>>      /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
>>      BUG_ON((vmx_basic_msr_high & 0x1fff) > PAGE_SIZE);
>> 
>> 
>> -----Original Message-----
>> From: Roger Cruz
>> Sent: Wed 5/12/2010 2:38 PM
>> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
>> 
>> 
>> We have made some progress in getting the Inspiron laptops to work under Xen.
>> We tried xen-unstable and xen-4.0.0 and discovered that xen-unstable can
>> resume whereas xen-4.0.0 cannot.  Through trial and error, we have been able
>> to narrow down the actual changes that allowed it to work.  It looks like
>> moving the trampoline code down from its 0x8c000 location allowed it to
>> resume.
>> 
>> So we took the change below and applied it to our 3.4.2 tree.  However, we
>> still have a problem in our 3.4.2 tree with this patch applied.  If an HVM
>> guest is running, the resume will fail with the exact same behavior as
>> before.  Due to our environment setup, we have not been able to test
>> xen-unstable with an HVM guest, so we can't say if this problem is fixed in
>> xen-unstable or not.  Can someone familiar with these changes provide a clue
>> as to what is going on?  How does having an HVM guest running affect the
>> resume functionality?  Running PV Linux guests does not affect resume, only
>> HVM guests do.
>> 
>> 
>> --- old/xen-3.4.2/xen/include/asm-x86/config.h  2010-05-12 11:44:35.243564976 -0400
>> +++ new/xen-3.4.2/xen/include/asm-x86/config.h  2010-05-12 11:44:35.026578602 -0400
>> @@ -96,7 +96,7 @@
>>  /* Primary stack is restricted to 8kB by guard pages. */
>>  #define PRIMARY_STACK_SIZE 8192
>> 
>> -#define BOOT_TRAMPOLINE 0x8c000
>> +#define BOOT_TRAMPOLINE 0x7c000
>>  #define bootsym_phys(sym)                                 \
>>      (((unsigned long)&(sym)-(unsigned long)&trampoline_start)+BOOT_TRAMPOLINE)
>>  #define bootsym(sym)                                      \
>> 
>> 
>> 
>> -------
>> 
>> Hello fellow Xen developers,
>> 
>> I'm about to start debugging why Dell Inspirons running Xen 3.4.2 fail to
>> resume after a suspend operation.  A colleague has also found that the
>> problem exists on bare-metal Linux
>> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/571422) and an upstream
>> patch has been created
>> (http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commitdiff;h=29c60ccc1a408371885d79d8f8c081fbcb9b10be).
>> 
>> I would like to find out if anyone in the Xen community has encountered this
>> problem and if a fix is in the works.  Otherwise, I will attempt to provide a
>> similar solution to Linux's patch.
>> 
>> thanks
>> Roger
>> 
>> 
>> 
> 
> 
> 
> 


