Keir and Jan,
Thank you for responding to my message. Here is some additional info that may be of interest.
1) The problem has been reproduced on Dell Inspiron 1564 and 1764 models. They do not have a serial port, so tracing of any sort has been impossible. The system reboots when resuming from sleep, so any in-memory state is also lost. If you have any suggestions on other tracing mechanisms, I'm all ears. It has been very time-consuming to do it my way (see below).
2) The way I narrowed down the problem to these lines of code was by inserting a "while(1);" loop at different points in the code. When it didn't reboot, I knew it had gotten to my while loop. I just kept moving the while loop until I found the lines I highlighted in my previous msg. Below is what my debug code looks like:
// if (sleeploop) while(1); // it did not reboot up to this point
BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
// if (sleeploop) while(1); // did not reboot up to this point.
BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
// if (sleeploop) while(1); // Rebooted before here.
BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
3) As you can see above, the vmx_vmexit_control check was the point at which the crash/reboot was being triggered. However, when I commented out just that line, I still saw a reboot. Only when I commented out the whole block did it finally work. Is something overwriting the location of these variables, such that commenting out a line of code moved the data segment and caused a different variable to be overwritten? I need to be able to explain this behavior, so I will be working towards that today.
4) My initial thought was that the BIOS was overwriting some of these locations, so I performed an experiment that I believe rules out the BIOS. I commented out the code in power.c that puts the CPU into the sleep mode. This had the effect of going through most of the sleep and wakeup code in power.c (though it does not go through all of the wakeup.S initialization). When I did this, it still failed to resume from sleep as long as an HVM domain was present. Here is the diff on power.c:
diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/acpi/power.c
--- a/xen-3.4.2/xen/arch/x86/acpi/power.c
+++ b/xen-3.4.2/xen/arch/x86/acpi/power.c
@@ -208,9 +208,11 @@
switch ( state )
{
case ACPI_STATE_S3:
+#if 0
do_suspend_lowlevel();
system_reset_counter++;
error = tboot_s3_resume();
+#endif
break;
case ACPI_STATE_S5:
acpi_enter_sleep_state(ACPI_STATE_S5);
5) The problem occurs even when Xen is run in uni-processor mode. I achieved this by adding "nosmp=1 maxcpus=1" to the grub command line that boots xen. I confirmed that Xen only reported one physical CPU, namely CPU0. This should have avoided any issues with waking up other non-boot processors.
6) Finally, I narrowed down the type and condition of domain that exhibits the problem by using Python to create a domain whose definition I control. If I set "flags" to 0, the problem does not show up. If I set it to "1" (hvm) and do NOT execute the "xc.domain_max_vcpus" call, the problem does not show up. However, once I add one VCPU to this domain, the problem occurs.
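#! /usr/bin/python
# Creates a bare HVM domain with one VCPU to reproduce the resume failure.
import sys
sys.path.append('/usr/lib/python2.6/site-packages')
import xen.lowlevel.xc
from xen.xend import uuid
xc = xen.lowlevel.xc.xc()
# flags=1 requests an HVM domain; with flags=0 the problem does not show up.
domid=xc.domain_create(domid=0,ssidref=0,handle=uuid.fromString("bad0beef-dead-beef-dead-beefdeadbeef"), flags=1)
print domid
# Adding the VCPU is what triggers the problem; without this call, resume works.
xc.domain_max_vcpus(domid, 1)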
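For anyone who wants to repeat the combinations described in 6) without editing the script each time, the calls can be wrapped in a small helper. This is only a sketch: create_test_domain and RecordingXc are hypothetical names introduced here, the 16-zero handle is a placeholder (the real script passes whatever uuid.fromString returns), and RecordingXc merely records calls so the helper can be dry-run on a machine without Xen. On a real system, pass xen.lowlevel.xc.xc() instead.

```python
class RecordingXc(object):
    """Hypothetical stand-in for xen.lowlevel.xc.xc(); it only records calls."""

    def __init__(self):
        self.calls = []
        self._next_domid = 1

    def domain_create(self, domid=0, ssidref=0, handle=None, flags=0):
        self.calls.append(("domain_create", flags))
        created = self._next_domid
        self._next_domid += 1
        return created

    def domain_max_vcpus(self, domid, max_vcpus):
        self.calls.append(("domain_max_vcpus", domid, max_vcpus))


def create_test_domain(xc, handle, hvm, vcpus):
    """Create a bare domain and optionally give it VCPUs; return its domid.

    hvm=True maps to flags=1 and hvm=False to flags=0, as in the script above.
    vcpus=0 skips the domain_max_vcpus call entirely.
    """
    flags = 1 if hvm else 0
    domid = xc.domain_create(domid=0, ssidref=0, handle=handle, flags=flags)
    if vcpus > 0:
        xc.domain_max_vcpus(domid, vcpus)
    return domid


if __name__ == "__main__":
    xc = RecordingXc()
    handle = [0] * 16  # placeholder handle, an assumption for the dry run
    create_test_domain(xc, handle, hvm=True, vcpus=0)  # HVM, no VCPU
    create_test_domain(xc, handle, hvm=True, vcpus=1)  # HVM, one VCPU
    for call in xc.calls:
        print(call)
```

Passing the xc object in as a parameter keeps the helper independent of how the Xen bindings are located on a given install.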
Roger R. Cruz
-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
Sent: Wed 5/19/2010 3:25 AM
To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
On 18/05/2010 23:34, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:
> A little more info. I am now able to wake up the Dell Inspiron 1764 after I
> put it to sleep. I found that the code commented out below would cause the
> problems in my system. I have yet to understand why these variables don't end
> up with the expected values. If anyone has any thoughts that they would like
> to share on how this code works and why it is comparing to stored variables, I
> would very much like to hear them.
The BUG_ONs are to detect VMX versioning inconsistencies between processors.
The weird thing here is that you presumably brought all CPUs online during
initial system boot with no problem. So somehow something has changed only
after resume from S3. I think you will need to add tracing to discover which
BUG_ON is failing, and why.
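
Once the cached and the freshly computed value of a failing pair can be captured by whatever means is available, XOR-ing them shows which control bits changed across S3. The following is a minimal offline sketch in Python, not Xen code: differing_bits and describe_mismatch are hypothetical helper names, and the two sample values are made up purely for illustration, not taken from the failing laptops.

```python
def differing_bits(cached, fresh, width=32):
    """Return the bit positions (0 = least significant) where the words differ."""
    delta = (cached ^ fresh) & ((1 << width) - 1)
    return [bit for bit in range(width) if (delta >> bit) & 1]


def describe_mismatch(name, cached, fresh):
    """Return a one-line summary of a mismatching pair, or None if they agree."""
    bits = differing_bits(cached, fresh)
    if not bits:
        return None
    return "%s: cached=0x%08x fresh=0x%08x differing bits=%s" % (
        name, cached, fresh, bits)


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(describe_mismatch("vmx_vmexit_control", 0x00036dff, 0x00036fff))
```

The bit positions can then be looked up against the VMX control definitions to see which capability the processor reports differently after resume.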
Incidentally, in my CPU hotplug cleanup I will be making it so that CPUs
that fail the checks will fail to come online, rather than crash the system.
Which is a bit of an improvement, but obviously something is buggy
underlying this (possibly in BIOS code).
-- Keir
> Thank you
> Roger R. Cruz
>
>
> diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
> --- a/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>
> @@ -191,19 +192,25 @@
> cpu_has_vmx_ins_outs_instr_info = !!(vmx_basic_msr_high & (1U<<22));
> vmx_display_features();
> }
> +#if 0
> else
> {
> /* Globals are already initialised: re-check them. */
> BUG_ON(vmcs_revision_id != vmx_basic_msr_low);
> BUG_ON(vmx_pin_based_exec_control != _vmx_pin_based_exec_control);
> BUG_ON(vmx_cpu_based_exec_control != _vmx_cpu_based_exec_control);
> BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
> BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
> BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
> BUG_ON(cpu_has_vmx_ins_outs_instr_info !=
> !!(vmx_basic_msr_high & (1U<<22)));
> }
>
> +#endif
> /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
> BUG_ON((vmx_basic_msr_high & 0x1fff) > PAGE_SIZE);
>
>
> -----Original Message-----
> From: Roger Cruz
> Sent: Wed 5/12/2010 2:38 PM
> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
>
>
> We have made some progress in getting the inspiron laptops to work under Xen.
> We tried xenunstable and xen-4.0.0 and discovered that xenunstable can resume
> whereas xen-4.0.0 cannot. Through trial and error, we have been able to
> narrow down the actual changes that allowed it to work. It looks like moving
> the trampoline code down from its 0x8c000 location allowed it to resume.
>
> So we took the change below and applied it to our 3.4.2 tree. However, we
> still have a problem in our 3.4.2 tree with this patch applied. If an HVM
> guest is running, the resume will fail with the exact same behavior as before.
> Due to our environment setup, we have not been able to test xenunstable with
> an HVM guest, so we can't say if this problem is fixed in xenunstable or not.
> Can someone familiar with these changes provide a clue as to what is going on?
> How does having an HVM guest running affect the resume functionality? Running
> PV linux guests does not affect resume, only HVM guests do.
>
>
> --- old/xen-3.4.2/xen/include/asm-x86/config.h 2010-05-12 11:44:35.243564976 -0400
> +++ new/xen-3.4.2/xen/include/asm-x86/config.h 2010-05-12 11:44:35.026578602 -0400
> @@ -96,7 +96,7 @@
> /* Primary stack is restricted to 8kB by guard pages. */
> #define PRIMARY_STACK_SIZE 8192
>
> -#define BOOT_TRAMPOLINE 0x8c000
> +#define BOOT_TRAMPOLINE 0x7c000
> #define bootsym_phys(sym) \
> (((unsigned long)&(sym)-(unsigned long)&trampoline_start)+BOOT_TRAMPOLINE)
> #define bootsym(sym) \
>
>
>
> -------
>
> Hello fellow Xen developers,
>
> I'm about to start debugging why Dell Inspirons running Xen 3.4.2 fail to
> resume after a suspend operation. A colleague has also found that the problem
> exists on bare-metal Linux
> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/571422) and an upstream
> patch has been created
> (http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commitdiff;h=29c60ccc1a408371885d79d8f8c081fbcb9b10be).
>
> I would like to find out if anyone in the Xen community has encountered this
> problem and if a fix is in the works. Otherwise, I will attempt to provide a
> similar solution to Linux's patch.
>
> thanks
> Roger
>
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel