On 19/05/2010 15:30, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:
> 2) The way I narrowed down the problem to these lines of code was by inserting a
> "while(1);" loop at different points in the code. When it didn't reboot, I
> knew it had gotten to my while loop. I just kept moving the while loop until
> I found the lines I highlighted in my previous msg. Below is what my debug
> code looks like:
Your system seems to hobble along just fine if you remove the BUG_ON()s, so
why not convert them into printk() warnings? Or if it's too early for
printk, stash some info in memory and printk() it at the very end of S3
resume.
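
A rough sketch of the stash-and-print idea, written as user-space C with
printf() standing in for printk(). The names resume_log, resume_log_record and
resume_log_flush are made up for illustration and are not existing Xen code;
the buffer size is an arbitrary assumption.

```c
#include <stdio.h>

#define RESUME_LOG_MAX 16  /* arbitrary capacity, chosen for this sketch */

/* One recorded mismatch: which check failed and the two values compared. */
struct resume_log_entry {
    const char *what;     /* name of the variable being checked */
    unsigned int saved;   /* value stored during initial boot */
    unsigned int fresh;   /* value recomputed after resume */
};

static struct resume_log_entry resume_log[RESUME_LOG_MAX];
static unsigned int resume_log_count;

/*
 * Record a mismatch without printing anything, so it can be called
 * before the console is usable. Matching values are ignored, and
 * entries beyond the buffer capacity are dropped.
 */
void resume_log_record(const char *what, unsigned int saved,
                       unsigned int fresh)
{
    if (saved == fresh || resume_log_count >= RESUME_LOG_MAX)
        return;
    resume_log[resume_log_count].what = what;
    resume_log[resume_log_count].saved = saved;
    resume_log[resume_log_count].fresh = fresh;
    resume_log_count++;
}

/*
 * Print every recorded mismatch and empty the log.
 * Returns the number of entries that were printed.
 */
unsigned int resume_log_flush(void)
{
    unsigned int i, n = resume_log_count;

    for (i = 0; i < n; i++)
        printf("resume check failed: %s saved=%#x fresh=%#x\n",
               resume_log[i].what, resume_log[i].saved,
               resume_log[i].fresh);
    resume_log_count = 0;
    return n;
}
```

Under these assumptions, each BUG_ON(a != b) in the re-check block would
become a resume_log_record("a", a, b) call, and resume_log_flush() would be
called once at the point where printk() is known to work again.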
> 3) You can see above that the vmx_vmexit_control check was the point at which
> the crash/reboot was being triggered. However, if I commented out just that
> line, I would still see a reboot. Only when I commented the whole block out
> did it finally work. Is something overwriting the location of these
> variables such that when I commented out a line of code, it moved the data
> segment causing a different variable to be overwritten? I need to be able
> to explain this behavior. So I will be working towards that today.
I would assume that more than one of the BUG_ON()s is triggering. So if you
just comment out the first offending one that you find, you instead fall
foul of a second one.
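
To see every failing check in one pass rather than finding them one at a time,
something along these lines might help. It is only a sketch in user-space C
with printf() in place of printk(); struct vmx_check and vmx_recheck_all are
hypothetical names, not part of vmcs.c.

```c
#include <stdio.h>
#include <stddef.h>

/* One saved-versus-recomputed pair, mirroring a single BUG_ON() comparison. */
struct vmx_check {
    const char *name;     /* variable name, used only for the message */
    unsigned int saved;   /* global initialised during first boot */
    unsigned int fresh;   /* value recomputed on this pass */
};

/*
 * Walk every check instead of stopping at the first failure.
 * Returns a bitmask with bit i set when checks[i] mismatches;
 * at most the first 32 checks are examined.
 */
unsigned int vmx_recheck_all(const struct vmx_check *checks, size_t n)
{
    unsigned int failed = 0;
    size_t i;

    for (i = 0; i < n && i < 32; i++) {
        if (checks[i].saved != checks[i].fresh) {
            printf("mismatch: %s saved=%#x fresh=%#x\n",
                   checks[i].name, checks[i].saved, checks[i].fresh);
            failed |= 1u << i;
        }
    }
    return failed;
}
```

A non-zero return value with more than one bit set would confirm that several
of the comparisons fail together, which would explain why commenting out a
single line is not enough.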
> 4) My initial thoughts were that the BIOS was overwriting some of these
> locations, so I performed an experiment that I believe rules out the BIOS. I
> commented out the code in power.c that puts the CPU into the sleep mode. This
> had the effect of going through most of the sleep and wakeup code in power.c
> (it does not go through all the wakeup.S initialization as well). When I did
> this, it still failed to resume from sleep as long as an HVM domain was
> present. Here is the diff on power.c
Yep, that patch should do the expected thing and do everything except the
actual BIOS S3 transition.
Well, overall this does sound like a memory corruption issue, not a BIOS or
platform issue. You need to printk out the contents of variables
contributing to your failing BUG_ON()s and see what's written there, I
think.
-- Keir
> 5) The problem occurs even when Xen is run in uni-processor mode. I achieved
> this by adding "nosmp=1 maxcpus=1" to the grub command line that boots xen. I
> confirmed that Xen only reported one physical CPU, namely CPU0. This should
> have avoided any issues with waking up other non-boot processors.
>
> 6) Finally, I narrowed down the type and state of domain that exhibits the
> problem by using Python to create a domain whose definition I control. If I
> set "flags" to 0, the problem does not show up. If I set it to "1" (hvm) and
> do NOT execute the "xc.domain_max_vcpus" call, the problem does not show up.
> However, once I add one VCPU to this domain, the problem occurs.
>
> #! /usr/bin/python
> import sys
> sys.path.append('/usr/lib/python2.6/site-packages')
> import xen.lowlevel.xc
> from xen.xend import uuid
> xc = xen.lowlevel.xc.xc()
> domid = xc.domain_create(domid=0, ssidref=0,
>     handle=uuid.fromString("bad0beef-dead-beef-dead-beefdeadbeef"),
>     flags=1)
>
> print domid
> xc.domain_max_vcpus(domid, 1)
>
>
> Roger R. Cruz
>
>
>
> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
> Sent: Wed 5/19/2010 3:25 AM
> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
>
> On 18/05/2010 23:34, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:
>
>> A little more info. I am now able to wake up the Dell Inspiron 1764 after I
>> put it to sleep. I found that the code commented out below would cause the
>> problems in my system. I have yet to understand why these variables don't
>> end up with the expected values. If anyone has any thoughts that they would
>> like to share on how this code works and why it is comparing to stored
>> variables, I would very much like to hear them.
>
> The BUG_ONs are to detect VMX versioning inconsistencies between processors.
> The weird thing here is that you presumably brought all CPUs online during
> initial system boot with no problem. So somehow something has changed only
> after resume from S3. I think you will need to add tracing to discover which
> BUG_ON is failing, and why.
>
> Incidentally, in my CPU hotplug cleanup I will be making it so that CPUs
> that fail the checks will fail to come online, rather than crash the system.
> Which is a bit of an improvement, but obviously something is buggy
> underlying this (possibly in BIOS code).
>
> -- Keir
>
>> Thank you
>> Roger R. Cruz
>>
>>
>> diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> --- a/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> +++ b/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>>
>> @@ -191,19 +192,25 @@
>> cpu_has_vmx_ins_outs_instr_info = !!(vmx_basic_msr_high & (1U<<22));
>> vmx_display_features();
>> }
>> +#if 0
>> else
>> {
>> /* Globals are already initialised: re-check them. */
>> BUG_ON(vmcs_revision_id != vmx_basic_msr_low);
>> BUG_ON(vmx_pin_based_exec_control != _vmx_pin_based_exec_control);
>> BUG_ON(vmx_cpu_based_exec_control != _vmx_cpu_based_exec_control);
>> BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
>> BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
>> BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
>> BUG_ON(cpu_has_vmx_ins_outs_instr_info !=
>> !!(vmx_basic_msr_high & (1U<<22)));
>> }
>>
>> +#endif
>> /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
>> BUG_ON((vmx_basic_msr_high & 0x1fff) > PAGE_SIZE);
>>
>>
>> -----Original Message-----
>> From: Roger Cruz
>> Sent: Wed 5/12/2010 2:38 PM
>> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
>>
>>
>> We have made some progress in getting the inspiron laptops to work under Xen.
>> We tried xenunstable and xen-4.0.0 and discovered that xenunstable can resume
>> whereas xen-4.0.0 cannot. Through trial and error, we have been able to
>> narrow down the actual changes that allowed it to work. It looks like moving
>> the trampoline code down from its 0x8c000 location allowed it to resume.
>>
>> So we took the change below and applied it to our 3.4.2 tree. However, we
>> still have a problem in our 3.4.2 tree with this patch applied. If an HVM
>> guest is running, the resume will fail with the exact same behavior as
>> before. Due to our environment setup, we have not been able to test
>> xenunstable with an HVM guest, so we can't say if this problem is fixed in
>> xenunstable or not. Can someone familiar with these changes provide a clue
>> as to what is going on? How does having an HVM guest running affect the
>> resume functionality? Running PV Linux guests does not affect resume, only
>> HVM guests do.
>>
>>
>> --- old/xen-3.4.2/xen/include/asm-x86/config.h 2010-05-12 11:44:35.243564976 -0400
>> +++ new/xen-3.4.2/xen/include/asm-x86/config.h 2010-05-12 11:44:35.026578602 -0400
>> @@ -96,7 +96,7 @@
>> /* Primary stack is restricted to 8kB by guard pages. */
>> #define PRIMARY_STACK_SIZE 8192
>>
>> -#define BOOT_TRAMPOLINE 0x8c000
>> +#define BOOT_TRAMPOLINE 0x7c000
>> #define bootsym_phys(sym) \
>> (((unsigned long)&(sym)-(unsigned long)&trampoline_start)+BOOT_TRAMPOLINE)
>> #define bootsym(sym) \
>>
>> -------
>>
>> Hello fellow Xen developers,
>>
>> I'm about to start debugging why Dell Inspirons running Xen 3.4.2 fail to
>> resume after a suspend operation. A colleague has also found that the
>> problem exists on bare-metal Linux
>> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/571422) and an upstream
>> patch has been created
>> (http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commitdiff;h=29c60ccc1a408371885d79d8f8c081fbcb9b10be).
>>
>> I would like to find out if anyone in the Xen community has encountered this
>> problem and if a fix is in the works. Otherwise, I will attempt to provide a
>> similar solution to Linux's patch.
>>
>> thanks
>> Roger
>>
>>
>>
>
>
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel