Re: [Xen-devel] Instability with Xen, interrupt routing frozen,

To:	Andreas Kinzler <ml-xen-devel@xxxxxx>
Subject:	Re: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast
From:	Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date:	Wed, 29 Sep 2010 12:50:48 -0700
Cc:	xen-devel@xxxxxxxxxxxxxxxxxxx, JBeulich@xxxxxxxxxx, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
In-reply-to:	<4CA38093.9070802@xxxxxx>
References:	<4C88A6F3.9020207@xxxxxx> <20100921115604.GP2804@xxxxxxxxxxx> <4CA38093.9070802@xxxxxx>

 On 09/29/2010 11:08 AM, Andreas Kinzler wrote:
> On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>>>   I have been talking with Jan (via email) for a while now to track down
>>> the following problem, and he suggested that I report it on xen-devel:
>>>
>>> Jul  9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>>> Jul  9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>>> Jul  9 01:49:10 virt kernel: Calling adapter init
>>> Jul  9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>>> Jul  9 01:49:49 virt kernel: Acquiring adapter information
>>> Jul  9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>>> Jul  9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>>> Jul  9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>>> Jul  9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>>> Jul  9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>>
>>> After the VMs have been running a while the aacraid driver reports a
>>> non-responding RAID controller. Most of the time the NIC is also no
>>> longer working.
>>> I tried nearly every combination of dom0 kernel (pvops0, xenified SUSE
>>> 2.6.31.x, xenified SUSE 2.6.32.x, xenified SUSE 2.6.34.x) with Xen
>>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
>>> No success in two months. Sooner or later, every combination showed the
>>> problem above. I did extensive tests to make sure that the
>>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>>
>>> Jan suggested trying the fix in c/s 22051, but it did not help. My
>>> answer to him:
>>>
>>>> In the meantime I did try xen-unstable c/s 22068 (contains staging c/s 22051) and
>>>> it did not fix the problem at all. I was able to fix a problem with the serial console
>>>> and so I got some debug info that is attached to this email. The following line looks
>>>> suspicious to me (irr=1, delivery_status=1):
>>>
>>>> (XEN)     IRQ 16 Vec216:
>>>> (XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical,
>>>>              delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1
>>>
>>>> IRQ 16 is the aacraid controller, which after a while seems to be unable to receive
>>>> interrupts. Can you see from the debug info what is going on?
>>>
>>> I also applied a small patch which disables HPET broadcast. The machine
>>> has now been running for 110 hours without a crash, while normally it
>>> crashes within a few minutes. Is there something wrong (race, deadlock)
>>> with HPET broadcasts in relation to blocked interrupt reception (see
>>> above)?
>> What kind of hardware does this happen on?
>
> It is a Supermicro X8SIL-F, Intel Xeon 3450 system.

That's exactly what my main test/devel machine is.  It has been very
stable for me with xen-unstable.  Is 4.0.1 different from xen-unstable
with respect to HPET?

The big problem I had initially was instability with the integrated
ethernet until I disabled PCIe ASPM.  The symptom was that the ethernet
devices would disappear (ie, their PCI config space would start to read
all 0xff...)
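
A minimal sketch of how one might check for that symptom from dom0, assuming a Linux sysfs layout where each PCI device exposes its config space as /sys/bus/pci/devices/<address>/config. The helper names (config_space_is_dead, find_dead_devices) are made up for illustration; a device that has dropped off the bus typically returns 0xff for every byte read.

```python
from pathlib import Path
from typing import List


def config_space_is_dead(header: bytes) -> bool:
    """Return True if every byte read from config space is 0xff.

    An all-0xff read usually means the device no longer responds
    on the bus. An empty read is treated as "unknown", not dead.
    """
    return len(header) > 0 and all(b == 0xFF for b in header)


def find_dead_devices(sysfs_root: str = "/sys/bus/pci/devices") -> List[str]:
    """List PCI addresses whose first 64 config bytes read as all 0xff.

    The sysfs path is an assumption about the running system; if it
    does not exist, an empty list is returned.
    """
    dead = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return dead
    for dev in sorted(root.iterdir()):
        try:
            with open(dev / "config", "rb") as f:
                header = f.read(64)  # standard config header is 64 bytes
        except OSError:
            continue  # unreadable device entry, skip it
        if config_space_is_dead(header):
            dead.append(dev.name)
    return dead


if __name__ == "__main__":
    for name in find_dead_devices():
        print(f"{name}: config space reads all 0xff")
```

Running this periodically (e.g. from cron) while the box is under load would show whether the NIC vanishes before or after the aacraid timeouts appear.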

>> Should this patch be merged?
>
> Not easy to answer. I spent more than 10 weeks searching nearly full
> time for the cause of the stability issues. Finally I was able to
> track it down to the HPET broadcast code.
>
> We need to find the developer of the HPET broadcast code, who should
> then try to fix it. I consider it quite a severe bug as it
> renders Xen nearly useless on affected systems. That is why I (and my
> boss who pays me) spent so much time (developing/fixing Xen is not
> really my core job) and money (buying an E5620 machine just for testing
> Xen).
>
> I think many people on affected systems are having problems. See
> http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html

Just out of interest, does disabling ASPM help?   I had to disable it in
the BIOS, and set pcie_aspm=off on the kernel command line.

This is a total shot in the dark, but given that we're using identical
systems it seems worth a try.
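
To confirm that the kernel option actually took effect after a reboot, something along these lines could be used. It is only a sketch: it assumes the dom0 kernel exposes its boot arguments in /proc/cmdline, and the function name aspm_disabled_on_cmdline is hypothetical. It does not tell you anything about the BIOS setting, which has to be checked separately.

```python
def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if 'pcie_aspm=off' appears as its own boot argument."""
    return "pcie_aspm=off" in cmdline.split()


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the arguments the running kernel was booted with.
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""
    if aspm_disabled_on_cmdline(cmdline):
        print("pcie_aspm=off is set on the kernel command line")
    else:
        print("pcie_aspm=off is NOT set on the kernel command line")
```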

Thanks,
    J

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
