On 17/08/2010 18:28, "Bruce Edge" <bruce.edge@xxxxxxxxx> wrote:
> On Tue, Jun 29, 2010 at 1:42 AM, Jan Beulich <JBeulich@xxxxxxxxxx> wrote:
>>>>> On 28.06.10 at 20:22, Dante Cinco <dantecinco@xxxxxxxxx> wrote:
>>> I have an HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) with Xen 4.0.0
>>> and dom0 Linux 2.6.32.12 x86_64 pvops and domU Linux kernel 2.6.30.1 x86_64.
>>> I'm using PCI passthrough (pci-stub) to pass my 4-port 8Gb PMC-Sierra Fibre
>>> Channel HBA to domU. After running I/Os for several hours, both dom0 and
>>> domU hang and the Xen console shows the interrupt binding below where IRQ
>>> 66 shows in-flight=1 and mask set (---M). What's the best way to debug this
>>> problem?
>>
>> There are potentially two problems here: One is that the guest may
>> fail to send the EOI notification. You would want to check whether
>> pirq_guest_eoi() got run after that last occurrence of the interrupt.
>>
>> The more worrying part is that Xen should time out on a guest failing
>> to send the EOI notification, and ack the interrupt nevertheless.
>> Looking at the code I fail to see how the ack_APIC_irq() would get
>> sent in this case: non-maskable MSIs get this issued from
>> end_msi_irq(), but ->end doesn't get invoked from
>> irq_guest_eoi_timer_fn() (only ->enable does). Keir, am I missing
>> something?
I don't think that timer logic is designed to handle non-maskable MSIs, only
maskable ones. It ought not to be too hard to fix it up for non-maskable
ones too, by issuing the ->end() call from the timer handler.
-- Keir
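
To make the suggestion above concrete, here is a toy model in Python. It is
not Xen code: the class ToyMsi, its fields, and the function eoi_timeout are
hypothetical names invented for illustration. The only behaviour taken from
this thread is that the LAPIC ack for a non-maskable MSI is issued from the
->end hook and that the timeout path currently invokes only ->enable. The
handling of maskable MSIs (masked and acked at delivery) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyMsi:
    """Hypothetical stand-in for one guest-bound MSI (not a Xen structure)."""
    maskable: bool
    masked: bool = False
    lapic_acked: bool = True

    def deliver(self):
        if self.maskable:
            # Assumption: a maskable MSI is masked and acked on delivery.
            self.masked = True
            self.lapic_acked = True
        else:
            # From the thread: the ack for a non-maskable MSI is deferred
            # to the ->end hook, so it is still outstanding here.
            self.lapic_acked = False

    def enable(self):
        # Models ->enable: unmask only; it never sends the deferred ack.
        self.masked = False

    def end(self):
        # Models ->end: send the deferred ack for a non-maskable MSI.
        if not self.maskable:
            self.lapic_acked = True


def eoi_timeout(irq, call_end):
    """Models the guest-EOI timeout path, with or without the ->end call."""
    if call_end:
        irq.end()
    irq.enable()


for call_end in (False, True):
    irq = ToyMsi(maskable=False)
    irq.deliver()
    eoi_timeout(irq, call_end)
    print(f"call_end={call_end}: lapic_acked={irq.lapic_acked}")
```

In this model the non-maskable interrupt stays unacked after the timeout
unless ->end is also called. It says nothing about the interrupt-storm risk
raised in the quoted text that follows, which a real fix would still have to
address.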
>> Otoh I can't see how this can work reliably in the first place: Since
>> there's no other way to mask such interrupts, sending an ack to the
>> LAPIC could result in an interrupt storm. Disabling MSI on the
>> affected device isn't a good option either, as we know there are
>> devices that switch to legacy IRQ mode irreversibly in that case,
>> and hence the device becomes unusable (presumably until being
>> reset). But very likely this would still be better than hanging the
>> entire box; it probably would just need a more graceful timeout.
>>
>> Jan
>
>
> This is still happening. I have 2 identical boxes that were running a stress
> test and both hung after a few hours. They have identical hardware and
> software configs so I'll report the config for one and attach the xen dump for
> both.
>
> dom0 info:
>
> HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz)
>
> # cat /proc/cmdline
> root=/dev/mapper/system-dom0_0 ro earlyprintk=xen loglevel=10 debug acpi=force
> console=hvc0,115200n8
>
> # uname -a
> Linux dpm8800-09 2.6.32.16 #1 SMP Wed Aug 4 15:38:21 PDT 2010 x86_64 GNU/Linux
>
> The domU is an Ubuntu 10.04 kernel, 2.6.32.15+drm33.5 in hvm mode.
>
> # xm info
> host : dpm8800-09
> release : 2.6.32.16
> version : #1 SMP Wed Aug 4 15:38:21 PDT 2010
> machine : x86_64
> nr_cpus : 16
> nr_nodes : 2
> cores_per_socket : 4
> threads_per_core : 2
> cpu_mhz : 2533
> hw_caps :
> bfebfbff:28100800:00000000:00001b40:009ce3bd:00000000:00000001:00000000
> virt_caps : hvm hvm_directio
> total_memory : 12277
> free_memory : 11631
> node_to_cpu : node0:0,2,4,6,8,10,12,14
> node1:1,3,5,7,9,11,13,15
> node_to_memory : node0:5601
> node1:6029
> node_to_dma32_mem : node0:3506
> node1:0
> max_node_id : 1
> xen_major : 4
> xen_minor : 0
> xen_extra : .1-rc4
> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : unavailable
> xen_commandline : dom0_mem=512M dom0_max_vcpus=1 dom0_vcpus_pin=true
> iommu=1,passthrough,no-intremap loglvl=all loglvl_guest=all loglevl=10 debug
> apic=on apic_verbosity=verbose extra_guest_irqs=80 com1=115200,8n1
> console=com1 console_to_ring xen-pciback.permissive acpi=force numa=on
> cc_compiler : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
> cc_compile_by : bedge
> cc_compile_domain : lsi.com
> cc_compile_date : Sun Aug 1 09:44:29 PDT 2010
> xend_config_format : 4
>
> This device (as well as a few more of these) is passed through via pciback:
>
> dpm8800-09:~# lspci | grep 10:
> 10:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.1 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.2 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08) <- in both cases
> it's this device that loses the interrupt in flight
>
> 10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> Flags: bus master, fast devsel, latency 0, IRQ 5
> I/O ports at a800 [size=256]
> I/O ports at ac00 [size=256]
> Memory at fbdc0000 (64-bit, non-prefetchable) [size=32K]
> Capabilities: [50] Power Management version 3
> Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/1 Enable-
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Mask- TabSize=9
> Capabilities: [100] Advanced Error Reporting <?>
>
>
> From host dpm8800-10:
> (XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:94
> type=PCI-MSI status=00000050 in-flight=0 domain-list=2:126(----),
> (XEN) IRQ: 134 affinity:00000000,00000000,00000000,00000001 vec:d4
> type=PCI-MSI status=00000050 in-flight=1 domain-list=2:125(---M),
> (XEN) IRQ: 135 affinity:00000000,00000000,00000000,00000004 vec:9c
> type=PCI-MSI status=00000010 in-flight=0 domain-list=2:124(----),
>
> From host dpm8800-09:
> (XEN) IRQ: 131 affinity:00000000,00000000,00000000,00002000 vec:7f
> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 62(----),
> (XEN) IRQ: 132 affinity:00000000,00000000,00000000,00000001 vec:dd
> type=PCI-MSI status=00000010 in-flight=1 domain-list=2:127(---M),
> (XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:3e
> type=PCI-MSI status=00000010 in-flight=0 domain-list=2:126(----),
>
> This time both cases correspond to 10:00.3:
>
> (XEN) 10:00.3 - dom 2 - MSIs < 132 >
>
> (XEN) MSI 132 vec=dc fixed edge assert phys cpu dest=00000010
> mask=0/0/-1
>
>
> Let me know if there's anything else I can provide to assist in diagnosing
> this problem.
>
> Thanks
>
> -Bruce
>
>>
>>> (XEN) IRQ: 66 affinity:00000000,00000000,00000000,00000001 vec:b9
>>> type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 79(---M),
>>> (XEN) IRQ: 67 affinity:00000000,00000000,00000000,00000004 vec:d9
>>> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 78(----),
>>> (XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000010 vec:22
>>> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 77(----),
>>> (XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000040 vec:2a
>>> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 76(----),
>>>
>>> (XEN) 07:00.3 - dom 1 - MSIs < 69 >
>>> (XEN) 07:00.2 - dom 1 - MSIs < 68 >
>>> (XEN) 07:00.1 - dom 1 - MSIs < 67 >
>>> (XEN) 07:00.0 - dom 1 - MSIs < 66 >
>>>
>>> (XEN) MSI 66 vec=b9 fixed edge assert phys cpu dest=00000000
>>> mask=0/0/-1
>>> (XEN) MSI 67 vec=d9 fixed edge assert phys cpu dest=00000004
>>> mask=0/0/-1
>>> (XEN) MSI 68 vec=22 fixed edge assert phys cpu dest=00000002
>>> mask=0/0/-1
>>> (XEN) MSI 69 vec=2a fixed edge assert phys cpu dest=00000006
>>> mask=0/0/-1
>>>
>>> Thanks.
>>>
>>> Dante
>>
>>
>>
>
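
For anyone comparing several of these dumps, the stuck entries can be picked
out mechanically. The following is a minimal sketch in Python, assuming the
IRQ lines look like the ones quoted in this thread (possibly wrapped and
prefixed with mail quote markers) and that each domain-list entry reads as
domain:pirq(flags). The function names join_records and stuck_irqs are made
up for this example.

```python
import re

# Leading mail quote markers such as "> " or ">>> ".
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")

# One IRQ record after wrapped lines have been joined back together.
IRQ_RECORD = re.compile(
    r"IRQ:\s*(?P<irq>\d+)\s+affinity:(?P<affinity>\S+)\s+"
    r"vec:(?P<vec>[0-9a-fA-F]+)\s+type=(?P<type>\S+)\s+"
    r"status=(?P<status>[0-9a-fA-F]+)\s+"
    r"in-flight=(?P<in_flight>\d+)\s+domain-list=(?P<domain_list>.*)"
)

# A single domain-list entry, e.g. "2:125(---M)" or "1: 79(---M)".
GUEST_BINDING = re.compile(r"(\d+):\s*(\d+)\(([^)]*)\)")


def join_records(text):
    """Strip quote markers and glue wrapped lines onto their "(XEN)" line."""
    records = []
    for raw in text.splitlines():
        line = QUOTE_PREFIX.sub("", raw).strip()
        if not line:
            continue
        if line.startswith("(XEN)"):
            records.append(line)
        elif records:
            records[-1] += " " + line
    return records


def stuck_irqs(text):
    """Return bindings whose IRQ is in flight and whose flags contain 'M'."""
    stuck = []
    for record in join_records(text):
        m = IRQ_RECORD.search(record)
        if m is None or int(m.group("in_flight")) == 0:
            continue
        for dom, pirq, flags in GUEST_BINDING.findall(m.group("domain_list")):
            if "M" in flags:
                stuck.append({
                    "irq": int(m.group("irq")),
                    "vec": m.group("vec"),
                    "domain": int(dom),
                    "pirq": int(pirq),
                    "flags": flags,
                })
    return stuck


SAMPLE = """\
> (XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:94
> type=PCI-MSI status=00000050 in-flight=0 domain-list=2:126(----),
> (XEN) IRQ: 134 affinity:00000000,00000000,00000000,00000001 vec:d4
> type=PCI-MSI status=00000050 in-flight=1 domain-list=2:125(---M),
"""

for entry in stuck_irqs(SAMPLE):
    print(entry)
```

Fed the dpm8800-10 excerpt, it reports only IRQ 134. It only filters text; it
does not tell you whether the guest missed the EOI or Xen failed to ack.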