Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
Monday, November 17, 2014, 5:34:16 PM, you wrote:
> On Fri, Nov 14, 2014 at 11:09:58PM +0100, Sander Eikelenboom wrote:
>>
>> Friday, November 14, 2014, 9:25:13 PM, you wrote:
>>
>> > On Fri, Nov 14, 2014 at 05:59:23PM +0100, Sander Eikelenboom wrote:
>> >>
>> >> Friday, November 14, 2014, 4:43:58 PM, you wrote:
>> >>
>> >> >>>> On 14.11.14 at 16:20, <linux@xxxxxxxxxxxxxx> wrote:
>> >> >> If it still helps i could try Andrews suggestion and try out with only
>> >> >> commit aeeea485 ..
>> >>
>> >> > Yes, even if it's pretty certain it's the second of the commits,
>> >> > verifying
>> >> > this would be helpful (or if the assumption is wrong, the pattern it's
>> >> > dying with would change and hence perhaps provide further clues).
>> >>
>> >> > Jan
>> >>
>> >>
>> >> Ok with a revert of f6dd295 .. it survived cooking and eating a nice bowl
>> >> of
>> >> pasta without a panic. So it would probably be indeed that specific
>> >> commit.
>>
>> > Could you try running with these two patches while you enjoy a beer in
>> > the evening?
>>
>> Hmm i didn't expect it not to panic and reboot anymore :-)
> I should have also asked you to run with 'iommu=verbose,debug', but
> that can be done later..
I was running with iommu=on,verbose,amd-iommu-debug ..
> The guest d16 looks to have two PCI passthrough devices:
> (XEN) [2014-11-14 21:31:26.569] io.c:550: d16: bind: m_gsi=37 g_gsi=36 dev=00.00.5 intx=0
> (XEN) [2014-11-14 21:31:28.095] io.c:550: d16: bind: m_gsi=47 g_gsi=40 dev=00.00.6 intx=0
> And one of them uses just the GSI while the other uses four MSI-X, is
> that about right?
Yes, guest 16 has one USB controller (guest side 00:05.0) with MSI-X enabled,
and one Conexant video grabber (guest side 00:06.0) which should be MSI capable,
but MSI is not enabled (probably by the driver), so it is using legacy
interrupts.
> I tried to reproduce that on my AMD box with two NICs:
> # lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> 00:02.0 VGA compatible controller: Technical Corp. Device 1111
> 00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01)
> 00:04.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> 00:05.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)
> # cat /proc/interrupts |grep eth
> 36: 384183 0 xen-pirq-ioapic-level eth0
> 63: 1 0 xen-pirq-msi-x eth1
> 64: 24 661961 xen-pirq-msi-x eth1-rx-0
> 65: 205 0 xen-pirq-msi-x eth1-rx-1
> 66: 162 0 xen-pirq-msi-x eth1-tx-0
> 67: 190 0 xen-pirq-msi-x eth1-tx-1
> Is that a similar distribution of IRQ/MSIx you end up having?
Are these from when the devices are still active and assigned to dom0 (and not
owned by pci-back), or from inside the guest?
I attached my /proc/interrupts for both dom0 and guest 16 with all guests
running (on a Xen from before the dpci changes).
With the devices passed through I only see one line with the IRQ of a
PCI soundcard passed through to a PV guest:
22: 38959 0 0 0 0 0 xen-pirq-ioapic-level xen-pciback[0000:03:06.0]
All the other devices passed through (to HVM guests) are not visible in
/proc/interrupts of dom0.
In the guest i do get these:
23: 35 0 0 0 xen-pirq-ioapic-level uhci_hcd:usb3
40: 13440077 0 0 0 xen-pirq-ioapic-level cx25821[1], cx25821[1]
84: 2956369 0 0 0 xen-pirq-msi-x xhci_hcd
85: 0 0 0 0 xen-pirq-msi-x xhci_hcd
86: 0 0 0 0 xen-pirq-msi-x xhci_hcd
87: 0 0 0 0 xen-pirq-msi-x xhci_hcd
88: 0 0 0 0 xen-pirq-msi-x xhci_hcd
>>
>> However xl dmesg (complete one attached) showed it would have:
>>
>> (XEN) [2014-11-14 21:35:50.646] --MARK--
>> (XEN) [2014-11-14 21:35:56.861] grant_table.c:305:d0v0 Increased maptrack size to 9 frames
>> (XEN) [2014-11-14 21:36:00.647] --MARK--
>> (XEN) [2014-11-14 21:36:10.410] grant_table.c:1299:d16v1 Expanding dom (16) grant table from (5) to (6) frames.
>> (XEN) [2014-11-14 21:36:10.820] --MARK--
>> (XEN) [2014-11-14 21:36:20.820] --MARK--
>> (XEN) [2014-11-14 21:36:30.820] --MARK--
>> (XEN) [2014-11-14 21:36:40.821] --MARK--
>> (XEN) [2014-11-14 21:36:50.821] --MARK--
>> (XEN) [2014-11-14 21:37:00.388] CPU00:
>> (XEN) [2014-11-14 21:37:00.399] CPU01:
>> (XEN) [2014-11-14 21:37:00.410] d16 OK-softirq 20msec ago, state:1, 41220 count, [prev:ffff83054ef5e3e0, next:ffff83054ef5e3e0] PIRQ:0
>> (XEN) [2014-11-14 21:37:00.445] d16 OK-raise 46msec ago, state:1, 41223 count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
>> (XEN) [2014-11-14 21:37:00.481] d16 ERR-poison 92msec ago, state:0, 1 count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
>> (XEN) [2014-11-14 21:37:00.515] d16 Z-softirq 28853msec ago, state:2, 1 count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
> The PIRQ:0 would imply that this is the legacy interrupt - which would be your
> 0a:00.0 device (Conexant Systems, Inc. Device 8210).
Correct.
> And it is pounding on this CPU - and the issue is that the 'test_and_clear_bit'
> ends up returning 0 - which means it was not able to set STATE_SCHED (!?):
>
>     if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
>     {
>         hvm_dirq_assist(d, pirq_dpci);
>         put_domain(d);
>     }
>     else
>     {
>         _record(&debug->zombie_softirq, pirq_dpci);
>
> which causes us to record it [Z-softirq], which says we are in state 2
> (1<<STATE_RUN).
>
>         reset = 1;
>     }
>
> .. eons ago (28853msec).
> Hmm. There is something fishy there but the only theory I have is that
> we end up doing 'list_del' twice on different CPUs on the same structure.
The pounding would be correct .. since it's a videograbber ... wouldn't be
fun not stretching the limits ;-) (however it's running fine for about 2 or 3
years)
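On the double 'list_del' theory: the prev:0000000000200200 / next:0000000000100100 pairs in the log above match the poison pattern that list_del() conventionally writes into an entry it has just unlinked (LIST_POISON2 / LIST_POISON1). Below is a minimal standalone sketch of that mechanism, not the Xen code itself. All names (list_node, node_del_checked, POISON_NEXT, POISON_PREV) are made up for illustration; only the two constants are taken from the log.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; the values mirror the prev/next pattern in the log. */
#define POISON_NEXT ((struct list_node *)(uintptr_t)0x00100100)
#define POISON_PREV ((struct list_node *)(uintptr_t)0x00200200)

struct list_node {
    struct list_node *next, *prev;
};

/* An empty circular list points at itself. */
static void node_init(struct list_node *n)
{
    n->next = n->prev = n;
}

/* Link n in front of head, i.e. at the tail of the list. */
static void node_add_tail(struct list_node *n, struct list_node *head)
{
    assert(n != head);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* An entry carrying either poison value has already been unlinked. */
static int node_is_poisoned(const struct list_node *n)
{
    return n->next == POISON_NEXT || n->prev == POISON_PREV;
}

/*
 * Unlink n and poison its pointers.
 * Returns 0 on success, -1 if n was already unlinked (a second delete
 * would otherwise dereference the poison values).
 */
static int node_del_checked(struct list_node *n)
{
    if (node_is_poisoned(n))
        return -1;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = POISON_NEXT;
    n->prev = POISON_PREV;
    return 0;
}
```

In this sketch a second delete of the same entry is refused instead of following the poisoned pointers. An unchecked list_del running twice on the same structure would do the latter, which is the failure the theory above describes.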
> That should not be possible, but then this check - 'test_and_clear_bit'
> returned
> 0 which means that somebody had cleared it (or it failed to clear it?)
> But the only other path for 'clearing' it is via the error paths and you are
> not hitting any of them.
> In the mean-time, could you try this patch. It adds a bit more debug to help
> me figure this out.
Ok will do this evening, thx !
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 23e5ed1..443975c 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -126,17 +126,17 @@ static void dump_record(struct _debug_f *d, unsigned int type)
> BUG();
>
> now = NOW();
> - printk("d%d %s %lumsec ago, state:%x, %ld count, [prev:%p, next:%p] ",
> + printk("d%d %s %lumsec ago, state:%x, %ld count, [prev:%p, next:%p] %p",
> d->domid, names[type],
> (unsigned long)((now - d->last) / MILLISECS(1)),
> - d->state, d->count, d->list.prev, d->list.next);
> + d->state, d->count, d->list.prev, d->list.next, d->dpci);
>
> if ( d->dpci )
> {
> struct hvm_pirq_dpci *pirq_dpci = d->dpci;
>
> for ( i = 0; i <= _HVM_IRQ_DPCI_GUEST_MSI_SHIFT; i++ )
> - if ( pirq_dpci->flags & 1 << _HVM_IRQ_DPCI_TRANSLATE_SHIFT )
> + if ( pirq_dpci->flags & (1 << i) )
> printk("%s ", names_flag[i]);
>
> printk(" PIRQ:%d", pirq_dpci->pirq);
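The last hunk changes the loop from testing one fixed bit on every iteration (so either every name or no name got printed) to testing bit i. A standalone sketch of that per-bit decoding follows. The flag names and the count of four are placeholders, not the real _HVM_IRQ_DPCI_* values, and sketch_decode_flags is a hypothetical helper that writes into a buffer instead of calling printk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Placeholder names; the real table in the debug patch is names_flag[]. */
#define SKETCH_NFLAGS 4
static const char *const sketch_names[SKETCH_NFLAGS] = {
    "FLAG0", "FLAG1", "FLAG2", "FLAG3"
};

/*
 * Append the name of every set bit in flags to buf, each followed by a
 * space. Returns the number of names written. Stops early, keeping buf
 * properly terminated, if the next name would not fit.
 */
static int sketch_decode_flags(unsigned int flags, char *buf, size_t len)
{
    size_t used = 0;
    int i, n = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (i = 0; i < SKETCH_NFLAGS; i++) {
        if (flags & (1u << i)) {          /* test bit i, not a fixed bit */
            int w = snprintf(buf + used, len - used, "%s ", sketch_names[i]);

            if (w < 0 || (size_t)w >= len - used) {
                buf[used] = '\0';         /* drop the partial name */
                break;
            }
            used += (size_t)w;
            n++;
        }
    }
    return n;
}
```

For example, flags of 0x5 yield "FLAG0 FLAG2 " here, one name per set bit.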
Attachment: proc-interrupts-dom0.txt
Attachment: proc-interrupts-guest.txt
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel