xen-devel
Re: [Xen-devel] Network dies and kernel errors
On Friday, July 29, 2011 10:03:48 am Konrad Rzeszutek Wilk wrote:
> On Mon, Jul 25, 2011 at 02:18:21PM -0500, John McMonagle wrote:
> > Have a new amd 6100 based server.
> > http://www.supermicro.com/Aplus/system/2U/2022/AS-2022G-URF.cfm
> > Running debian squeeze with debian 2.6.32 xen kernel
> > Running xen 4.1.1 built from source from xen.org
> >
> > I'm seeing 2 errors.
> > during boot get this:
> >
> > [ 0.004823] ------------[ cut here ]------------
> > [ 0.004833] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/arch/x86/xen/enlighten.c:726 init_hw_perf_events+0x32d/0x3cd()
> > [ 0.004838] Hardware name: H8DGU
> > [ 0.004841] Modules linked in:
> > [ 0.004847] Pid: 0, comm: swapper Not tainted 2.6.32-5-xen-amd64 #1
> > [ 0.004850] Call Trace:
> > [ 0.004857] [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> > [ 0.004862] [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> > [ 0.004870] [<ffffffff8104ef00>] ? warn_slowpath_common+0x77/0xa3
> > [ 0.004875] [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> > [ 0.004881] [<ffffffff813044dc>] ? identify_cpu+0x2f7/0x300
> > [ 0.004888] [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [ 0.004895] [<ffffffff810e81d5>] ? kmem_cache_alloc+0x8c/0xf0
> > [ 0.004900] [<ffffffff81510a16>] ? identify_boot_cpu+0x15/0x3e
> > [ 0.004904] [<ffffffff81510baa>] ? check_bugs+0x9/0x2e
> > [ 0.004910] [<ffffffff81509cce>] ? start_kernel+0x3cd/0x3e8
> > [ 0.004915] [<ffffffff8150bc93>] ? xen_start_kernel+0x586/0x58a
>
> You can ignore that one. It just means that you can't do profiling which we
> haven't yet up-ported.
>
> ..
>
> > The next one may not be Xen, but I only had the problem after running a
> > domU. After a while I get a kernel error and networking stops.
>
> And some other user with a bnx2 driver seems to see a similar problem. Let
> me CC them here.
>
> > This is the error:
> > [ 1411.813376] ------------[ cut here ]------------
> > [ 1411.813398] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()
>
> OK, this one is more worrisome.
>
> > [ 1411.813410] Hardware name: H8DGU
> > [ 1411.813417] NETDEV WATCHDOG: peth0 (igb): transmit queue 1 timed out
> > [ 1411.813424] Modules linked in: xt_physdev iptable_filter tun ip_tables
> > x_tables bridge stp sg sr_mod cdrom xfs exportfs ipmi_si ipmi_devintf
> > ipmi_watchdog ipmi_msghandler xen_evtchn blktap xenfs loop snd_pcm
> > snd_timer snd soundcore snd_page_alloc pcspkr psmouse joydev evdev
> > serio_raw i2c_piix4 edac_core k10temp edac_mce_amd i2c_core processor
> > button acpi_processor ext4 mbcache jbd2 crc16 usbhid hid dm_mod raid1
> > md_mod sd_mod crc_t10dif ata_generic usb_storage pata_atiixp ahci
> > ohci_hcd libata ehci_hcd usbcore nls_base scsi_mod igb dca thermal
> > thermal_sys [last unloaded: scsi_wait_scan]
> >
> > [ 1411.813656] Pid: 4, comm: ksoftirqd/0 Tainted: G W 2.6.32-5-xen-amd64 #1
> > [ 1411.813664] Call Trace:
> > [ 1411.813671] <IRQ> [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> > [ 1411.813697] [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> > [ 1411.813711] [<ffffffff8104ef00>] ? warn_slowpath_common+0x77/0xa3
> > [ 1411.813724] [<ffffffff81272d60>] ? dev_watchdog+0x0/0x194
> > [ 1411.813736] [<ffffffff8104ef88>] ? warn_slowpath_fmt+0x51/0x59
> > [ 1411.813751] [<ffffffff8130d42a>] ? _spin_unlock_irqrestore+0xd/0xe
> > [ 1411.813762] [<ffffffff8104b41e>] ? try_to_wake_up+0x289/0x29b
> > [ 1411.813778] [<ffffffff81272d34>] ? netif_tx_lock+0x3d/0x69
> > [ 1411.813791] [<ffffffff8125d7da>] ? netdev_drivername+0x3b/0x40
> > [ 1411.813803] [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> > [ 1411.813816] [<ffffffff8100ece2>] ? check_events+0x12/0x20
> > [ 1411.813827] [<ffffffff81040e42>] ? check_preempt_wakeup+0x0/0x268
> > [ 1411.813841] [<ffffffff8105b5ef>] ? run_timer_softirq+0x1c9/0x268
> > [ 1411.813855] [<ffffffff81054c9b>] ? __do_softirq+0xdd/0x1a6
> > [ 1411.813867] [<ffffffff81012cac>] ? call_softirq+0x1c/0x30
> > [ 1411.813873] <EOI> [<ffffffff8101422b>] ? do_softirq+0x3f/0x7c
> > [ 1411.813893] [<ffffffff810548c2>] ? ksoftirqd+0x5f/0xd3
> > [ 1411.813905] [<ffffffff81054863>] ? ksoftirqd+0x0/0xd3
> > [ 1411.813915] [<ffffffff81065c39>] ? kthread+0x79/0x81
> > [ 1411.813926] [<ffffffff81012baa>] ? child_rip+0xa/0x20
> > [ 1411.813937] [<ffffffff81011d61>] ? int_ret_from_sys_call+0x7/0x1b
> > [ 1411.813948] [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
> > [ 1411.813958] [<ffffffff81012ba0>] ? child_rip+0x0/0x20
> > [ 1411.813966] ---[ end trace a7919e7f17c0a727 ]---
> > [ 1412.052253] eth0: port 1(peth0) entering disabled state
> > [ 1635.796207] frontend_changed: backend/vbd/3/768: prepare for reconnect
> > [ 1647.137513] eth0: port 3(vif3.0) entering disabled state
> > [ 1647.157527] eth0: port 3(vif3.0) entering disabled state
> >
> > Kernel logging (proc) stopped.
> >
> > In this case dom0 locked up. Sometimes just networking stops and
> > sometimes networking recovers.
> >
> > Looks like it uses msi-x interrupts.
> >
> > Concerning igb error I have tried the following one at a time:
> > New igb driver from Intel site.
> > kernel parameter pcie_aspm=off
> > ethtool -K eth0 tx off on dom0
> > ethtool -K eth0 gro off on dom0
>
> OK.
>
> > It has never died doing iperf from dom0 or domu <> external.
> > Never died during network backup.
> >
> > Usually takes at least a few hours and has never made it a day running a
> > domU. Wish I could get it to die faster :-)
> > Any ideas?
> > I'm pretty much down to trying different network cards
>
> Did you try that? Did that make any difference?
Not tested yet, though I did install one.
I think I found a way to keep it running.
With the new igb driver I built from the current Intel source, I added the
module parameter IntMode=1.
This puts it in MSI mode; it was in MSI-X mode.
It has never died with that setting.
It has been up for over a day now.
I have no real experience with MSI-X. I think it's the first time I have
seen a driver use MSI-X interrupts.
Maybe that gives you more ideas?
>
> > Any ideas?
>
> There is a Xen parameter called 'noirqbalance' . Try that. Also see if you
> can limit the CPUs in the dom0 using these two arguments on Xen
> hypervisor:
>
Should I turn off the irqbalance daemon also?
Just in case you wonder, it does without it.
> dom0_vcpus=2 dom0_vcpus_pin=1
>
>
> It would be interesting to narrow down _when_ you trigger this failure, b/c
> we can poll Xen to see what the MSIs are ('xl debug-keys M') _before_ and
> _after_ your failure to see if something is amiss.
>
> Mainly to figure out if the vectors are moving around the CPUs (or not)
>
> (XEN) MSI 29 vec=21 lowest edge assert log lowest dest=00000001
> mask=0/0/-1
>
> and also 'xl debug-keys i' to see if the domain has ACK-ed the interrupt:
> (XEN) IRQ: 29 affinity:00000000,00000000,00000000,00000001 vec:21
> type=PCI-MSI status=00000010 in-flight=0 domain-list=0:275(----),
>
> (the last '----' might have something else in it - if so, that is a
> sign that dom0 hasn't picked up the event/vector).
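The before/after comparison suggested above could be scripted along these lines. This is only a rough sketch: it assumes the two debug-key dumps have been saved as text (for example from 'xl dmesg') with each "(XEN)" entry on a single unwrapped line, and the regular expressions are modelled on the two sample lines quoted above, so other Xen versions may need different patterns. The names parse_dump and compare are made up for illustration.

```python
import re

# Patterns modelled on the two sample lines quoted in this thread (assumption:
# each "(XEN)" entry sits on one line, not wrapped as in the mail).
MSI_RE = re.compile(r"\(XEN\)\s+MSI\s+(\d+)\s+vec=([0-9a-fA-F]+).*?dest=([0-9a-fA-F]+)")
IRQ_RE = re.compile(
    r"\(XEN\)\s+IRQ:\s*(\d+)\s+affinity:(\S+)\s+vec:([0-9a-fA-F]+).*?domain-list=(.*)"
)
FLAGS_RE = re.compile(r"(\d+):\s*(\d+)\(([^)]*)\)")


def parse_dump(text):
    """Collect vector, destination, affinity and non-empty flags per IRQ."""
    state = {}
    for line in text.splitlines():
        m = MSI_RE.search(line)
        if m:
            entry = state.setdefault(int(m.group(1)), {})
            entry["vec"] = m.group(2)
            entry["dest"] = m.group(3)
            continue
        m = IRQ_RE.search(line)
        if m:
            entry = state.setdefault(int(m.group(1)), {})
            entry["affinity"] = m.group(2)
            entry["vec"] = m.group(3)
            # Keep only domain-list entries whose flags are not all dashes.
            entry["pending"] = [
                f"{dom}:{pirq}({flags})"
                for dom, pirq, flags in FLAGS_RE.findall(m.group(4))
                if flags.strip("-")
            ]
    return state


def compare(before, after):
    """List IRQs whose vector/destination/affinity moved or whose flags are set."""
    notes = []
    for irq in sorted(after):
        old, new = before.get(irq, {}), after[irq]
        for key in ("vec", "dest", "affinity"):
            if key in old and key in new and old[key] != new[key]:
                notes.append(f"IRQ {irq}: {key} changed {old[key]} -> {new[key]}")
        for item in new.get("pending", []):
            notes.append(f"IRQ {irq}: flags set in {item}")
    return notes
```

Feeding it the text captured before and after a failure, as in compare(parse_dump(before_text), parse_dump(after_text)), gives a short list of IRQs to look at by hand; an empty list only means nothing matched these patterns, not that the interrupts are healthy.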
Much of my frustration is that I have not found a way to get it to fail other
than waiting a long time :-(
Thanks
John
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel