xen-devel

Re: [Xen-devel] Network dies and kernel errors

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Subject: Re: [Xen-devel] Network dies and kernel errors
From: John McMonagle <johnm@xxxxxxxxxxx>
Date: Fri, 29 Jul 2011 10:38:20 -0500
Cc: tinnycloud@xxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Fri, 29 Jul 2011 08:39:11 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20110729150347.GF5458@xxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Organization: Advocap Inc
References: <201107251418.21569.johnm@xxxxxxxxxxx> <20110729150347.GF5458@xxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: KMail/1.13.5 (Linux/2.6.32-5-amd64; KDE/4.4.5; x86_64; ; )
On Friday, July 29, 2011 10:03:48 am Konrad Rzeszutek Wilk wrote:
> On Mon, Jul 25, 2011 at 02:18:21PM -0500, John McMonagle wrote:
> > Have a new amd 6100 based server.
> > http://www.supermicro.com/Aplus/system/2U/2022/AS-2022G-URF.cfm
> > Running debian squeeze with debian 2.6.32 xen kernel
> > Running xen 4.1.1 built from source from xen.org
> > 
> > I'm seeing 2 errors.
> > during boot get this:
> > 
> > [    0.004823] ------------[ cut here ]------------
> > [    0.004833] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/arch/x86/xen/enlighten.c:726 init_hw_perf_events+0x32d/0x3cd()
> > [    0.004838] Hardware name: H8DGU
> > [    0.004841] Modules linked in:
> > [    0.004847] Pid: 0, comm: swapper Not tainted 2.6.32-5-xen-amd64 #1
> > [    0.004850] Call Trace:
> > [    0.004857]  [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> > [    0.004862]  [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> > [    0.004870]  [<ffffffff8104ef00>] ? warn_slowpath_common+0x77/0xa3
> > [    0.004875]  [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> > [    0.004881]  [<ffffffff813044dc>] ? identify_cpu+0x2f7/0x300
> > [    0.004888]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [    0.004895]  [<ffffffff810e81d5>] ? kmem_cache_alloc+0x8c/0xf0
> > [    0.004900]  [<ffffffff81510a16>] ? identify_boot_cpu+0x15/0x3e
> > [    0.004904]  [<ffffffff81510baa>] ? check_bugs+0x9/0x2e
> > [    0.004910]  [<ffffffff81509cce>] ? start_kernel+0x3cd/0x3e8
> > [    0.004915]  [<ffffffff8150bc93>] ? xen_start_kernel+0x586/0x58a
> 
> You can ignore that one. It just means that you can't do profiling which we
> haven't yet up-ported.
> 
> ..
> 
> > Then next one may not be xen but I only had the problem after running a
> > domu. After a while I get kernel error and networking stops.
> 
> And some other user with a bnx2 driver seems to see a similar problem. Let
> me CC them here.
> 
> > This is the error:
> > [ 1411.813376] ------------[ cut here ]------------
> > [ 1411.813398] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()
> 
> OK, this one is more worrisome.
> 
> > [ 1411.813410] Hardware name: H8DGU
> > [ 1411.813417] NETDEV WATCHDOG: peth0 (igb): transmit queue 1 timed out
> > [ 1411.813424] Modules linked in: xt_physdev iptable_filter tun ip_tables
> > x_tables bridge stp sg sr_mod cdrom xfs exportfs ipmi_si ipmi_devintf
> > ipmi_watchdog ipmi_msghandler xen_evtchn blktap xenfs loop snd_pcm
> > snd_timer snd soundcore snd_page_alloc pcspkr psmouse joydev evdev
> > serio_raw i2c_piix4 edac_core k10temp edac_mce_amd i2c_core processor
> > button acpi_processor ext4 mbcache jbd2 crc16 usbhid hid dm_mod raid1
> > md_mod sd_mod crc_t10dif ata_generic usb_storage pata_atiixp ahci
> > ohci_hcd libata ehci_hcd usbcore nls_base scsi_mod igb dca thermal
> > thermal_sys [last unloaded: scsi_wait_scan]
> > [ 1411.813656] Pid: 4, comm: ksoftirqd/0 Tainted: G        W  2.6.32-5-xen-amd64 #1
> > [ 1411.813664] Call Trace:
> > [ 1411.813671]  <IRQ>  [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> > [ 1411.813697]  [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> > [ 1411.813711]  [<ffffffff8104ef00>] ? warn_slowpath_common+0x77/0xa3
> > [ 1411.813724]  [<ffffffff81272d60>] ? dev_watchdog+0x0/0x194
> > [ 1411.813736]  [<ffffffff8104ef88>] ? warn_slowpath_fmt+0x51/0x59
> > [ 1411.813751]  [<ffffffff8130d42a>] ? _spin_unlock_irqrestore+0xd/0xe
> > [ 1411.813762]  [<ffffffff8104b41e>] ? try_to_wake_up+0x289/0x29b
> > [ 1411.813778]  [<ffffffff81272d34>] ? netif_tx_lock+0x3d/0x69
> > [ 1411.813791]  [<ffffffff8125d7da>] ? netdev_drivername+0x3b/0x40
> > [ 1411.813803]  [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> > [ 1411.813816]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
> > [ 1411.813827]  [<ffffffff81040e42>] ? check_preempt_wakeup+0x0/0x268
> > [ 1411.813841]  [<ffffffff8105b5ef>] ? run_timer_softirq+0x1c9/0x268
> > [ 1411.813855]  [<ffffffff81054c9b>] ? __do_softirq+0xdd/0x1a6
> > [ 1411.813867]  [<ffffffff81012cac>] ? call_softirq+0x1c/0x30
> > [ 1411.813873]  <EOI>  [<ffffffff8101422b>] ? do_softirq+0x3f/0x7c
> > [ 1411.813893]  [<ffffffff810548c2>] ? ksoftirqd+0x5f/0xd3
> > [ 1411.813905]  [<ffffffff81054863>] ? ksoftirqd+0x0/0xd3
> > [ 1411.813915]  [<ffffffff81065c39>] ? kthread+0x79/0x81
> > [ 1411.813926]  [<ffffffff81012baa>] ? child_rip+0xa/0x20
> > [ 1411.813937]  [<ffffffff81011d61>] ? int_ret_from_sys_call+0x7/0x1b
> > [ 1411.813948]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
> > [ 1411.813958]  [<ffffffff81012ba0>] ? child_rip+0x0/0x20
> > [ 1411.813966] ---[ end trace a7919e7f17c0a727 ]---
> > [ 1412.052253] eth0: port 1(peth0) entering disabled state
> > [ 1635.796207] frontend_changed: backend/vbd/3/768: prepare for reconnect
> > [ 1647.137513] eth0: port 3(vif3.0) entering disabled state
> > [ 1647.157527] eth0: port 3(vif3.0) entering disabled state
> > 
> >  Kernel logging (proc) stopped.
> > 
> > In this case dom0 locked up. Sometimes just networking stops, and
> > sometimes networking recovers.
> > 
> > Looks like it uses msi-x interrupts.
> > 
> > Concerning the igb error, I have tried the following, one at a time:
> > New igb driver from the Intel site.
> > Kernel parameter pcie_aspm=off
> > ethtool -K eth0 tx off   (on dom0)
> > ethtool -K eth0 gro off  (on dom0)
> 
> OK.
> 
> > It has never died doing iperf from dom0 or domu  <> external.
> > Never died during network backup.
> > 
> > It usually takes at least a few hours, and it has never made it a day
> > running a domu. Wish I could get it to die faster :-)
> > Any ideas?
> > I'm pretty much down to trying different network cards
> 
> Did you try that? Did that make any difference?

Not tested yet, though I did install one.

I think I found a way to keep it running.
With the new igb driver, which I built from the current Intel source, I added
the module parameter IntMode=1.

This puts it in MSI mode. It was in MSI-X mode.
It has never died with that setting.
It has now been up for over a day.
I have no real experience with MSI-X. I think this is the first time I have
seen a driver use MSI-X interrupts.
Maybe that gives you more ideas?


> 
> > Any ideas?
> 
> There is a Xen parameter called 'noirqbalance' . Try that. Also see if you
> can limit the CPUs in the dom0 using these two arguments on Xen
> hypervisor:
> 
Should I turn off the irqbalance daemon also?
In case you are wondering, it dies without it too.

> dom0_vcpus=2 dom0_vcpus_pin=1
> 
> 
> It would be interesting to narrow down _when_ you trigger this failure. B/c
> we can pull Xen to see what the MSI's are 'xl debug-keys M' _before_ and
> _after_ your failure to see if something is amiss.
> 
> Mainly to figure out if the vectors are moving around the CPUs (or not)
> 
> (XEN)  MSI    29 vec=21 lowest  edge   assert  log lowest dest=00000001 mask=0/0/-1
> 
> and also 'xl debug-keys i' to see if the domain has ACK-ed the interrupt:
> (XEN)    IRQ:  29 affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:275(----),
> 
> (the last '----' might have something else in it - if so that is a
> sign that dom0 hasn't picked up the event/vector).
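
The before/after comparison suggested above could be scripted roughly as
follows. This is only a sketch: the regular expressions are guessed from the
two sample lines quoted in this message (each taken as a single unwrapped
line), the function names are made up for illustration, and the output
format of other Xen versions may differ.

```python
import re

# Patterns are assumptions derived only from the two sample lines in this
# thread; they are not taken from Xen documentation.
MSI_RE = re.compile(r"MSI\s+(\d+)\s+vec=([0-9a-fA-F]+).*?dest=([0-9a-fA-F]+)")
IRQ_RE = re.compile(
    r"IRQ:\s*(\d+)\s+affinity:(\S+)\s+vec:([0-9a-fA-F]+).*?"
    r"in-flight=(\d+)\s+domain-list=(\d+):\s*(\d+)\(([^)]*)\)"
)


def parse_msi(text):
    """Map MSI number -> (vector, destination) for every MSI line found."""
    table = {}
    for line in text.splitlines():
        m = MSI_RE.search(line)
        if m:
            table[int(m.group(1))] = (m.group(2), m.group(3))
    return table


def moved_msis(before, after):
    """Return MSIs whose vector or destination differs between two dumps."""
    b, a = parse_msi(before), parse_msi(after)
    return {n: (b[n], a[n]) for n in b.keys() & a.keys() if b[n] != a[n]}


def unacked_irqs(text):
    """Return IRQ numbers whose trailing flag field is not all dashes."""
    pending = []
    for line in text.splitlines():
        m = IRQ_RE.search(line)
        if m and set(m.group(7)) - {"-"}:
            pending.append(int(m.group(1)))
    return pending
```

One way to use it would be to save the hypervisor console text (for example
from 'xl dmesg') after running the debug keys before the failure and again
after it, then pass the two captures to moved_msis() and the later one to
unacked_irqs().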

Much of my frustration is that I have not found a way to get it to fail other 
than waiting a long time :-(

Thanks

John
