xen-devel

Re: [Xen-devel] blocking Xen 3.X production use: soft lockup bugs

To: Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx>
Subject: Re: [Xen-devel] blocking Xen 3.X production use: soft lockup bugs
From: Steve Traugott <stevegt@xxxxxxxxxxxxx>
Date: Wed, 2 Aug 2006 17:27:01 -0700
Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Wed, 02 Aug 2006 17:33:37 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <A95E2296287EAD4EB592B5DEEFCE0E9D572305@xxxxxxxxxxxxxxxxxxxxxxxxxxx>; from m+Ian.Pratt@xxxxxxxxxxxx on Wed, Aug 02, 2006 at 11:25:45PM +0100
References: <A95E2296287EAD4EB592B5DEEFCE0E9D572305@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.2.5i

Hi Ian,

Thanks for your patience...

On Wed, Aug 02, 2006 at 11:25:45PM +0100, Ian Pratt wrote:
> > The problem (or something that looks identical) is described in
> > several tickets, status currently NEW or REOPENED, no clear
> > resolution:
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705
> 
> There's very little to go on here. Two of the bugs are actually from the
> same guy. One of the others is x86_64, the other two are 32b. 

That's what I was starting to realize -- a lot of folks (including me)
have been classing all soft lockups together, without digging deeper.

> The only thing in common about the stack traces is that networking
> functions seem to feature.

Might be the same in my case; see the stack trace in my message in
this thread a few minutes ago, copied below.  The 'isconf' process you
see there does a lot of UDP and TCP traffic for file transfers, as
well as moderate disk I/O.

> Taking a wild guess, are you doing some kind of unusual networking setup
> involving iptables rules?

Nope.  Right now I can't think of anything I'm doing that's not the
standard Xen bridging setup.

> > Do we have any consensus that this bug is fixed at all in
> > xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
> > lockups in testing *not* hitting them any more on the same hardware?
> > If so, what changeset are you on now?
> 
> Soft lockups could be due to a huge variety of causes. It's unlikely to
> be a hardware issue, and since the problems seem to be experienced by a
> very small number of users my guess would be that it's configuration
> dependent, most likely networking.
> 
> > If anyone needs any more information, just let me know.  As usual, if
> > anyone wants login and console server access to one of these boxes to
> > chase this down, I'm more than happy to provide that.
> 
> Having a really detailed bug report would really be the best way of
> proceeding.

This is why I was thinking about starting a "how to report soft
lockups" wiki page; I think we haven't been giving you enough.  Is
there already a more generic Xen bug reporting howto somewhere, or
should I have at it, using your questions below as a start? 

> When this happens, does it just affect one guest? 

We typically see error messages on only one guest's console, but other
guests and dom0 tend to lock up for ~30 seconds as well.

> What's the stack trace? 

See the dmesg below (this is the same one I just posted a few minutes
ago, in my previous message, copying here for reference).

> How many VCPUs has the guest got? 

One.  So far I've seen soft lockups both with and without nosmp on the
Xen command line on our Netengines, but I can't yet tell you whether
they produced the same stack trace.  I haven't tried nosmp on the x330s
yet; I'm about to.
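
To compare reports across runs, something like the following could
help.  It is only a minimal sketch in Python, not part of any Xen or
kernel tooling: the function names (parse_soft_lockups, signature,
summarize) are made up for illustration, and it assumes the console
log uses the same "BUG: soft lockup" / "Pid:" / "EIP is at" layout as
the dmesg at the end of this message.

```python
import re
from collections import Counter

# Patterns assume the 2.6.16-era report layout seen in this message.
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
PID_RE = re.compile(r"Pid:\s*(\d+),\s*comm:\s*(\S+)")
EIP_RE = re.compile(r"EIP is at (\S+)")


def parse_soft_lockups(log_text):
    """Return one dict per 'BUG: soft lockup' report found in log_text."""
    reports = []
    current = None
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            current = {"cpu": int(match.group(1)),
                       "pid": None, "comm": None, "eip": None}
            reports.append(current)
            continue
        if current is None:
            continue  # nothing to attach this line to yet
        match = PID_RE.search(line)
        if match and current["pid"] is None:
            current["pid"] = int(match.group(1))
            current["comm"] = match.group(2)
            continue
        match = EIP_RE.search(line)
        if match and current["eip"] is None:
            current["eip"] = match.group(1)
    return reports


def signature(report):
    """Symbol name without its offset, or 'unresolved' for a raw address."""
    eip = report["eip"] or "unknown"
    if eip.startswith("0x"):
        return "unresolved"
    return eip.split("+")[0]


def summarize(log_text):
    """Count reports by (process name, EIP signature)."""
    return Counter((r["comm"], signature(r))
                   for r in parse_soft_lockups(log_text))
```

Feeding two console logs through summarize() and comparing the
counters would at least show whether the lockups land in the same
place, which is the part I can't answer by eye yet.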

> Is the guest completely hosed or is it still pingable? 

Tends to be unpingable for ~30 seconds and usually recovers, but
sometimes it corrupts the filesystem beyond recovery (the first
machine it destroyed was our primary KDC, ouch...)

> What about guest console echo? 

Works, but latency is on the order of several seconds or more for ~30
seconds.

> What about 'xm sysrq'? 

Unable to answer this yet due to that high dom0 latency.

> Looking in dom0, are you still seeing packets go to/from the
> associated VIF? 

Unable to answer this yet due to that high dom0 latency.

> How many network interfaces has the guest got? 

Only eth0 and lo.

> What's the precise networking setup in dom0? 

Standard Xen bridging config:

n4h34:~# ifconfig | grep encap
eth0      Link encap:Ethernet  HWaddr 00:02:55:C7:CA:D8  
lo        Link encap:Local Loopback  
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
vif1.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
vif5.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
n4h34:~# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
n4h34:~# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.27.4.0       *               255.255.255.0   U     0      0        0 eth0
default         10.27.4.254     0.0.0.0         UG    0      0        0 eth0
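
For a bug-reporting howto, the same information could be gathered in
one go.  The following is a rough sketch, assuming Python 3.7 or later
in dom0; the collect() helper and DEFAULT_COMMANDS list are
hypothetical names, and the list only mirrors the commands I ran above
(with route switched to "route -n" to avoid DNS lookups).

```python
import shutil
import subprocess

# Commands mirroring the ones shown in this message; extend as needed.
DEFAULT_COMMANDS = [
    ["ifconfig"],
    ["iptables", "-L"],
    ["route", "-n"],
]


def collect(commands=DEFAULT_COMMANDS, timeout=10):
    """Run each command and return one text report with all the output."""
    sections = []
    for cmd in commands:
        header = "# " + " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            # Keep going so one missing tool doesn't lose the whole report.
            sections.append(header + "\n(command not found)")
            continue
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout)
            body = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            body = "(timed out after %d seconds)" % timeout
        sections.append(header + "\n" + body.rstrip())
    return "\n\n".join(sections) + "\n"
```

Running print(collect()) as root in dom0 and pasting the result into
the ticket would cover the networking questions above; the timeout is
there because dom0 itself gets sluggish during a lockup.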

> Can you come up with a recipe for reproduction, ideally with a
> single guest? 

It looks like it can be reliably reproduced by starting a second guest
and running a steady mix of network and disk I/O -- isconf, for
instance, runs during rc and updates the local disk image by pulling
new packages over the network and installing them on the fly
(http://trac.t7a.org/isconf/), so it is usually the first to trigger
the bug in our environment.  

I haven't seen the bug as often with only one guest.  For example, I
built an AFS server in domain 1 on this same x330, generating lots of
disk and network I/O in the process, ran it for days with no problems,
then tried to start a copy of the same base image up as domain 2 on
the same box and got the dmesg you see here; only the hostname, IP,
MAC etc. were different.  

I'll see if I can come up with a simple python script or something
which can trigger it.
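
A first cut might look like the sketch below.  To be clear, this is
untested against the bug and not known to trigger it; it just
generates concurrent fsync-heavy disk writes and a steady TCP stream,
which is my guess at the relevant mix.  All names (run_load,
disk_worker, net_worker, sink_server) are made up here.  With no peer
given it streams over loopback, which does not cross the guest's VIF,
so a real test would pass peer=(host, port) for a hypothetical
discard-style listener on another machine.

```python
import os
import socket
import tempfile
import threading
import time

CHUNK = b"x" * 65536
MAX_FILE_BYTES = 64 * 1024 * 1024  # rewind here so the file stays bounded


def disk_worker(path, deadline, stats):
    """Write and fsync chunks to path until the deadline passes."""
    with open(path, "wb") as f:
        while time.monotonic() < deadline:
            if f.tell() >= MAX_FILE_BYTES:
                f.seek(0)
            f.write(CHUNK)
            f.flush()
            os.fsync(f.fileno())
            stats["disk_bytes"] += len(CHUNK)


def sink_server(listener):
    """Accept one connection and discard everything it sends."""
    conn, _ = listener.accept()
    with conn:
        while conn.recv(65536):
            pass


def net_worker(addr, deadline, stats):
    """Stream chunks to addr over TCP until the deadline passes."""
    with socket.create_connection(addr) as s:
        while time.monotonic() < deadline:
            s.sendall(CHUNK)
            stats["net_bytes"] += len(CHUNK)


def run_load(seconds=60.0, peer=None, directory=None):
    """Run disk and network load concurrently; return bytes moved."""
    stats = {"disk_bytes": 0, "net_bytes": 0}
    deadline = time.monotonic() + seconds
    threads = []
    listener = None
    if peer is None:
        # No remote peer given: fall back to a local sink on loopback.
        listener = socket.socket()
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        peer = listener.getsockname()
        threads.append(threading.Thread(target=sink_server,
                                        args=(listener,)))
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    threads.append(threading.Thread(target=disk_worker,
                                    args=(path, deadline, stats)))
    threads.append(threading.Thread(target=net_worker,
                                    args=(peer, deadline, stats)))
    try:
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        if listener is not None:
            listener.close()
        os.remove(path)
    return stats
```

The idea would be to run this in a second guest while the first guest
is busy, and watch both consoles for "BUG: soft lockup" messages.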

The only other "unusual" thing I can think of about this configuration
is that it's using DRBD on top of EVMS in dom0 for the guest volumes;
this would also increase dom0 network traffic during any guest disk
I/O.  I hope to heck this doesn't turn out to be a DRBD
incompatibility; we've used DRBD with Xen since the early 2.X days,
and it's been solid.  I'll have to do some testing to see if I can
eliminate DRBD as a factor.

Steve


n4h34:~# xm create -c /etc/xen/auto/build2.t7a.org
Using config file "/etc/xen/auto/build2.t7a.org".
Started domain build2.t7a.org
Linux version 2.6.16.13-xen (root@n4h33) (gcc version 3.3.5 (Debian 1:3.3.5-12)) #2 SMP Sun Jun 11 14:25:16 PDT 2006
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 0000000008000000 (usable)
0MB HIGHMEM available.
136MB LOWMEM available.
ACPI in unprivileged domain disabled
IRQ lockup detection disabled
Built 1 zonelists
Kernel command line:  root=/dev/sda1 2
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 16384 bytes)
Xen reported: 1130.113 MHz processor.
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Software IO TLB disabled
vmalloc area: c9000000-fb7fe000, maxmem 33ffe000
Memory: 114612k/139264k available (3368k kernel code, 16308k reserved, 1033k data, 196k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 2261.96 BogoMIPS (lpj=11309833)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Checking 'hlt' instruction... OK.
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 9535k freed
Grant table initialized
NET: Registered protocol family 16
Brought up 1 CPUs
PCI: setting up Xen PCI frontend stub
ACPI: Subsystem revision 20060127
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
xen_mem: Initialising balloon driver.
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: System does not support PCI
PCI: System does not support PCI
IA-32 Microcode Update Driver: v1.14-xen <tigran@xxxxxxxxxxx>
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
JFS: nTxBlock = 1024, nTxLock = 8192
SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
PNP: No PS/2 controller found. Probing ports directly.
i8042.c: No controller found.
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Xen virtual console successfully installed as tty1
Event-channel device installed.
blkif_init: reqs=64, pages=704, mmap_vstart=0xc7400000
netfront: Initialising virtual ethernet driver.
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
Registering block device major 8
ide-floppy driver 0.99.newide
Fusion MPT base driver 3.03.07
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.07
Fusion MPT misc device (ioctl) driver 3.03.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)
usbmon: debugfs is not available
usbcore: registered new driver libusual
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 8192 bind 8192)
TCP reno registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 8
NET: Registered protocol family 20
Using IPI No-Shortcut mode
Freeing unused kernel memory: 196k freed
Loading, please wait...
Begin: Loading essential drivers... ...
tg3: no version for "struct_module" found: kernel tainted.
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@xxxxxxxxxxxxx> and others
Intel(R) PRO/1000 Network Driver - version 6.3.9-k4
Copyright (c) 1999-2005 Intel Corporation.
Done.
Begin: Running /scripts/init-premount ...
FATAL: Error inserting fan (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/fan.ko): No such device
FATAL: Error inserting thermal (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/thermal.ko): No such device
Done.
Begin: Mounting root file system... ...
Begin: Running /scripts/local-top ...
Done.
Begin: Running /scripts/local-premount ...
Done.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Begin: Running /scripts/log-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
Done.
mount: Mounting /sys on /root/sys failed: No such file or directory
INIT: version 2.85 booting
Activating swap.
Checking root file system...
fsck 1.39 (29-May-2006)
/dev/sda1: clean, 21526/917504 files, 245920/1835007 blocks
EXT3 FS on sda1, internal journal
System time was Wed Aug  2 22:17:34 UTC 2006.
Setting the System Clock using the Hardware Clock as reference...
System Clock set. System local time is now Wed Aug  2 22:17:37 UTC 2006.
Loading device-mapper support.
Checking all file systems...
fsck 1.39 (29-May-2006)
Setting kernel variables..
Mounting local filesystems...
Adding 524280k swap on /swap00.  Priority:-1 extents:134 across:533176k
Cleaning /tmp /var/run /var/lock.
Running 0dns-down to make sure resolv.conf is ok...done.
Cleaning: /etc/network/ifstate.
Setting up IP spoofing protection: rp_filter.
Configuring network interfaces...done.
Loading the saved-state of the serial devices...
/dev/ttyS0: No such file or directory
/dev/ttyS0: No such file or directory
/dev/ttyS1: No such file or directory
/dev/ttyS1: No such file or directory
Not setting System Clock
Initializing random number generator...done.
Recovering nvi editor sessions... done.
INIT: Entering runlevel: 2
Starting isconf daemonRunning isconf updateisconf: info: build2.t7a.org is on guest-1 branch
isconf: info: may reboot...
isconf: info: checking for updates
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.911958506882
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.999292957677
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.239902520967
BUG: soft lockup detected on CPU#0!

Pid: 2383, comm:               isconf
EIP: 0073:[<080c9763>] CPU: 0
EIP is at 0x80c9763
 ESP: 007b:bfcc962c EFLAGS: 00200282    Tainted: GF      (2.6.16.13-xen #2)
EAX: 00000001 EBX: 0000003a ECX: bfcc9624 EDX: 00000000
ESI: 08137cb4 EDI: 00000001 EBP: bfcc9638 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.34:65028/t7a.org/block/ff1/ff1276f7811aeeade18d54a6c3578261ff36ecbb-4fb47b36cda57ae95af56372f03bb2ca-1?challenge=0.265409462016
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!

Pid: 2383, comm:               isconf
EIP: 0073:[<080af84d>] CPU: 0
EIP is at 0x80af84d
 ESP: 007b:bfcc96d0 EFLAGS: 00200246    Tainted: GF      (2.6.16.13-xen #2)
EAX: 00000001 EBX: 082031fe ECX: 082031fe EDX: b7af1f8c
ESI: 00000000 EDI: 082030ec EBP: bfcc9838 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/c0e/c0e10bc50572deb89da6e9d96ac5971a39fddc65-fc3558eaffc90497248f97f9b0e3a924-1?challenge=0.130730726051
isconf: info: updated /etc/ca-certificates.conf
isconf: info: running ['update-ca-certificates']
Updating certificates in /etc/ssl/certs....done.
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!

Pid: 1, comm:                 init
EIP: 0061:[<c0322fe1>] CPU: 0
EIP is at netif_poll+0x101/0x810
 EFLAGS: 00000216    Tainted: GF      (2.6.16.13-xen #2)
EAX: 00000037 EBX: c0945180 ECX: 0001134e EDX: c0945000
ESI: c0f48280 EDI: c0f499e8 EBP: c09451c0 DS: 007b ES: 007b
CR0: 8005003b CR2: b7d579e0 CR3: 0057e000 CR4: 00000640
 [<c03d891a>] net_rx_action+0xea/0x230
 [<c0124cb5>] __do_softirq+0xf5/0x120
 [<c0124d75>] do_softirq+0x95/0xa0
 [<c0106c0f>] do_IRQ+0x1f/0x30
 [<c0312f58>] evtchn_do_upcall+0xa8/0xf0
 [<c0105178>] hypervisor_callback+0x2c/0x34
 [<c02c2081>] __copy_user_intel+0x31/0xb0
 [<c02c2220>] __copy_to_user_ll+0x70/0x80
 [<c02c22f2>] copy_to_user+0x42/0x60
 [<c0171068>] cp_new_stat64+0xf8/0x110
 [<c01710b7>] sys_stat64+0x37/0x40
 [<c0104fb5>] syscall_call+0x7/0xb
isconf: warning: clierr:  Connection reset by peer
Starting system log daemon: syslogd.
Starting kernel log daemon: klogd.
No configuration file was found for slapd at /etc/ldap/slapd.conf.
If you have moved the slapd configuration file please modify
/etc/default/slapd to reflect this.  If you chose to not
configure slapd during installation then you need to do so
prior to attempting to start slapd.
An example slapd.conf is in /usr/share/slapd
Starting Heimdal KDC: heimdal-kdc.
Starting Heimdal password server: kpasswdd.
Starting internet superserver: inetd.
Starting PCMCIA services: module directory /lib/modules/2.6.16.13-xen/pcmcia not found.
Starting OpenBSD Secure Shell server: sshd.
Starting deferred execution scheduler: atd.
Starting periodic command scheduler: cron.

Debian GNU/Linux testing/unstable build2.t7a.org tty1

build2.t7a.org login:



-- 
Stephen G. Traugott  (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@xxxxxxxxxxxxx 
http://www.stevegt.com -- http://Infrastructures.Org
