RE: [Xen-devel] disk io errors possibly caused by high network l

To:	"Ian Pratt" <Ian.Pratt@xxxxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject:	RE: [Xen-devel] disk io errors possibly caused by high network load?
From:	Moritz Möller <m.moeller@xxxxxxxxxxxx>
Date:	Fri, 19 Sep 2008 15:00:33 +0200
Cc:
Delivery-date:	Fri, 19 Sep 2008 06:01:38 -0700
Envelope-to:	www-data@xxxxxxxxxxxxxxxxxxx
List-help:	<mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id:	Xen developer discussion <xen-devel.lists.xensource.com>
List-post:	<mailto:xen-devel@lists.xensource.com>
List-subscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References:	<122AA196D7CE4E4DBB92911EF4AB5AB3846045@xxxxxxxxxxxxxxxxxxxxxxxxxx> <DD74FBB8EE28D441903D56487861CD9D362C58F1@xxxxxxxxxxxxxxxxxxxxxx>
Sender:	xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index:	AckaTGDzATGLlbxxRXaBZvbEdrbgFgACImAQAAB97RAAACyE4A==
Thread-topic:	[Xen-devel] disk io errors possibly caused by high network load?

Okay - wrong key. Message continued

-----Original Message-----
From: Moritz Möller 
Sent: Friday, September 19, 2008 3:00 PM
To: 'Ian Pratt'; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] disk io errors possibly caused by high network load?

We rebooted the machines very quickly because it was a production
system, so I didn't have time to copy the logs. I see nothing about
this in the logfiles on the disks, probably because the IO was
already down.

The machines are Supermicro, Intel Xeon Quad or Dual-Quadcore, 8 to 32
GB RAM. Some have an mdraid setup with two SATA drives on the onboard
SATA controller (Intel ICH), others have a dedicated 3ware / AMCC
9660 or similar.

The machines that crashed were on different power lines and connected to
different switches, although on the same network segment. There was
also no physical interference.

The error was reported by both domU and dom0. Both said that the local
disk (either sda or sdb on mdraid systems, and sda on raid systems)
reported an I/O error, but gave no more specific information.
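
For future incidents, a small script along these lines could tally which
devices report errors in a saved kernel log. It is only a sketch: the line
format is assumed from the end_request message mentioned in this thread and
may differ between kernel versions, and the function name and sample lines
are made up for illustration.

```python
import re
from collections import Counter

# Assumed format, e.g. "end_request: I/O error, dev sda, sector 12345".
# The commas are treated as optional because formats differ between kernels.
IO_ERROR_RE = re.compile(
    r"end_request: I/O error,?\s+dev\s+([^\s,]+),?\s+sector\s+(\d+)"
)

def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, not taken from the affected machines.
sample = [
    "kernel: end_request: I/O error, dev sda, sector 1024",
    "kernel: end_request: I/O error, dev sdb, sector 2048",
    "kernel: end_request: I/O error, dev sda, sector 4096",
    "kernel: e1000: eth0: link up",
]
print(count_io_errors(sample))
```

Run against a log copied off the machine before the reboot, this would at
least show whether both halves of an mdraid mirror failed or only one.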

The network card is an Intel e1000.

lsmod:

nfs                   257112  1
w83792d                39320  0
w83781d                44840  0
i2c_isa                14720  1 w83781d
w83793                 46360  0
hwmon_vid              11264  2 w83781d,w83793
hwmon                  12040  3 w83792d,w83781d,w83793
ipmi_devintf           20112  0
ipmi_si                52812  0
ipmi_msghandler        47096  2 ipmi_devintf,ipmi_si
nls_utf8               10624  3
cifs                  228112  3
xt_physdev             11152  4
iptable_filter         11392  1
ip_tables              28648  1 iptable_filter
x_tables               29064  2 xt_physdev,ip_tables
ipv6                  339072  22
bridge                 64936  0
8021q                  29584  0
nfsd                  263848  1
exportfs               14336  1 nfsd
lockd                  74800  2 nfs,nfsd
nfs_acl                12160  2 nfs,nfsd
sunrpc                186344  5 nfs,nfsd,lockd,nfs_acl
blkbk                  30776  0 [permanent]
netbk                 105184  0 [permanent]
loop                   26768  0
8250_pnp               19968  0
sg                     45224  0
sr_mod                 26148  0
cdrom                  44072  1 sr_mod
i2c_i801               17052  0
iTCO_wdt               20432  0
iTCO_vendor_support    12548  1 iTCO_wdt
8250                   50120  1 8250_pnp
serial_core            31616  1 8250
i2c_core               32256  5 w83792d,w83781d,i2c_isa,w83793,i2c_i801
serio_raw              16004  0
pcspkr                 11776  0
joydev                 19584  0
ext3                  141200  2
jbd                    72432  1 ext3
mbcache                18184  1 ext3
dm_mirror              30528  0
dm_snapshot            25416  0
dm_mod                 69520  21 dm_mirror,dm_snapshot
raid1                  32768  3
sd_mod                 35200  8
usb_storage            90304  0
ata_piix               25092  6
ata_generic            17412  0
floppy                 68904  0
ehci_hcd               41100  0
uhci_hcd               32544  0
libata                126896  2 ata_piix,ata_generic
scsi_mod              166968  5 sg,sr_mod,sd_mod,usb_storage,libata
e1000                 130880  0
xenbus_be              12800  2 blkbk,netbk
xennet                 37512  0
xenblk                 26720  0




-----Original Message-----
From: Ian Pratt [mailto:Ian.Pratt@xxxxxxxxxxxxx] 
Sent: Friday, September 19, 2008 2:44 PM
To: Moritz Möller; xen-devel@xxxxxxxxxxxxxxxxxxx
Cc: Ian Pratt
Subject: RE: [Xen-devel] disk io errors possibly caused by high network load?

> we had a very strange situation yesterday. In one second, 13 of 25 xen
> boxes died with disk errors (domU and dom0, something like end_request:
> I/O error dev hda sector ...), but worked well again after a reboot.
> 
> Some minutes before a technician plugged in a wrong cable, creating a
> network loop - so the error could be caused by a high network io load.
> The disks are okay, and the error occurred with both scsi raid
> controllers and plain sata disks.

This is quite remarkable -- I don't think anyone has reported anything
similar before, despite there being many large xen deployments.

Are you saying that IO errors were reported from both dom0 and the
domU's? 

Did you actually track down the specific device major/minor that was
reporting the error?
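
As a rough illustration of how a device name from an error message could be
mapped to its major/minor pair, here is a minimal sketch that parses text in
the layout of /proc/partitions. The function name and the sample table are
hypothetical; only the four-column layout (major, minor, #blocks, name) is
assumed.

```python
def parse_partitions(text):
    """Map device name -> (major, minor) from /proc/partitions-style text."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and blank lines; data rows have four fields
        # and start with a numeric major number.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        major, minor, _blocks, name = fields
        devices[name] = (int(major), int(minor))
    return devices

# Hypothetical sample, not taken from the affected machines.
sample = """major minor  #blocks  name

   8        0  488386584 sda
   8       16  488386584 sdb
   9        0  488386496 md0
"""

table = parse_partitions(sample)
print(table["sda"], table["sdb"], table["md0"])
```

On a live host the same function could be fed the contents of
/proc/partitions instead of the sample string.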

Is there any network storage (e.g. iSCSI, AOE) in your setup?

Ian 

> Here is some info of a host that crashed:
> 
> root/mmoeller@srv002050:/root$ xm info
> host                   : srv002050
> release                : 2.6.21-2950.fc8xen
> version                : #1 SMP Tue Oct 23 12:23:33 EDT 2007
> machine                : x86_64
> nr_cpus                : 8
> nr_nodes               : 1
> cores_per_socket       : 4
> threads_per_core       : 1
> cpu_mhz                : 1866
> hw_caps                : bfebfbff:20100800:00000000:00000140:0004e3bd:00000000:00000001
> total_memory           : 8190
> free_memory            : 12
> node_to_cpu            : node0:0-7
> xen_major              : 3
> xen_minor              : 2
> xen_extra              : .0
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : unavailable
> cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> cc_compile_by          : root
> cc_compile_domain      : office.bigpoint.net
> cc_compile_date        : Tue Mar 11 13:57:28 CET 2008
> xend_config_format     : 4
> root/mmoeller@srv002050:/root$ uname -r
> 2.6.21-2950.fc8xen
> 
> And here of a host that did not crash:
> 
> root/mmoeller@srv006215:/root$ xm info
> host                   : srv006215
> release                : 2.6.21-2950.fc8xen
> version                : #1 SMP Tue Oct 23 12:23:33 EDT 2007
> machine                : x86_64
> nr_cpus                : 4
> nr_nodes               : 1
> cores_per_socket       : 4
> threads_per_core       : 1
> cpu_mhz                : 2394
> hw_caps                : bfebfbff:20100800:00000000:00000140:0000e3bd:00000000:00000001
> total_memory           : 8190
> free_memory            : 10
> node_to_cpu            : node0:0-3
> xen_major              : 3
> xen_minor              : 2
> xen_extra              : .0
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : unavailable
> cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> cc_compile_by          : root
> cc_compile_domain      : office.bigpoint.net
> cc_compile_date        : Tue Mar 11 13:57:28 CET 2008
> xend_config_format     : 4
> root/mmoeller@srv006215:/root$ uname -r
> 2.6.21-2950.fc8xen
> 
> Does someone have an idea how this could happen?
> 
> 
> Thanks,
> 
> 
> Moritz
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
