Strangely, not a single non-Xen machine went down.
I will set up two Xen machines on a switch with a loop and see what I'll
From: James Harper [mailto:james.harper@xxxxxxxxxxxxxxxx]
Sent: Friday, September 19, 2008 3:22 PM
To: Moritz Möller; Ian Pratt; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] disk io errors possibly caused by high network
> We rebooted the machines really quickly because it was a production
> system, so I didn't have time to copy the logs, and on the disks I
> see nothing about this in the logfiles, probably because the I/O was
> already down.
> The machines are Supermicro, Intel Xeon Quad or Dual-Quadcore, 8 to 32
> GB RAM. Some have an mdraid setup with two SATA drives on the on-board
> SATA controller (Intel ICH); others have a dedicated 3ware / AMCC
> 9660 or similar.
> The machines that crashed were on different power lines and connected
> to different switches, although on the same network segment. Also,
> there was no physical interference.
> The error was reported by both domU and dom0, each saying the disk
> gave an I/O error, but with no specific information.
> The network card is an Intel e1000.
The error wasn't a timeout, was it? We had a similar problem under
Windows (no Xen involved at all) where the switch the server was plugged
into was looped back to itself one evening. Any broadcast packet sent to
the switch would just circulate around the switch indefinitely, until
there were enough broadcast packets looping around that everything
ground to a halt.
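
To put rough numbers on that, here is a toy model of the build-up. It is only a sketch under stated assumptions: no spanning tree or storm control, nothing ever dropped from the loop, and each new broadcast leaving two copies circulating (one in each direction of the looped cable). The function names and all the figures are made up for illustration, not measured.

```python
def circulating_frames(new_broadcasts_per_sec: int, seconds: int,
                       copies_per_broadcast: int = 2) -> int:
    """Frames stuck in the loop after `seconds`.

    Assumption: every broadcast leaves `copies_per_broadcast` copies
    circulating forever, and none are ever dropped.
    """
    return new_broadcasts_per_sec * seconds * copies_per_broadcast


def frames_seen_per_host(circulating: int, laps_per_sec: int,
                         link_limit_fps: int) -> int:
    """Frames per second delivered to each host port.

    Each circulating frame is flooded to every port once per lap,
    but a port cannot deliver more than its link rate allows.
    """
    return min(circulating * laps_per_sec, link_limit_fps)


if __name__ == "__main__":
    # Hypothetical figures: 5 broadcasts/s (ARP etc.), 1000 laps/s,
    # and a link that tops out at 100000 frames/s.
    for t in (1, 10, 60):
        stuck = circulating_frames(5, t)
        print(t, stuck, frames_seen_per_host(stuck, 1000, 100_000))
```

Even with a handful of broadcasts per second, the host links in this model are pinned at their limit within seconds, and every one of those frames has to be handled by each attached machine.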
The server was an HP DL380, so a more than capable machine, but there
were enough interrupts occurring due to a completely saturated network
that everything was reporting timeouts. In this case the server didn't
require a reboot. It sat in that state the whole night, reporting disk
timeouts etc., but the moment we rectified the cabling fault in the
morning it instantly bounced back to life.
It could be that Linux treats timeout errors a little more severely?
Can anyone say if the layer above blkfront in the Linux kernel will
report timeouts? Or would the errors have been coming through from Dom0?
Anyway, do you have a test environment you can reproduce the problem on?
If the problem is as simple as a looped switch then it shouldn't be too
hard to reproduce...