[Xen-users] xen, iscsi and resilience to short network outages

To:	xen-users@xxxxxxxxxxxxxxxxxxx
Subject:	[Xen-users] xen, iscsi and resilience to short network outages
From:	"Steve Feehan" <sfeehan@xxxxxxxxx>
Date:	Thu, 9 Nov 2006 14:41:18 -0500

Hi. Here is the short version:

If dom0 experiences a short (< 120 second) network outage, the guests
whose disks are on iSCSI LUNs get (seemingly) unrecoverable I/O errors.
Is it possible to make Xen more resilient to such problems?

And now the full version:

We're testing Xen on iSCSI LUNs. The hardware/software configuration is:

 * Dom0 and guest OS:  SLES10 x86_64
 * iSCSI LUN on NetApp filer

We connect to the LUN through dom0 and then "map" the device to a guest
like so:

disk = [ 'phy:/dev/disk/by-id/scsi-360a9800043346863483437714a643833,hda,w' ]

At around noon today (though it has happened a few times in the last few
weeks) one of our switches was powered off. Here is what I saw in the
dom0 syslog at that time:

Nov  9 12:06:04 egovxen1 iscsid: connect failed (113)
Nov  9 12:06:13 egovxen1 iscsid: connect failed (113)
Nov  9 12:06:21 egovxen1 iscsid: connect failed (113)
Nov  9 12:06:29 egovxen1 iscsid: connect failed (113)
Nov  9 12:06:38 egovxen1 iscsid: connect failed (113)
Nov  9 12:06:46 egovxen1 iscsid: connect failed (113)
Nov  9 12:06:55 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:03 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:04 egovxen1 kernel: tg3: peth0: Link is up at 1000 Mbps, full duplex.
Nov  9 12:07:04 egovxen1 kernel: tg3: peth0: Flow control is off for TX and off for RX.
Nov  9 12:07:04 egovxen1 kernel: xenbr0: port 2(peth0) entering learning state
Nov  9 12:07:04 egovxen1 kernel: xenbr0: topology change detected, propagating
Nov  9 12:07:04 egovxen1 kernel: xenbr0: port 2(peth0) entering forwarding state
Nov  9 12:07:10 egovxen1 kernel:  session0: iscsi: session recovery timed out after 120 secs
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:2: scsi: Device offlined - not ready after error recovery
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: SCSI error: return code = 0x20000
Nov  9 12:07:10 egovxen1 kernel: end_request: I/O error, dev sdd, sector 23349467
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:2: SCSI error: return code = 0x20000
Nov  9 12:07:10 egovxen1 kernel: end_request: I/O error, dev sdc, sector 6573193
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov  9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov  9 12:07:11 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:20 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:28 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:36 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:44 egovxen1 iscsid: connection0:0 is operational after recovery (19 attempts)

So it looks like the iSCSI connection was dropped at 12:06:04 and reestablished
at 12:07:44. But during this time the guests whose disks were on the iSCSI LUNs
got I/O errors and did not recover. Here is what I got when I connected
to the console:

sfeehan@egovxen1:~> sudo xm console xenlb2
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: Id "1" respawning too fast: disabled for 5 minutes
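
For reference, the timestamps in the dom0 syslog excerpt can be turned into
a rough timeline with a short script. This is only a sketch: the three log
lines are copied from the excerpt (with the mail wrapping undone), while the
function names, the hard-coded year, and the idea that the 120-second timer
starts when the session enters recovery are my assumptions.

```python
import re
from datetime import datetime, timedelta

# Three lines copied from the dom0 syslog excerpt (line wraps undone).
LOG = """\
Nov  9 12:06:04 egovxen1 iscsid: connect failed (113)
Nov  9 12:07:10 egovxen1 kernel:  session0: iscsi: session recovery timed out after 120 secs
Nov  9 12:07:44 egovxen1 iscsid: connection0:0 is operational after recovery (19 attempts)
"""

STAMP = re.compile(r"^(\w{3})\s+(\d+)\s+(\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year=2006):
    """Parse the leading syslog timestamp; syslog omits the year, so it is assumed."""
    m = STAMP.match(line)
    if m is None:
        return None
    month, day, clock = m.groups()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            return parse_stamp(line)
    return None


def outage_timeline(log_text, timeout_secs=120):
    """Summarise the outage; the excerpt must contain all three kinds of message."""
    lines = log_text.splitlines()
    first_failure = first_match(lines, "connect failed")
    timed_out = first_match(lines, "session recovery timed out")
    recovered = first_match(lines, "is operational after recovery")
    return {
        # Seconds from the first logged connect failure to recovery.
        "failure_to_recovery": (recovered - first_failure).total_seconds(),
        # Seconds the devices stayed offline before the session came back.
        "offline_before_recovery": (recovered - timed_out).total_seconds(),
        # Assumption: the timer started timeout_secs before it fired.
        "estimated_drop": timed_out - timedelta(seconds=timeout_secs),
    }


timeline = outage_timeline(LOG)
print(timeline)
```

With these three lines the script reports 100 seconds between the first
logged failure and recovery, and 34 seconds between the devices being
offlined and the session coming back. Subtracting the 120-second timeout
from 12:07:10 gives 12:05:10, so the session may have entered recovery
before the first line of the excerpt. That is an inference from the
timestamps, not something the log states directly.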

Is it possible to adjust a timeout or otherwise make Xen a bit more tolerant of
short network outages? Thanks.

--
Steve Feehan
