xen-users
[Xen-users] xen, iscsi and resilience to short network outages
Hi. Here is the short version:
If dom0 experiences a short (< 120 second) network outage, the guests
whose disks are on iSCSI LUNs get (seemingly) unrecoverable I/O errors.
Is it possible to make Xen more resilient to such problems?
And now the full version:
We're testing Xen on iSCSI LUNs. The hardware/software configuration is:
* Dom0 and guest OS: SLES10 x86_64
* iSCSI LUN on NetApp filer
We connect to the LUN through dom0 and then "map" the device to a guest
like so:
disk = [ 'phy:/dev/disk/by-id/scsi-360a9800043346863483437714a643833,hda,w' ]
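For readers unfamiliar with the syntax, each disk entry is three comma-separated fields: the backend device in dom0, the device name the guest sees, and the access mode. A minimal sketch that splits the entry above (xm config files use Python syntax; the helper name split_disk_spec is made up for illustration):

```python
# The disk line from the guest config above.
disk = ['phy:/dev/disk/by-id/scsi-360a9800043346863483437714a643833,hda,w']


def split_disk_spec(spec):
    """Split an xm disk entry into backend type, dom0 path, guest device and mode."""
    backend, guest_dev, mode = spec.split(",")
    kind, _, path = backend.partition(":")
    return {"type": kind, "dom0_path": path, "guest_dev": guest_dev, "mode": mode}


for entry in disk:
    print(split_disk_spec(entry))
```

The "phy:" prefix means the guest's hda is backed directly by the dom0 block device, so anything that takes that device offline in dom0 is seen by the guest as a failing disk.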
At around noon today (though it's happened a few times in the last few weeks) one
of our switches was powered off. At that time, here is what I see in
the syslog of dom0:
Nov 9 12:06:04 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:13 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:21 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:29 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:38 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:46 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:55 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:03 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:04 egovxen1 kernel: tg3: peth0: Link is up at 1000 Mbps, full duplex.
Nov 9 12:07:04 egovxen1 kernel: tg3: peth0: Flow control is off for TX and off for RX.
Nov 9 12:07:04 egovxen1 kernel: xenbr0: port 2(peth0) entering learning state
Nov 9 12:07:04 egovxen1 kernel: xenbr0: topology change detected, propagating
Nov 9 12:07:04 egovxen1 kernel: xenbr0: port 2(peth0) entering forwarding state
Nov 9 12:07:10 egovxen1 kernel: session0: iscsi: session recovery timed out after 120 secs
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: scsi: Device offlined - not ready after error recovery
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: SCSI error: return code = 0x20000
Nov 9 12:07:10 egovxen1 kernel: end_request: I/O error, dev sdd, sector 23349467
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: SCSI error: return code = 0x20000
Nov 9 12:07:10 egovxen1 kernel: end_request: I/O error, dev sdc, sector 6573193
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:11 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:20 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:28 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:36 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:44 egovxen1 iscsid: connection0:0 is operational after recovery (19 attempts)
So it looks like the iSCSI connection was dropped at 12:06:04 and re-established
at 12:07:44. But during this time the guests whose disks were on the iSCSI LUNs
got I/O errors and did not recover. Here is what I got when I connected
to the console:
sfeehan@egovxen1:~> sudo xm console xenlb2
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: Id "1" respawning too fast: disabled for 5 minutes
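As a rough cross-check of the timeline, the key timestamps from the dom0 syslog above can be compared with a few lines of Python. This is only a sketch: the year is a placeholder (syslog does not record one), the event labels are made up for illustration, and it assumes the recovery timer ran for exactly the 120 seconds the kernel message reports.

```python
from datetime import datetime, timedelta

# Key entries copied from the dom0 syslog above. The labels are illustrative.
EVENTS = {
    "first connect failure shown": "Nov 9 12:06:04",
    "link back up": "Nov 9 12:07:04",
    "session recovery timed out": "Nov 9 12:07:10",
    "connection operational": "Nov 9 12:07:44",
}
# Taken from "session recovery timed out after 120 secs".
RECOVERY_TIMEOUT = timedelta(seconds=120)


def parse(stamp):
    """Parse a syslog-style timestamp; the year 2000 is an arbitrary placeholder."""
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")


times = {name: parse(stamp) for name, stamp in EVENTS.items()}

# Assumption: the timer started exactly 120 s before it expired.
timer_started = times["session recovery timed out"] - RECOVERY_TIMEOUT
# How long after the devices were offlined the session actually came back.
missed_by = times["connection operational"] - times["session recovery timed out"]
# Time from the estimated timer start until the session was usable again.
session_down = times["connection operational"] - timer_started

print("recovery timer started about:", timer_started.strftime("%H:%M:%S"))
print("session returned after the timeout by:", int(missed_by.total_seconds()), "s")
print("estimated time without a session:", int(session_down.total_seconds()), "s")
```

If that assumption holds, the recovery timer started around 12:05:10, before the first connect failure shown in the excerpt, and the session came back 34 seconds after the devices had already been offlined.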
Is it possible to adjust a timeout or otherwise make Xen a bit more tolerant of
short network outages? Thanks.
--
Steve Feehan
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users