http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=856
Summary: disk failures during access on SATA drives
Product: Xen
Version: unstable
Platform: x86-64
URL: http://bugs.debian.org/406581
OS/Version: Linux
Status: NEW
Severity: major
Priority: P2
Component: Hardware Support
AssignedTo: xen-bugs@xxxxxxxxxxxxxxxxxxx
ReportedBy: bugzilla.xensource.com@xxxxxxxxxxxxxxxxx
Our Xen test machine has two SATA controllers:
01:05.0 Mass storage controller: Promise Technology, Inc. PDC20375 (SATA150 TX2plus) (rev 02)
01:08.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)
and a total of three SATA drives (all SAMSUNG SP2004C) connected to them.
Two are connected to the SATA378 controller (the second one, which is
onboard), and the third to the SATA150 one (a PCI card). The system is an
AMD Opteron running etch and native amd64. Each of the three drives holds
8 partitions, which are assembled into 8 RAID arrays: two RAID1 and six
RAID5.
dmesg output from right after boot can be found at
http://bugs.debian.org/406581, along with lspci, cpuinfo and mdstat
output. Please contact me for more information. I will be away from the
system for the next couple of weeks, but it will be running the non-Xen
kernel and remain accessible; if needed, I can get a colleague to work on
it for you.
The problem occurs sporadically, but only when booting the Xen
kernel. I have not once managed to reproduce it with the
2.6.18-3-amd64 kernel. I can reproduce it with the
2.6.18-3-xen-amd64 kernel more or less at will.
It seems that disk activity triggers it. For instance, booting and
letting a RAID5 array that spans the three disks resynchronise almost
always causes the problem to appear. This is what the log says in such a
case:
kernel: ata3: command timeout
kernel: ata3: no sense translation for status: 0x40
kernel: ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
kernel: ata3: status=0x40 { DriveReady }
kernel: sd 2:0:0:0: SCSI error: return code = 0x08000002
kernel: sdb: Current: sense key: Aborted Command
kernel: Additional sense: No additional sense information
kernel: end_request: I/O error, dev sdb, sector 48044091
kernel: raid5:md4: read error not correctable (sector 41425248 on sdb7).
kernel: raid5: Disk failure on sdb7, disabling device. Operation continuing on 1 devices
kernel: raid5:md4: read error not correctable (sector 41425256 on sdb7).
[... the same message repeats for every 8th sector through 41425920 ...]
kernel: ata4: command timeout
kernel: ata4: no sense translation for status: 0x40
kernel: ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
kernel: ata4: status=0x40 { DriveReady }
kernel: sd 3:0:0:0: SCSI error: return code = 0x08000002
kernel: sdc: Current: sense key: Aborted Command
kernel: Additional sense: No additional sense information
kernel: end_request: I/O error, dev sdc, sector 48043483
kernel: raid5: Disk failure on sdc7, disabling device. Operation continuing on 1 devices
Note that the disk and controller involved vary: one time it's ata3/4
and sdb/c, another time it's ata1/3 and sda/b. The disks themselves
report no SMART errors.
Here is another instance:
kernel: ata4: command timeout
kernel: ata4: no sense translation for status: 0x40
kernel: ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
kernel: ata4: status=0x40 { DriveReady }
kernel: sd 3:0:0:0: SCSI error: return code = 0x08000002
kernel: sdc: Current: sense key: Aborted Command
kernel: Additional sense: No additional sense information
kernel: end_request: I/O error, dev sdc, sector 56772315
kernel: raid5: Disk failure on sdc7, disabling device. Operation continuing on 2 devices
kernel: ata1: command timeout
kernel: ata1: no sense translation for status: 0x40
kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
kernel: ata1: status=0x40 { DriveReady }
kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
kernel: sda: Current: sense key: Aborted Command
kernel: Additional sense: No additional sense information
kernel: end_request: I/O error, dev sda, sector 56772907
kernel: raid5:md4: read error not correctable (sector 50154064 on sda7).
kernel: raid5: Disk failure on sda7, disabling device. Operation continuing on 1 devices
kernel: raid5:md4: read error not correctable (sector 50154072 on sda7).
[... the same message repeats for every 8th sector through 50154144 ...]
Following the above, other partitions report failures and the system
hard-locks. Upon reboot, everything is normal again (the RAID recovery
restarts), and no data appears to be lost.
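One detail worth noting: the failing sectors in the md4 messages above
advance in a fixed stride of 8, which looks like a single timed-out
request failing as one contiguous run rather than scattered media
defects. A small pipeline (a hypothetical helper, not part of the
original report; it reads kernel-log lines on stdin) can check this:

```shell
# Hypothetical log-analysis helper, not from the original report.
# Extracts the failing sector numbers from "read error not correctable"
# lines on stdin and counts gaps in the expected stride-8 sequence;
# zero gaps means one contiguous run of failed reads.
sed -n 's/.*read error not correctable (sector \([0-9][0-9]*\) on.*/\1/p' |
awk 'NR > 1 && $1 != prev + 8 { gaps++ }
     { prev = $1 }
     END { printf "%d sectors, %d gaps\n", NR, gaps + 0 }'
```

Fed the md4 lines from the first incident (sectors 41425248 through
41425920), it reports zero gaps, i.e. one contiguous run of failed reads.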
See below for the list of modules loaded at the time of the crash. Note
that sata_nv is loaded (by udev), but there are no SATA ports other than
the two on-board Promise ports and the two ports on the PCI card; the
sata_nv module can be freely removed.
Modules loaded:
Module Size Used by
bridge 63408 0
netloop 11392 0
tun 16256 0
ipv6 285920 18
ipt_MASQUERADE 8320 1
iptable_nat 12292 1
ipt_REJECT 10112 1
xt_tcpudp 7936 22
ipt_addrtype 6528 1
ipt_LOG 11264 1
xt_limit 7424 1
xt_conntrack 7168 6
ip_nat_ftp 8064 0
ip_nat 24492 3 ipt_MASQUERADE,iptable_nat,ip_nat_ftp
ip_conntrack_ftp 13136 1 ip_nat_ftp
ip_conntrack 63140 6 ipt_MASQUERADE,iptable_nat,xt_conntrack,ip_nat_ftp,ip_nat,ip_conntrack_ftp
nfnetlink 11976 2 ip_nat,ip_conntrack
iptable_filter 7808 1
ip_tables 25192 2 iptable_nat,iptable_filter
x_tables 21896 9 ipt_MASQUERADE,iptable_nat,ipt_REJECT,xt_tcpudp,ipt_addrtype,ipt_LOG,xt_limit,xt_conntrack,ip_tables
dm_crypt 16400 0
psmouse 44560 0
serio_raw 12036 0
i2c_nforce2 12544 0
pcspkr 7808 0
shpchp 42028 0
pci_hotplug 20872 1 shpchp
i2c_core 27776 1 i2c_nforce2
evdev 15360 0
ext3 138256 6
jbd 65392 1 ext3
mbcache 14216 1 ext3
dm_mirror 25344 0
dm_snapshot 20536 0
dm_mod 62928 5 dm_crypt,dm_mirror,dm_snapshot
raid456 123680 7
xor 11024 1 raid456
raid1 27136 2
md_mod 83484 11 raid456,raid1
ide_generic 5760 0 [permanent]
sd_mod 25856 27
ide_disk 20736 6
generic 10756 0 [permanent]
amd74xx 19504 0 [permanent]
ide_core 148224 4 ide_generic,ide_disk,generic,amd74xx
sata_promise 18052 24
tulip 57760 0
libata 107040 2 sata_promise
scsi_mod 153008 2 sd_mod,libata
ehci_hcd 36232 0
ohci_hcd 24964 0
fan 9864 0
I've now run the following loop for several hours on the non-Xen kernel
while resyncing the RAID arrays, and it's still running, so I doubt that
the hardware is at fault:
while :; do
    rsync -a --delete /home/ .; rsync -a --delete /var .;
    rsync -a --delete /tmp .; rsync -a --delete /usr .;
done
Also, to follow up on waldi's reply: my personal amd64 machine has the
PDC20378 as well:
00:08.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
and it also runs a Xen kernel without stability problems. This suggests
that either the SATA150 controller (PDC20375) or some other piece of
hardware in the Opteron system is responsible for the problem.
It does seem SATA-related, however, as the system never had a problem
resyncing an array of PATA disks; only access to the SATA disks triggers
the failures.
--
Configure bugmail:
http://bugzilla.xensource.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
_______________________________________________
Xen-bugs mailing list
Xen-bugs@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-bugs