xen-users
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux ker
Have you tried using the MegaRAID monitor to see whether you can
diagnose a hardware problem with the RAID? There is one
you can download and run on the Linux dom0, and there should be a monitor
you can reach from the BIOS as well. Those error messages look very
much like an actual hardware fault on the RAID array.
I have a lot of megasas RAID hardware under both SL5 and SL6 and have used it
for Xen dom0 and KVM VM hosts without problems, across several different versions
of Xen.
Steve Timm
On Tue, 18 Oct 2011, David Della Vecchia wrote:
I've tried Debian stable and testing, and CentOS 5 and 6, with Xen 3.1 through 4.1 (about
5 different versions in between). I'm currently running the Xen 4.1.1 release on
CentOS 6 with M.A.Young's centos6 xen dom0 kernel. For some reason the RAID
array freaks out and the entire virtual device that the hardware RAID array
provides switches to read-only mode. I've tried both RAID 0 and RAID 1 (2 1TB
SCSI drives). I've had this issue in every Xen install I've tried on this
box, no matter which kernel version (as new as 3.0.1 in Debian wheezy)
or Xen version (I compiled and installed the unstable branch to test) I use.
The server was running stable and fine for about a week this time before
this:
[root@gibson ~]# df -h
-bash: /bin/df: Input/output error
[root@gibson ~]# w
-bash: /usr/bin/w: Input/output error
[root@gibson ~]# modinfo megasas_raid
-bash: /sbin/modinfo: Input/output error
Part of /var/log/messages:
Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:10 gibson kernel: megasas: reset successful
Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:20 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:41 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:41 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:21:42 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:22:02 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:22:02 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
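[Editorial sketch, not part of the original message.] The excerpt above
repeats the same reset notice many times, so a tally per device and command
can make the pattern easier to see. The Python below is a minimal sketch
under one assumption: that the cmd= field is the first byte of the SCSI
command (the opcode) printed in hex. If that holds, cmd=0 would be TEST UNIT
READY and cmd=2a would be WRITE(10). The function name summarize_resets and
the OPCODES table are illustrative and come from neither the driver nor the
original post.

```python
import re
from collections import Counter

# Matches the reset notice, e.g.
#   "sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0"
# \s+ before "cmd=" also tolerates a line wrap at that position.
RESET_RE = re.compile(
    r"\[(?P<dev>\w+)\] megasas: RESET -(?P<serial>\d+)\s+cmd=(?P<op>[0-9a-fA-F]+)"
)

# Standard SCSI opcodes; only the two values seen in the excerpt are listed.
# ASSUMPTION: cmd= is the opcode byte in hexadecimal.
OPCODES = {0x00: "TEST UNIT READY", 0x2A: "WRITE(10)"}


def summarize_resets(log_text):
    """Count reset notices per (device, opcode name)."""
    counts = Counter()
    for match in RESET_RE.finditer(log_text):
        opcode = int(match.group("op"), 16)
        name = OPCODES.get(opcode, hex(opcode))  # fall back to the raw value
        counts[(match.group("dev"), name)] += 1
    return counts


# Usage: pass the contents of the log file.
#   with open("/var/log/messages") as fh:
#       for (dev, op), n in summarize_resets(fh.read()).items():
#           print(dev, op, n)
```

A tally dominated by one device only narrows down where to look. It does not
by itself distinguish a controller, disk, or driver problem.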
[root@gibson ~]# ls -al /bin/
ls: cannot access /bin/ntfs-3g.secaudit: Input/output error
ls: cannot access /bin/ntfstruncate: Input/output error
ls: cannot access /bin/ntfsdump_logfile: Input/output error
ls: cannot access /bin/ntfsls: Input/output error
ls: cannot access /bin/ntfsdecrypt: Input/output error
ls: cannot access /bin/ntfs-3g.usermap: Input/output error
ls: cannot access /bin/ntfsmount: Input/output error
ls: cannot access /bin/ntfsfix: Input/output error
ls: cannot access /bin/ntfscluster: Input/output error
total 8192
dr-xr-xr-x. 2 root root 4096 Oct 15 14:49 .
drwxr-xr-x. 29 root root 4096 Oct 17 12:34 ..
-rwxr-xr-x. 1 root root 123 Nov 10 2010 alsaunmute
-rwxr-xr-x 1 root root 27808 May 30 10:55 arch
lrwxrwxrwx. 1 root root 4 Oct 13 10:36 awk -> gawk
-rwxr-xr-x 1 root root 26264 May 30 10:55 basename
-rwxr-xr-x 1 root root 943248 May 30 11:46 bash
-rwxr-xr-x 1 root root 51344 May 30 10:55 cat
-rwxr-xr-x 1 root root 12200 Jun 25 05:02 cgclassify
-rwxr-xr-x 1 root root 12352 Jun 25 05:02 cgcreate
-rwxr-xr-x 1 root root 11528 Jun 25 05:02 cgdelete
-rwsr-xr-x 1 root root 12136 Jun 25 05:02 cgexec
-rwxr-xr-x 1 root root 15760 Jun 25 05:02 cgget
-rwxr-xr-x 1 root root 13160 Jun 25 05:02 cgset
-rwxr-xr-x 1 root root 55472 May 30 10:55 chgrp
-rwxr-xr-x 1 root root 52472 May 30 10:55 chmod
-rwxr-xr-x 1 root root 57496 May 30 10:55 chown
-rwxr-xr-x 1 root root 122344 May 30 10:55 cp
-rwxr-xr-x 1 root root 136096 Nov 10 2010 cpio
lrwxrwxrwx. 1 root root 4 Oct 13 11:00 csh -> tcsh
-rwxr-xr-x 1 root root 45472 May 30 10:55 cut
-rwxr-xr-x 1 root root 109896 Aug 18 2010 dash
-rwxr-xr-x 1 root root 59552 May 30 10:55 date
-rwxr-xr-x 1 root root 12552 Jun 25 06:47 dbus-cleanup-sockets
-rwxr-xr-x. 1 root root 339048 Jun 25 06:47 dbus-daemon
-rwxr-xr-x 1 root root 18464 Jun 25 06:47 dbus-monitor
-rwxr-xr-x 1 root root 22376 Jun 25 06:47 dbus-send
-rwxr-xr-x 1 root root 10912 Jun 25 06:47 dbus-uuidgen
-rwxr-xr-x 1 root root 54040 May 30 10:55 dd
-rwxr-xr-x 1 root root 70256 May 30 10:55 df
-rwxr-xr-x 1 root root 9896 Jun 25 02:46 dmesg
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 dnsdomainname -> hostname
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 domainname -> hostname
-rwxr-xr-x 1 root root 81120 Nov 11 2010 dumpkeys
-rwxr-xr-x 1 root root 27648 May 30 10:55 echo
-rwxr-xr-x 2 root root 53352 Nov 11 2010 ed
-rwxr-xr-x 1 root root 106528 Aug 25 2010 egrep
-rwxr-xr-x 1 root root 26368 May 30 10:55 env
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 ex -> vi
-rwxr-xr-x 1 root root 24592 May 30 10:55 false
-rwxr-xr-x 1 root root 71328 Aug 25 2010 fgrep
-rwxr-xr-x 1 root root 238640 Nov 11 2010 find
-rwxr-xr-x 1 root root 382456 Nov 11 2010 gawk
-rwxr-xr-x 1 root root 33416 Nov 11 2010 gettext
-rwxr-xr-x 1 root root 110160 Aug 25 2010 grep
lrwxrwxrwx. 1 root root 3 Oct 13 10:36 gtar -> tar
-rwxr-xr-x. 1 root root 61 Nov 11 2010 gunzip
-rwxr-xr-x 1 root root 68544 Nov 11 2010 gzip
-rwxr-xr-x 1 root root 16192 Aug 24 2010 hostname
-rwxr-xr-x 1 root root 14872 Jun 25 00:09 ipcalc
lrwxrwxrwx. 1 root root 20 Oct 13 10:36 iptables-xml -> /sbin/iptables-multi
-rwxr-xr-x 1 root root 11248 Nov 11 2010 kbd_mode
-rwxr-xr-x 1 root root 24648 Aug 22 2010 keyctl
-rwxr-xr-x 1 root root 15128 Jun 25 02:46 kill
-rwxr-xr-x 1 root root 26256 May 30 10:55 link
-rwxr-xr-x 1 root root 49568 May 30 10:55 ln
-rwxr-xr-x 1 root root 112136 Nov 11 2010 loadkeys
-rwxr-xr-x 1 root root 30992 Jun 25 02:46 login
-rwxr-xr-x 1 root root 58368 Sep 12 13:32 lowntfs-3g
-rwxr-xr-x 1 root root 111744 May 30 10:55 ls
-rwxr-xr-x 1 root root 14008 Jun 25 05:02 lscgroup
-rwxr-xr-x 1 root root 12488 Jun 25 05:02 lssubsys
lrwxrwxrwx. 1 root root 5 Oct 13 10:37 mail -> mailx
-rwxr-xr-x 1 root root 390360 Aug 22 2010 mailx
-rwxr-xr-x 1 root root 48544 May 30 10:55 mkdir
-rwxr-xr-x 1 root root 32352 May 30 10:55 mknod
-rwxr-xr-x 1 root root 37352 May 30 10:55 mktemp
-rwxr-xr-x 1 root root 41144 Jun 25 02:46 more
-rwsr-xr-x. 1 root root 74712 Jun 25 02:46 mount
-rwxr-xr-x 1 root root 9800 Aug 24 2010 mountpoint
-rwxr-xr-x 1 root root 111536 May 30 10:55 mv
-rwxr-xr-x 1 root root 177360 Nov 12 2010 nano
-rwxr-xr-x 1 root root 127816 Aug 24 2010 netstat
-rwxr-xr-x 1 root root 28816 May 30 10:55 nice
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 nisdomainname -> hostname
-rwxr-xr-x 1 root root 53576 Sep 12 13:32 ntfs-3g
-rwxr-xr-x 1 root root 11016 Sep 12 13:32 ntfs-3g.probe
-?????????? ? ? ? ? ? ntfs-3g.secaudit
-?????????? ? ? ? ? ? ntfs-3g.usermap
-rwxr-xr-x 1 root root 29896 Sep 12 13:32 ntfscat
-rwxr-xr-x 1 root root 32992 Sep 12 13:32 ntfsck
-?????????? ? ? ? ? ? ntfscluster
-rwxr-xr-x 1 root root 36320 Sep 12 13:32 ntfscmp
-?????????? ? ? ? ? ? ntfsdecrypt
-?????????? ? ? ? ? ? ntfsdump_logfile
-?????????? ? ? ? ? ? ntfsfix
-rwxr-xr-x 1 root root 57240 Sep 12 13:32 ntfsinfo
-?????????? ? ? ? ? ? ntfsls
-rwxr-xr-x 1 root root 30448 Sep 12 13:32 ntfsmftalloc
l?????????? ? ? ? ? ? ntfsmount
-rwxr-xr-x 1 root root 34000 Sep 12 13:32 ntfsmove
-?????????? ? ? ? ? ? ntfstruncate
-rwxr-xr-x 1 root root 42240 Sep 12 13:32 ntfswipe
-rwsr-xr-x 1 root root 41432 Nov 11 2010 ping
-rwsr-xr-x 1 root root 36256 Nov 11 2010 ping6
-rwxr-xr-x 1 root root 35640 Oct 31 2010 plymouth
-rwxr-xr-x 1 root root 86776 Nov 11 2010 ps
-rwxr-xr-x 1 root root 31656 May 30 10:55 pwd
-rwxr-xr-x 1 root root 11528 Jun 25 02:46 raw
-rwxr-xr-x 1 root root 40056 May 30 10:55 readlink
-rwxr-xr-x 2 root root 53352 Nov 11 2010 red
-rwxr-xr-x. 1 root root 576 Apr 16 2008 redhat_lsb_init
-rwxr-xr-x 1 root root 57504 May 30 10:55 rm
-rwxr-xr-x 1 root root 40544 May 30 10:55 rmdir
lrwxrwxrwx. 1 root root 4 Oct 13 10:39 rnano -> nano
-rwxr-xr-x 1 root root 29904 Nov 11 2010 rpm
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rvi -> vi
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rview -> vi
-rwxr-xr-x 1 root root 72248 Aug 22 2010 sed
-rwxr-xr-x 1 root root 42312 Nov 11 2010 setfont
-rwxr-xr-x 1 root root 23600 Aug 22 2010 setserial
lrwxrwxrwx. 1 root root 4 Oct 13 10:36 sh -> bash
-rwxr-xr-x 1 root root 27880 May 30 10:55 sleep
-rwxr-xr-x 1 root root 99000 May 30 10:55 sort
-rwxr-xr-x 1 root root 65864 May 30 10:55 stty
-rwsr-xr-x 1 root root 36440 May 30 10:55 su
-rwxr-xr-x 1 root root 25464 May 30 10:55 sync
-rwxr-xr-x 1 root root 384920 Nov 11 2010 tar
-rwxr-xr-x 1 root root 14808 Jun 25 02:46 taskset
-rwxr-xr-x 1 root root 391288 Jun 25 02:05 tcsh
-rwxr-xr-x 1 root root 51952 May 30 10:55 touch
-rwxr-xr-x. 1 root root 11392 Nov 11 2010 tracepath
-rwxr-xr-x. 1 root root 12304 Nov 11 2010 tracepath6
-rwxr-xr-x 1 root root 57384 Nov 11 2010 traceroute
lrwxrwxrwx. 1 root root 10 Oct 13 10:39 traceroute6 -> traceroute
-rwxr-xr-x 1 root root 24592 May 30 10:55 true
-rwsr-xr-x. 1 root root 49280 Jun 25 02:46 umount
-rwxr-xr-x 1 root root 27808 May 30 10:55 uname
-rwxr-xr-x. 1 root root 2555 Nov 11 2010 unicode_start
-rwxr-xr-x. 1 root root 363 Nov 11 2010 unicode_stop
-rwxr-xr-x 1 root root 26264 May 30 10:55 unlink
-rwxr-xr-x 1 root root 10208 Jun 25 00:09 usleep
-rwxr-xr-x 1 root root 771800 Jun 25 04:43 vi
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 view -> vi
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 ypdomainname -> hostname
-rwxr-xr-x. 1 root root 62 Nov 11 2010 zcat
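[Editorial sketch, not part of the original message.] In the listing above,
the entries shown with question marks are the ones whose metadata ls could
not read. The same check can be scripted so that it can be repeated across
several directories. This is a minimal Python sketch; the function name
find_unreadable is illustrative, and it assumes the Python interpreter itself
is still readable from disk or already cached in memory.

```python
import errno
import os


def find_unreadable(directory):
    """Return (name, errno_name) for entries whose metadata cannot be read."""
    bad = []
    for name in sorted(os.listdir(directory)):
        try:
            # lstat does not follow symlinks, so a dangling link is not an error
            os.lstat(os.path.join(directory, name))
        except OSError as exc:
            bad.append((name, errno.errorcode.get(exc.errno, str(exc.errno))))
    return bad


# Usage: on the affected box, entries such as ntfsfix would be expected
# to appear with EIO.
#   for name, err in find_unreadable("/bin"):
#       print(name, err)
```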
Here is the rough partition information for my main drive:
/boot primary ext3 1gb /dev/sda1
/dev/sda2 extended lvm pv 925gb
vg_gibson lvm-volumegroup 925gb
/ lv_root ext3 36gb
swap lv_swap 2gb
Server Specs:
Dell PowerEdge R710
32GB ECC Unbuffered RAM
2x Intel Xeon Quad Core HT 2.3GHz (16 "cores" total)
2x 1TB WD SCSI Drives in RAID-1
Drive Nitty Gritty:
Product ID: WDC WD1002FBYS-0
Revision: 0C06
Size: 953344MB
Here's some more information about the RAID controller, also obtained from the
RAID controller config utility:
Product Name: PERC 6/i
Package: 6.2.0-0013
FW Version: 1.22.02-0612
BIOS Version: 2.04.00
CtrlR Version: 1.02-015B
Boot Block: 1.00.00.01-0011
Application & OS Specs:
CentOS 6 w/2.6.32-131 M.A.Young centos6 xen dom0 kernel
Diagnostic Attempts and Results:
I've done a consistency check on the RAID array and everything comes back as
clean and optimal. I've run checks for bad blocks, partition table corruption,
and MBR corruption, everything I can think of. It all comes back as clean and
working fine. Because of these results I have not been able to force my
dedicated hosting company to replace any of the hardware. They are upgrading
the RAID controller software, as it's about 1 minor version out of date, just
to see if that could be the issue. I'll report back if that mysteriously
fixes it, but I'm not holding my breath.
I've read somewhere that the 2.6.x kernels have an old version of the
megaraid_sas module that will cause problems, but the version included in the
M.A.Young centos6 kernel is version 5.3, which is far beyond the 4.3 version
that article recommends upgrading to, so I'm really at a loss. Besides the
version being so new, the problem described in that article (the kernel not
finding the drive at all on boot) is not the issue I'm having. It just
freaks out randomly (I'm sure it's not really random, it just appears that
way), the OS switches to read-only mode, and the only way to reboot is
basically to push the button on the front of the box.
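[Editorial sketch, not part of the original message.] To catch the switch to
read-only mode as it happens, rather than discovering it through failing
commands, the mount table can be polled. This is a minimal Python sketch. It
assumes the filesystem is being remounted read-only after I/O errors and that
/proc/mounts is still readable at that point. The function name
read_only_mounts is illustrative.

```python
def read_only_mounts(mounts_text):
    """Return (device, mountpoint) pairs whose option list contains 'ro'.

    Expects text in /proc/mounts format:
        device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        # Split on commas so that options like "errors=remount-ro" do not match
        if "ro" in options.split(","):
            result.append((device, mountpoint))
    return result


# Usage:
#   with open("/proc/mounts") as fh:
#       print(read_only_mounts(fh.read()))
```

Some mounts are legitimately read-only on a healthy system, so the useful
signal is the root filesystem or /boot appearing in the result when it was
not there before.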
Please, if anyone can direct me towards a solution, or at least down a path I
have yet to try, I would greatly appreciate it. I'm at my wits' end. I've been
fighting this mysterious monster for over a month now, and it always seems to
strike right before I'm about to go live with my services (the first time it
happened was right after I started adding customers to the box).
Thanks in advance,
David
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users