We have recently experienced a few issues on Xen 3.3.1 and would
appreciate it if one of you could shed some light on them.
First of all, our system configuration is:
- A dual Xeon 2.5GHz with 16 GB of RAM (8 cores)
- Xen 3.3.1 from the latest xensources, distributed with Linux kernel 2.6.18.8-xen
- Dom0 is CentOS 5.2, upgraded a few days ago to CentOS 5.3
- There are 6 HVM domUs running: 5 with sporadic issues (see below) are Fedora-10 x86_64, and 1 domU (no issue so far) is Windows 2003.
- The 5 Fedora-10 domUs have the latest package upgrades, including Linux kernel 2.6.27.19-170.2.35.fc10.x86_64. They have 2 vCPUs each, between 512 MB and 1 GB of memory, and 30 GB of disk space stored on an internal SATA disk
- Dom0’s vCPU is pinned to core 0 (dom0_vcpus_pin)
- The domUs are visibly sharing cores 1 to 7 (per xm vcpu-list), although no configuration was done to map them to specific CPUs/cores
Now here are our observations:
(1) The Fedora-10 domUs described above are randomly and partially (see below) freezing after running for some hours.
- If there is a pre-existing ssh session on a hung domU, some commands such as ‘ls’, ‘ps’, ‘tail -f <file>’ and ‘free’ can be executed, while commands such as ‘top’ and ‘vmstat’ hang; sometimes no command can be executed at all
- xentop displays 0% activity on a hung domU, although I once observed 100% on another hung one
- There is nothing significant in domU:/var/log/messages, nor in dom0:/var/log/xen/qemu-dm-…
- Nagios running on dom0 doesn’t really pick this condition up, as the hung domUs are still able to answer ping and the Nagios ssh check; note that ssh to a hung domU doesn’t work, although the basic Nagios TCP check on port 22 gets an answer
- Their time is completely off (see the next observation below), with or without ntpd running
- I had the occasion to run ‘free’ on a few of them, and it appears that they had enough free memory, i.e. they were not swapping at all
=> I don’t want to speculate on the potential root cause; nevertheless, what would be the most effective next troubleshooting steps?
  - Force a domU system dump? And then?
  - Dive deep into the dom0 logs, although a quick browse wasn’t successful?
  - Disable most of the processes on one of these domUs to identify whether a user process can cause this issue (may be very time-consuming)?
  - Set the run-level to 3 instead of 5?
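
The Nagios observation above (port 22 answers, yet ssh itself fails) suggests a check that goes one step further than a TCP connect. As a minimal sketch, assuming a hung domU still completes the TCP handshake in its kernel while sshd never gets to send its identification string, something like the following could tell the two states apart. The function name, the timeout value, and the assumption that the banner is the first thing the server sends (as OpenSSH does) are ours and not part of Nagios.

```python
import socket


def ssh_banner_ok(host, port=22, timeout=5.0):
    """Return True only if the server sends an SSH identification string.

    A plain TCP check succeeds as soon as the remote kernel completes the
    handshake; this check also requires the sshd process to respond.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Assumption: the banner ("SSH-2.0-...") arrives first.
            banner = sock.recv(256)
    except OSError:
        # Connection refused, host unreachable, or timed out while waiting.
        return False
    return banner.startswith(b"SSH-")
```

Run from dom0 against each domU, a False result while ping and the plain port check still succeed would match the partial hang described above.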
(2) The 5 Fedora-10 domUs are not keeping their time in sync.
We have read various pages on time management for a Linux domU, but we
haven’t yet found anything conclusive and/or haven’t been able to set
this up properly. The facts are:
- Our dom0 runs ntpd and is perfectly synchronized with external public NTP sources
- We initially tried running ntpd on the Fedora-10 domUs, configured with external public sources, which proved unsuccessful; the time is usually off by a few minutes
- We tried without ntpd; according to our reading this should be the proper configuration, as the domUs’ hardware clock should sync with their dom0’s hardware clock. Alas, this was still unsuccessful: in this case, the domUs end up lagging significantly behind their dom0’s time
- We have read on a few occasions that there is a parameter to set with echo 1 > /proc/sys/xen/independent_wallclock in order to run ntpd on a domU, but /proc/sys/xen doesn’t exist on these Fedora-10 domUs. Is this expected behavior? Should we assume the independent_wallclock setting is only for PV domUs?
- Note that one of the domUs is a 32-bit Windows 2003 server and is perfectly on time, i.e. in sync with its dom0. It runs the default Windows time service, with no ntpd installed
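
To put a number on "lagging behind", the domU and dom0 clocks could be sampled side by side and the drift rate estimated from the first and last sample. This is only a sketch: it assumes paired timestamps in seconds collected by some external means (for instance, running date +%s on both sides at about the same moment), and the function name is hypothetical.

```python
def drift_ppm(samples):
    """Estimate guest clock drift in parts per million.

    samples: list of (dom0_seconds, domu_seconds) pairs, oldest first.
    A negative result means the domU clock runs slower than dom0's.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (ref_start, guest_start), (ref_end, guest_end) = samples[0], samples[-1]
    ref_elapsed = ref_end - ref_start
    if ref_elapsed <= 0:
        raise ValueError("reference clock did not advance")
    guest_elapsed = guest_end - guest_start
    # Relative error of the guest clock against the reference clock.
    return (guest_elapsed - ref_elapsed) / ref_elapsed * 1e6
```

For example, a domU that loses three minutes over one hour of dom0 time, drift_ppm([(0, 0), (3600, 3420)]), comes out at roughly -50000 ppm, far beyond what ntpd is normally able to correct by slewing.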
(3) The 5 Fedora-10 domUs have been installed as HVM domUs, but their kernels appear to see them as PV. This may be a misunderstanding on our side; however, dmesg on the 5 Fedora-10 domUs shows the message:
“Booting paravirtualized kernel on bare hardware”
We just installed an HVM CentOS 5.3 domU, and this time the kernel
boot message “Booting …” doesn’t appear.
Therefore, can we conclude that the presumed HVM Fedora-10 domUs
are in fact PV domUs?
Should /proc/sys/xen be present only on a PV domU, or on any type of
domU?
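
For comparing the Fedora-10 and CentOS domUs, a small script could record which Xen-related kernel interfaces each guest exposes. This is a hedged sketch rather than a definitive PV/HVM test: the helper name is made up, and which of these paths exist depends on how each kernel was built, so a missing entry proves little on its own.

```python
import os


def xen_guest_hints(root="/"):
    """Report which Xen-related kernel interfaces are visible under root."""

    def read(path):
        # Return the stripped file contents, or None if it is unreadable.
        try:
            with open(os.path.join(root, path)) as fh:
                return fh.read().strip()
        except OSError:
            return None

    return {
        # Directory holding independent_wallclock on kernels that have it.
        "proc_sys_xen": os.path.isdir(os.path.join(root, "proc/sys/xen")),
        "proc_xen": os.path.isdir(os.path.join(root, "proc/xen")),
        # Contains "xen" when the kernel exposes the hypervisor in sysfs.
        "hypervisor_type": read("sys/hypervisor/type"),
    }


if __name__ == "__main__":
    for key, value in xen_guest_hints().items():
        print(key, value)
```

Running it on each domU and diffing the output side by side would at least show whether the guests differ in what their kernels expose.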