We are using Xen as the hypervisor to set up our private cloud.
The cloud framework is Eucalyptus, and dom0 runs CentOS 5.4.
Sometimes a machine's dom0 becomes unresponsive. The symptoms are:
(1) We can't log into dom0 via SSH; after we type the password, the session just hangs.
(2) We can still ping dom0 successfully.
(3) We can log into the domUs without any problem.
The unresponsive dom0 eventually comes back to life after a period of time, anywhere from half an hour to several hours.
Then we can log into dom0 without problem, and everything works fine except for some oddities:
(1) Some daemons stopped logging during the unresponsive period; the log files have a gap.
(2) Some daemons died during the unresponsive period.
We can't find anything suspicious in the system log (nothing is logged during the period either).
I also redirected the console to com1 and set the Xen loglvl to all; there are no messages during the period there either.
During the unresponsive period I can still switch to the Xen console by pressing Ctrl+a three times, and the Xen console keeps working.
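Since the serial console still responds during the hang, Xen's debug keys can dump hypervisor state at that moment. A sketch of the keystrokes (the available bindings can be listed with 'h'; please verify them on your Xen version):

```
Ctrl+a Ctrl+a Ctrl+a    switch input to the Xen console
h                       list the available debug keys
q                       dump domain and vCPU info
r                       dump the scheduler run queues
d                       dump registers
Ctrl+a Ctrl+a Ctrl+a    switch input back to dom0
```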
We don't do heavy I/O in dom0; it only runs a few daemons such as snmpd.
We are not sure what causes this, but we have found a way to reproduce the same symptoms: heavy I/O inside the VMs.
The following is the test configuration:
(Also tried 3.4.3)
CPU: two Xeon E5620s (2.4 GHz, 6 cores, 12 threads); 1 core is dedicated to dom0.
(The symptom is much easier to reproduce when only 1 core is dedicated to dom0.)
Memory: 2048 MB dedicated to dom0 (the node has 24 GB).
OS: CentOS 5.4, kernel 2.6.18-164 (I have also tried 2.6.18-194, 2.6.18-238, and xenlinux 184.108.40.206).
Disk: two SATA disks (Seagate ST3500630NS, 500 GB), sda and sdb.
sda holds dom0's root and swap; sdb is formatted as an ext3 filesystem and stores the VM images.
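For reference, the dom0 dedication above is done with Xen boot parameters. A sketch of the relevant grub.conf lines, assuming our kernel version and serial setup (file names are examples; adjust paths and versions to your install):

```
# /boot/grub/grub.conf (sketch -- file names are examples)
kernel /xen.gz dom0_mem=2048M dom0_max_vcpus=1 dom0_vcpus_pin console=com1,vga com1=115200,8n1 loglvl=all guest_loglvl=all
module /vmlinuz-2.6.18-164.el5xen ro root=/dev/sda1 console=ttyS0,115200
module /initrd-2.6.18-164.el5xen.img
```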
VMs (I use 3 VMs):
CPU: 4 VCPUs
Disks: 3 files on dom0's sdb, used as the root device, the swap device, and the disk for the I/O test (sda1, sda2, sda3 inside the VM).
OS: CentOS 5.4 base image, with the kernel updated to 220.127.116.11.el5xen (I have also tried xenlinux 18.104.22.168 and 22.214.171.124).
We create an ext3 filesystem on sda3, mount it to a folder, and run vdbench filesystem I/O on that mount point.
The I/O behavior is:
(1) Create 300 files, each 99 MB.
(2) Randomly select files and sequentially write random patterns to them in 64 KB blocks; read the blocks back to verify when done.
(3) There is no rate limit, so the program does I/O as fast as it can.
I can provide the configuration file for the workload if needed.
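The workload above can be sketched in Python (a simplified stand-in for the vdbench run, not the actual vdbench config; the directory path and helper names are made up for illustration):

```python
import os
import random

BLOCK = 64 * 1024  # 64 KB blocks, as in the vdbench workload

def create_files(directory, nfiles, size):
    """Step (1): create `nfiles` files of `size` bytes each."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    paths = []
    for i in range(nfiles):
        path = os.path.join(directory, "f%03d" % i)
        with open(path, "wb") as f:
            f.truncate(size)  # allocate the file (sparse here; vdbench writes real data)
        paths.append(path)
    return paths

def write_then_verify(paths, size, passes):
    """Steps (2)/(3): pick a file at random, sequentially overwrite it
    with a random pattern in 64 KB blocks, then read the blocks back
    and verify. Returns the number of blocks that verified."""
    verified = 0
    nblocks = size // BLOCK
    for _ in range(passes):
        path = random.choice(paths)
        pattern = os.urandom(BLOCK)
        with open(path, "r+b") as f:
            for _ in range(nblocks):
                f.write(pattern)
        with open(path, "rb") as f:
            for _ in range(nblocks):
                if f.read(BLOCK) == pattern:
                    verified += 1
    return verified

# The real run used 300 files of 99 MB each, looping with no rate limit, e.g.:
# paths = create_files("/mnt/iotest", 300, 99 * 1024 * 1024)  # path is hypothetical
# write_then_verify(paths, 99 * 1024 * 1024, passes=10 ** 9)
```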
When running the I/O on only 1 VM, dom0 barely responds; logging in is very hard and rarely succeeds.
When running the I/O on 3 VMs, dom0 gets worse: logging in is not possible (it blocks after the password is typed), and the symptoms described above appear.
I also tried logging into dom0 on the VGA console; it blocks after the password is typed as well.
A session that was already logged in may keep working: I can run top, but as soon as I try to open a file, it blocks.
The image files are attached to the VMs with the "file://" method: dom0 associates each file with a loop device and attaches the loop device to the VM. The Xen manual says this method is no longer recommended, so I have also tried the tap:aio method. dom0 seems fine with tap:aio, but when one VM does heavy I/O on its disk, the other VMs can't perform I/O well; they can't even finish booting.
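For reference, the two attach methods differ only in the disk lines of the domU config file (the image paths below are made-up examples; the syntax follows the xm domain config format):

```
# /etc/xen/vm1.cfg (fragment; image paths are hypothetical)
# "file:" -- dom0 binds each image to a loop device (our original setup)
disk = [ 'file:/vmimages/vm1/root.img,sda1,w',
         'file:/vmimages/vm1/swap.img,sda2,w',
         'file:/vmimages/vm1/data.img,sda3,w' ]

# "tap:aio" -- blktap with asynchronous I/O, bypassing the loop driver:
# disk = [ 'tap:aio:/vmimages/vm1/root.img,sda1,w', ... ]
```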
If any of the above is unclear, please let me know and I will explain in more detail.
Any suggestions or thoughts would be valuable. Thanks for reading.
Xen-users mailing list