Hi,
in preparation for our soon-to-arrive central storage array, I wanted to
test live migration and Remus replication and stumbled upon a problem.
When migrating a test VM (512 MB RAM, idle) between my three servers, two of
them are extremely slow in "receiving" the VM. There is little to no CPU
utilization from xc_restore until shortly before the migration is complete.
The same goes for xm restore.
The xend.log contains:
[2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:286)
restore:shadow=0x0, _static_max=0x20000000, _static_min=0x0,
[2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:305) [xc_restore]:
/usr/lib/xen/bin/xc_restore 48 43 1 2 0 0 0 0
[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) xc_domain_restore
start: p2m_size = 20000
[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) Reloading memory
pages: 0%
[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal
error: Error when reading batch size
[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal
error: error when buffering batch, finishing
These errors appear when receiving a VM via live migration finally
finishes; note the large gap in the timestamps.
The VM is perfectly fine after that, it just takes way too long.
First off, let me explain my server setup; detailed information on my
attempts to narrow down the error follows.
I have three servers running Xen 4 with a 2.6.31.13-pvops kernel, the
current kernel from Jeremy's xen/master git branch.
The guests run vanilla 2.6.32.11 kernels.
The three servers differ slightly in hardware: two are Dell PE 2950s and
one is a Dell R710. The 2950s have two quad-core Xeon CPUs (L5335 and
L5410); the R710 has two quad-core Xeon E5520s.
All machines have 24 GB of RAM.
They are called "tarballerina" (E5520), "xenturio1" (L5335) and
"xenturio2" (L5410).
Currently I use tarballerina for testing purposes, but I don't consider
anything in my setup "stable".
xenturio1 has 27 guests running, xenturio2 has 25.
No guest does anything that would even put a dent into the systems'
performance (LDAP servers, RADIUS, department web servers, etc.).
I created a test VM called "hatest" on my current central iSCSI storage;
it idles around and has 2 VCPUs and 512 MB of RAM.
First I tested xm save/restore:
tarballerina:~# time xm restore /var/saverestore-t.mem
real 0m13.227s
user 0m0.090s
sys 0m0.023s
xenturio1:~# time xm restore /var/saverestore-x1.mem
real 4m15.173s
user 0m0.138s
sys 0m0.029s
When migrating to xenturio1 or 2, the migration takes 181 to 278
seconds; when migrating to tarballerina it takes roughly 30 seconds:
tarballerina:~# time xm migrate --live hatest 10.0.1.98
real 3m57.971s
user 0m0.086s
sys 0m0.029s
xenturio1:~# time xm migrate --live hatest 10.0.1.100
real 0m43.588s
user 0m0.123s
sys 0m0.034s
--- attempt at narrowing it down ---
My first guess was that, since tarballerina had almost no guests running
that did anything, it could be an issue of memory usage by the tapdisk2
processes (each dom0 has been mem-set to 4096 MB).
I then started almost all the VMs that I have on tarballerina:
tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
real 0m2.884s
tarballerina:~# time xm restore /var/saverestore-t.mem
real 0m15.594s
I tried this several times; sometimes it took 30+ seconds.
Then I started two VMs that run load- and I/O-generating processes
(stress, dd, openssl encryption, md5sum).
But this didn't affect xm restore performance; it was still quite fast:
tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
real 0m7.476s
user 0m0.101s
sys 0m0.022s
tarballerina:~# time xm restore /var/saverestore-t.mem
real 0m45.544s
user 0m0.094s
sys 0m0.022s
I tried several times again; restore took 17 to 45 seconds.
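The repeated timings above were taken by hand; as a sketch, a small shell helper (hypothetical, not part of the Xen tools) could run and time such a command several times in a row:

```shell
# Hypothetical helper: run a command N times and print each run's
# wall-clock duration in whole seconds. In practice the command would
# be something like: xm restore /var/saverestore-t.mem
time_runs() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        start=$(date +%s)           # seconds since epoch, before the run
        "$@" >/dev/null 2>&1        # run the command, discard its output
        end=$(date +%s)             # seconds since epoch, after the run
        echo "run $((i + 1)): $((end - start))s"
        i=$((i + 1))
    done
}
```

For example, `time_runs 5 xm restore /var/saverestore-t.mem` would print five durations, making the 17-to-45-second spread easy to spot.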
Then I tried migrating the test VM to tarballerina again; it was still
fast, in spite of several running VMs, including the load- and
I/O-generating ones.
This ate almost all available RAM.
CPU times for xc_restore according to the target machine's top:
tarballerina -> xenturio1: 0:05:xx, CPU 2-4%, near the end 40%.
xenturio1 -> tarballerina: 0:04:xx, CPU 4-8%, near the end 54%.
tarballerina:~# time xm migrate --live hatest 10.0.1.98
real 3m29.779s
user 0m0.102s
sys 0m0.017s
xenturio1:~# time xm migrate --live hatest 10.0.1.100
real 0m28.386s
user 0m0.154s
sys 0m0.032s
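The per-process CPU figures above came from watching top interactively; as a non-interactive sketch, the %CPU of the xc_restore process could be sampled with ps (the helper name and the one-second interval are my assumptions):

```shell
# Hypothetical helper: print the %CPU of a given PID once per second,
# for a fixed number of samples. On the target host, $pid would be the
# PID of the running xc_restore process.
sample_cpu() {
    pid=$1; samples=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        ps -o %cpu= -p "$pid" || break   # stop if the process has exited
        sleep 1
        i=$((i + 1))
    done
}
```

Logging this during a slow migration would show whether xc_restore really sits near-idle for minutes and only spikes at the end, as observed in top.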
So my attempt at narrowing the problem down failed: it is neither the free
memory of the dom0 nor the load, I/O, or memory that the other domUs utilize.
---end attempt---
More info (xm list, meminfo, a table with migration times, etc.) on my
setup can be found here:
http://andiolsi.rz.uni-lueneburg.de/node/37
Someone else had the same error in his logfile; this may or may not be
related:
http://lists.xensource.com/archives/html/xen-users/2010-05/msg00318.html
Further information can be provided, should the need arise.
With best regards
---
Andreas Olsowski <andreas.olsowski@xxxxxxxxxxxxxxx>
Leuphana Universität Lüneburg
System- und Netzwerktechnik
Rechenzentrum, Geb 7, Raum 15
Scharnhorststr. 1
21335 Lüneburg
Tel: ++49 4131 / 6771309
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel