Re: [Xen-devel] slow live migration / xc_restore on xen4 pvops

Hi Andreas,

This is an interesting bug, to be sure. I think you need to modify the
restore code to get a better idea of what's going on. The file in the Xen
tree is tools/libxc/xc_domain_restore.c. You will see it contains many
DBGPRINTF and DPRINTF calls, some of which are commented out, and some of
which may 'log' at too low a priority level to make it to the log file. For
your purposes you might change them to ERROR calls as they will definitely
get properly logged. One area of possible concern is that our read function
(RDEXACT, which is a macro mapping to rdexact) was modified for Remus to
have a select() call with a timeout of 1000ms. Do I entirely trust it? Not
when we have the inexplicable behaviour that you're seeing. So you might try
mapping RDEXACT() to read_exact() instead (which is what we already do when
building for __MINIOS__).
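
To make the difference concrete, here is a rough sketch of the two read
styles side by side. It is not the libxc code: the names
read_exact_sketch, read_exact_timeout_sketch and the timeout_ms parameter
are made up for illustration, and the real rdexact/read_exact in the tree
may differ in detail. The first function simply blocks in read() until it
has everything; the second waits in select() before each read and gives up
if nothing arrives within the timeout.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain blocking loop: returns 0 once exactly `size` bytes have been
 * read, -1 on error or premature EOF. (Illustrative name, not libxc.) */
static int read_exact_sketch(int fd, void *buf, size_t size)
{
    size_t off = 0;

    while (off < size) {
        ssize_t n = read(fd, (char *)buf + off, size - off);
        if (n == 0) {           /* EOF before we got everything */
            errno = 0;
            return -1;
        }
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            return -1;
        }
        off += (size_t)n;
    }
    return 0;
}

/* Same loop, but wait in select() for up to timeout_ms before each
 * read(). Fails with errno = ETIMEDOUT if no data shows up in time.
 * (Illustrative name and parameter, not libxc.) */
static int read_exact_timeout_sketch(int fd, void *buf, size_t size,
                                     int timeout_ms)
{
    size_t off = 0;

    while (off < size) {
        fd_set rfds;
        struct timeval tv;
        int ready;
        ssize_t n;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (ready == 0) {       /* nothing readable within the timeout */
            errno = ETIMEDOUT;
            return -1;
        }

        n = read(fd, (char *)buf + off, size - off);
        if (n == 0) {
            errno = 0;
            return -1;
        }
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t)n;
    }
    return 0;
}
```

If the restore is fast with the plain loop and slow with the select()
variant, that would point the finger at the timeout path.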

This all assumes you know your way around C code at least a little bit.
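
On the logging side, the idea is just to route debug output through
something that always prints. A minimal sketch, with made-up names
(log_always and ALWAYS_LOG are not libxc identifiers, and the "xc_restore
debug:" prefix is arbitrary; the real DPRINTF and ERROR macros are defined
in the Xen tree and look different):

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: print a prefixed, newline-terminated message to
 * `out` unconditionally and flush so it is not lost in a buffer. */
static void log_always(FILE *out, const char *fmt, ...)
{
    va_list ap;

    fputs("xc_restore debug: ", out);
    va_start(ap, fmt);
    vfprintf(out, fmt, ap);
    va_end(ap);
    fputc('\n', out);
    fflush(out);
}

/* Hypothetical macro: a debug print that cannot be filtered out by
 * priority because it goes straight to stderr. */
#define ALWAYS_LOG(...) log_always(stderr, __VA_ARGS__)
```

Something like ALWAYS_LOG("read %d bytes", n) around the batch reads
would show where the time goes.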

 -- Keir

On 01/06/2010 22:17, "Andreas Olsowski" <andreas.olsowski@xxxxxxxxxxxxxxx>
wrote:

> Hi,
> 
> In preparation for our soon-to-arrive central storage array I wanted to
> test live migration and Remus replication and stumbled upon a problem.
> When migrating a test VM (512 MB RAM, idle) between my 3 servers, two of
> them are extremely slow in "receiving" the VM. There is little to no CPU
> utilization from xc_restore until shortly before migration is complete.
> The same goes for xm restore.
> The xend.log contains:
> [2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:286)
> restore:shadow=0x0, _static_max=0x20000000, _static_min=0x0,
> [2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:305) [xc_restore]:
> /usr/lib/xen/bin/xc_restore 48 43 1 2 0 0 0 0
> [2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) xc_domain_restore
> start: p2m_size = 20000
> [2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) Reloading memory
> pages:   0%
> [2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal
> error: Error when reading batch size
> [2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal
> error: error when buffering batch, finishing
> 
> This is what the log shows once receiving a VM via live migration finally
> finishes. You can see the large gap in the timestamps (21:16:27 to
> 21:20:57, about four and a half minutes).
> The VM is perfectly fine after that; it just takes way too long.
> 
> 
> First off, let me explain my server setup; detailed information on trying
> to narrow down the error follows.
> I have 3 servers running xen4 with 2.6.31.13-pvops as kernel; it's the
> current kernel from jeremy's xen/master git branch.
> The guests are running vanilla 2.6.32.11 kernels.
> 
> The 3 servers differ slightly in hardware, two are Dell PE 2950 and one
> is a Dell R710, the 2950's have 2 Quad-Xeon CPUs (L5335 and L5410), the
> R710 has 2 Quad Xeon E5520.
> All machines have 24gigs of RAM.
> 
> They are called "tarballerina" (E5520), "xenturio1" (L5335) and
> "xenturio2" (L5410).
> 
> Currently I use tarballerina for testing purposes, but I don't consider
> anything in my setup "stable".
> xenturio1 has 27 guests running, xenturio2 25.
> No guest does anything that would even put a dent into the systems
> performance (ldap servers, radius, department webservers, etc.).
> 
> I created a test VM called "hatest" on my current central iSCSI storage;
> it idles around and has 2 VCPUs and 512 MB of RAM.
> 
> First I tested xm save/restore:
> tarballerina:~# time xm restore /var/saverestore-t.mem
> real    0m13.227s
> user    0m0.090s
> sys     0m0.023s
> xenturio1:~# time xm restore /var/saverestore-x1.mem
> real    4m15.173s
> user    0m0.138s
> sys     0m0.029s
> 
> 
> When migrating to xenturio1 or 2, the migration takes 181 to 278
> seconds; when migrating to tarballerina, it takes roughly 30 seconds:
> tarballerina:~# time xm migrate --live hatest 10.0.1.98
> real    3m57.971s
> user    0m0.086s
> sys     0m0.029s
> xenturio1:~# time xm migrate --live hatest 10.0.1.100
> real    0m43.588s
> user    0m0.123s
> sys     0m0.034s
> 
> 
> --- attempt of narrowing it down ----
> My first guess was that since tarballerina had almost no guests running
> that did anything, it could be an issue of memory usage by the tapdisk2
> processes (each dom0 has been mem-set to 4096M).
> I then started almost all VMs that I have on tarballerina:
> tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
> real    0m2.884s
> tarballerina:~# time xm restore /var/saverestore-t.mem
> real    0m15.594s
> 
> 
> I tried this several times; sometimes it took 30+ seconds.
> 
> Then I started 2 VMs that run load- and I/O-generating processes (stress,
> dd, openssl encryption, md5sum).
> But this didn't affect xm restore performance; it was still quite fast:
> tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
> real    0m7.476s
> user    0m0.101s
> sys     0m0.022s
> tarballerina:~# time xm restore /var/saverestore-t.mem
> real    0m45.544s
> user    0m0.094s
> sys     0m0.022s
> 
> I tried several times again; restore took 17 to 45 seconds.
> 
> Then I tried migrating the test VM to tarballerina again; still fast,
> in spite of several VMs, including load- and I/O-generating VMs.
> This ate almost all available RAM.
> CPU times for xc_restore according to the target machine's "top":
> tarballerina -> xenturio1: 0:05:xx, CPU 2-4%, near the end 40%.
> xenturio1 -> tarballerina: 0:04:xx, CPU 4-8%, near the end 54%.
> 
> tarballerina:~# time xm migrate --live hatest 10.0.1.98
> real    3m29.779s
> user    0m0.102s
> sys     0m0.017s
> xenturio1:~# time xm migrate --live hatest 10.0.1.100
> real    0m28.386s
> user    0m0.154s
> sys     0m0.032s
> 
> 
> So my attempt at narrowing the problem down failed; it's neither the free
> memory of the dom0 nor the load, I/O or memory the other domUs utilize.
> ---end attempt---
> 
> More info (xm list, meminfo, table with migration times, etc.) on my
> setup can be found here:
> http://andiolsi.rz.uni-lueneburg.de/node/37
> 
> There was another guy who has the same error in his logfile; this might
> or might not be related:
> http://lists.xensource.com/archives/html/xen-users/2010-05/msg00318.html
> 
> Further information can be given, should demand for it arise.
> 
> With best regards
> 
> ---
> Andreas Olsowski <andreas.olsowski@xxxxxxxxxxxxxxx>
> Leuphana Universität Lüneburg
> System- und Netzwerktechnik
> Rechenzentrum, Geb 7, Raum 15
> Scharnhorststr. 1
> 21335 Lüneburg
> 
> Tel: ++49 4131 / 6771309
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel


