   
 
 

xen-users

Re: [Xen-users] "xm save" only works once...

To: Ralph Passgang <ralph@xxxxxxxxxxxxx>
Subject: Re: [Xen-users] "xm save" only works once...
From: Steven Hand <Steven.Hand@xxxxxxxxxxxx>
Date: Mon, 22 Aug 2005 20:45:49 +0100
Cc: Steven.Hand@xxxxxxxxxxxx, xen-users@xxxxxxxxxxxxxxxxxxx
Delivery-date: Mon, 22 Aug 2005 19:44:09 +0000
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: Message from Ralph Passgang <ralph@xxxxxxxxxxxxx> of "Mon, 22 Aug 2005 17:19:42 +0200." <200508221719.42843.ralph@xxxxxxxxxxxxx>
List-help: <mailto:xen-users-request@lists.xensource.com?subject=help>
List-id: Xen user discussion <xen-users.lists.xensource.com>
List-post: <mailto:xen-users@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-users>, <mailto:xen-users-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-users>, <mailto:xen-users-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-users-bounces@xxxxxxxxxxxxxxxxxxx
>On Friday, 19 August 2005 04:14, Steven Hand wrote:
>> >On Monday, 15 August 2005 23:29, Anthony Liguori wrote:
>> >> Steven Hand wrote:
>> >> >>I am using Xen-2.0.7 on a Dual Intel Xeon 2.8GHz system with 4GB of
>> >> >> ram. I am using 2.6.11 as kernel for my domain 0. Domain 0 uses
>> >> >> Debian Sarge with a backported Xen 2.0.7 package (only little changes
>> >> >> to the debian 2.0.6 package, nothing important enough to get
>> >> >> mentioned). All kernels were compiled against vanilla kernels with
>> >> >> xen-patch. The domain U's are using 2.6.11 or 2.4.30 (debian, suse).
>> >> >>
>> >> >>I have no problems within domains and everything is running very
>> >> >> smoothly, except one thing (which was also not working correctly in
>> >> >> xen-2.0.6 for me): I can save a domain with "xm save <domainname>
>> >> >> <suspendfile>" once and I can restore this domain again, but if I try
>> >> >> a second "xm save ..." it simply seems to hang. Nothing happens and
>> >> >> the last thing in the logs are these lines:
>> >> >
>> >> >Is this the same with both 2.4 and 2.6 domUs? I've noticed something
>> >> > similar with 2.0.7 but only with 2.4 domUs ... it would be useful to
>> >> > know if it affects 2.6 also - I'm trying to track it down.
>> >
>> >yes, it's the same with 2.4 and 2.6 domUs...
>> >
>> >> There's a very similar problem in 3.0 that has to do with a race
>> >> condition with the xc_save/Xend interaction.  xc_save thinks it has sent
>> >> the "suspend" command over the pipe and Xend is waiting for it to
>> >> arrive.
>> >
>> >... but after some more testing I noticed another interesting thing. "xm
>> >save" hangs if the suspend file doesn't exist. The first time after a
>> >dom0 reboot it's normally no problem, but if I delete the file and try
>> > "xm save" again, it fails about 95% of the time.
>> >
>> >If I keep the save file and then do an "xm save" and an "xm restore",
>> > there seems to be no problem. I ran 10 tests and all worked.
>>
>> Fix attached below - it's actually nothing to do with whether the file
>> exists or not. Rather the problem is that on occasion xfrd sends a response
>> and a request in the same 'message', and Xend only deals with the first.
>>
>> The below fixes this for me - please let me know if it works for you,
>
>I can't test it right now, because the server is in production use. I have
>to schedule a maintenance window to reboot the system (which is needed if
>the problem is not fixed and an "xm save" crashes).

Ok (although I'm confident the fix is a strict stability improvement - I 
stress tested over 15,000 save/restore cycles at a variety of frequencies
without a single problem). 

But then again, it's your server :-) 

The problem was a race condition, so timing (and concurrency at the
hardware level) is likely to affect the probability of it occurring.
So e.g. SMP versus not, or slow versus fast machine, or anything like this
could increase the chance you'd see it.
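
To illustrate the kind of bug involved (a response and a request arriving
in one read, with only the first being handled), here is a minimal Python
sketch. It is not the actual patch: the newline framing, the "(ok)" and
"(suspend)" message strings, and the names handle_first_only and
MessageReader are all assumptions made up for the example, and the real
xfrd/Xend wire format may well differ.

```python
# Illustrative only: the real xfrd/Xend wire format is not shown in this
# thread. Here each message is ASSUMED to be one newline-terminated line.


def handle_first_only(chunk):
    """Buggy pattern: treat one read from the pipe as exactly one message."""
    first, _, _rest = chunk.partition(b"\n")
    # Anything after the first newline (_rest) is silently dropped, so a
    # request that arrived together with a response is never seen.
    return [first.decode()]


class MessageReader:
    """Safer pattern: buffer incoming bytes and return every complete message."""

    def __init__(self):
        self._buf = b""  # bytes received so far that do not yet end a message

    def feed(self, chunk):
        """Add newly read bytes; return the list of complete messages."""
        self._buf += chunk
        messages = []
        while b"\n" in self._buf:
            line, _, self._buf = self._buf.partition(b"\n")
            messages.append(line.decode())
        return messages


if __name__ == "__main__":
    # Hypothetical case: a response and a request delivered in a single read.
    chunk = b"(ok)\n(suspend)\n"
    print(handle_first_only(chunk))     # only the first message is handled
    print(MessageReader().feed(chunk))  # both messages are handled
```

Whether both messages land in one read depends on scheduling and timing,
which is why a reader that only handles the first one fails intermittently
rather than every time.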

>I'll let you know if I can test the patch on the production system (or another
>smp/ht system), but that could take a few more days... sorry.

No probs - the fix is in 2.0-testing but that also includes a bunch of 
other stuff, so probably best to just apply that patch locally. 

cheers,

S.
