Re: [Xen-devel] Need help with fixing the Xen waitqueue feature

To:	Olaf Hering <olaf@xxxxxxxxx>
Subject:	Re: [Xen-devel] Need help with fixing the Xen waitqueue feature
From:	Keir Fraser <keir.xen@xxxxxxxxx>
Date:	Tue, 08 Nov 2011 22:54:52 +0000
Cc:	xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date:	Tue, 08 Nov 2011 15:07:46 -0800
In-reply-to:	<20111108222011.GA23969@xxxxxxxxx>

On 08/11/2011 22:20, "Olaf Hering" <olaf@xxxxxxxxx> wrote:

> On Tue, Nov 08, Keir Fraser wrote:
> 
>> On 08/11/2011 21:20, "Olaf Hering" <olaf@xxxxxxxxx> wrote:
>> 
>>> Another thing is that sometimes the host suddenly reboots without any
>>> message. I think the reason for this is that a vcpu whose stack was put
>>> aside and that was later resumed may find itself on another physical
>>> cpu. And if that happens, wouldn't that invalidate some of the local
>>> variables back in the callchain? If some of them point to the old
>>> physical cpu, how could this be fixed? Perhaps a few "volatiles" are
>>> needed in some places.
>> 
>> From how many call sites can we end up on a wait queue? I know we were going
>> to end up with a small and explicit number (e.g., in __hvm_copy()) but does
>> this patch make it a more generally-used mechanism? There will unavoidably
>> be many constraints on callers who want to be able to yield the cpu. We can
>> add Linux-style get_cpu/put_cpu abstractions to catch some of them. Actually
>> I don't think it's *that* common that hypercall contexts cache things like
>> per-cpu pointers. But every caller will need auditing, I expect.
> 
> I haven't started to audit the callers. In my testing
> mem_event_put_request() is called from p2m_mem_paging_drop_page() and
> p2m_mem_paging_populate(). The latter is called from more places.

Tbh I wonder anyway whether stale hypercall context would be likely to cause
a silent machine reboot. Booting with maxcpus=1, or pinning the guest under
test, would eliminate moving between CPUs as a cause of inconsistencies.
Another problem could be sleeping with locks held, but we do test for that
(in debug builds at least) and I'd expect crash/hang rather than silent
reboot. Another problem could be if the vcpu's own state is temporarily
inconsistent/invalid (e.g., its pagetable base pointers) and we then attempt
to restore that state during a waitqueue wakeup. That could certainly cause
a reboot, but I don't know of an example where this might happen.
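
To make the stale-context concern concrete, here is a minimal user-space
sketch in C. It is a simulation only, not Xen code: percpu[], current_cpu,
pin_count and wait_and_resume_on() are hypothetical names invented for the
illustration, and get_cpu()/put_cpu() merely borrow the Linux naming. It
shows a per-cpu pointer cached across a wait going stale once the context
resumes on another physical cpu, and a get_cpu/put_cpu-style guard that lets
the wait path detect a caller that is not allowed to yield. A debug build
would more plausibly assert at that point; the sketch just refuses to sleep.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-cpu data, standing in for any per-cpu state. */
struct percpu_data { int scratch; };
static struct percpu_data percpu[NR_CPUS];

/* Simulated execution context: the physical cpu we are running on and
 * the number of get_cpu() sections that are currently open. */
static unsigned int current_cpu;
static unsigned int pin_count;

/* Linux-style get_cpu(): return the cpu id and mark the context as one
 * that must not yield the cpu until the matching put_cpu(). */
static unsigned int get_cpu(void)
{
    pin_count++;
    return current_cpu;
}

static void put_cpu(void)
{
    assert(pin_count > 0);
    pin_count--;
}

/* Simulated wait: refuse to sleep (return false) while a get_cpu()
 * section is open; otherwise "resume" on new_cpu and return true. */
static bool wait_and_resume_on(unsigned int new_cpu)
{
    if (pin_count != 0)
        return false;
    current_cpu = new_cpu % NR_CPUS;
    return true;
}

/* Problem pattern: a per-cpu pointer is cached across the wait.
 * Returns true only if the cached pointer still refers to the cpu we
 * are running on after resuming. */
static bool cached_pointer_still_valid_after_wait(unsigned int new_cpu)
{
    struct percpu_data *cached = &percpu[current_cpu];

    wait_and_resume_on(new_cpu);
    return cached == &percpu[current_cpu];
}

/* Guarded pattern: the pointer is only used inside a get_cpu()/put_cpu()
 * section, so the wait is refused and the pointer cannot go stale. */
static bool guarded_section_blocks_wait(unsigned int new_cpu)
{
    unsigned int cpu = get_cpu();
    struct percpu_data *p = &percpu[cpu];
    bool slept = wait_and_resume_on(new_cpu);
    bool ok = !slept && p == &percpu[current_cpu];

    put_cpu();
    return ok;
}
```

Starting on cpu 0, cached_pointer_still_valid_after_wait(1) returns false
because the context resumes on cpu 1 while the cached pointer still refers
to cpu 0. guarded_section_blocks_wait(2) returns true because the wait is
refused while the section is open.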

 -- Keir

> My plan is to put the sleep into ept_get_entry(), but I'm not there yet.
> First I want to test waitqueues in a rather simple code path like
> mem_event_put_request().
> 
>> A sudden reboot is very extreme. No message even on a serial line? That most
>> commonly indicates bad page tables. Most other bugs you'd at least get a
>> double fault message.
> 
> There is no output on serial, I boot with this cmdline:
>   vga=mode-normal console=com1 com1=57600 loglvl=all guest_loglvl=all
>   sync_console conring_size=123456 maxcpus=8 dom0_vcpus_pin
>   dom0_max_vcpus=2
> My base changeset is 24003, the testhost is a Xeon X5670  @ 2.93GHz.
> 
> Olaf



