
To:	xen-devel@xxxxxxxxxxxxxxxxxxx
Subject:	[Xen-devel] "Stall" when booting OS/2 in an AMD-V HVM - regression compared to 3.0.2
From:	"Trolle Selander" <trolle.selander@xxxxxxxxx>
Date:	Tue, 14 Nov 2006 14:28:08 +0100
Delivery-date:	Tue, 14 Nov 2006 05:28:16 -0800

For the last few weeks, I've been trying to figure out why OS/2 stalls early in the boot process. The situation with 3.0.3 and the current xen-unstable is a regression compared to 3.0.2, which got well into the config.sys parsing/driver-loading phase but got stuck when trying to load block device drivers. Both the floppy and the IDE drivers caused that, probably because of the ancient version of qemu-dm that was part of that release.

The "superficial" symptoms of the current hang are that the boot process stalls very early (but not before entering protected mode and printing "OS/2" in the upper left corner, for those familiar with the OS/2 boot process), and qemu-dm starts eating 100% CPU in dom0. Moreover, if using SDL output, the SDL window will "hang" and become a black hole for mouse/keyboard input. There's no way to make it "release" the mouse and keyboard other than killing the OS/2 domain.

Furthermore, on a single-CPU machine, xentrace output will stop completely when the stall is hit, but will resume once the stalled domain is destroyed. On a dual-core machine, however, xentrace output does not cease.

The more detailed "behind the scenes" chain of events, as far as I've been able to figure out (disclaimer: I don't know how to step-debug things at this level, so these are my conclusions from adding various amounts of trace code and printks to follow the execution path), is this:

The first thing that happens is a VMEXIT_IOIO from a write to port 0x71.
It's writing 0x02 to RTC CMOS register 0x0b. As far as I can tell, there's nothing strange about that - 0x02 is the "default" value for that register. As a side note, in the 3.0.3 release, the stall was at another RTC-related write (0x8d to port 0x70), but after the RTC emulation got moved from qemu-dm into the hypervisor in changeset 11817, the stall moved to a few I/O operations later, to the point where it's sticking now.
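
For readers who want to decode the two writes mentioned above, here is a minimal sketch of the standard PC CMOS/RTC index/data port pair (index on port 0x70, data on port 0x71). It is only a toy model, not Xen's or qemu-dm's emulation code; the class and function names are made up for illustration, and the bit names follow the usual MC146818-style layout of register B.

```python
# Toy model of the CMOS/RTC index (0x70) and data (0x71) ports.
# CmosModel and decode_reg_b are hypothetical names, not Xen code.

# Register B (index 0x0b) bit names in the usual MC146818 layout.
REG_B_FLAGS = {
    0x80: "SET",
    0x40: "PIE",
    0x20: "AIE",
    0x10: "UIE",
    0x08: "SQWE",
    0x04: "DM",
    0x02: "24-hour",
    0x01: "DSE",
}


class CmosModel:
    def __init__(self):
        self.index = 0             # register selected through port 0x70
        self.nmi_disabled = False  # bit 7 of the value written to port 0x70
        self.regs = [0] * 128      # 128 bytes of CMOS register space

    def outb(self, port, value):
        if port == 0x70:
            # Bit 7 masks NMI, the low 7 bits select the register.
            self.nmi_disabled = bool(value & 0x80)
            self.index = value & 0x7F
        elif port == 0x71:
            # A data write lands in whichever register was selected last.
            self.regs[self.index] = value & 0xFF
        else:
            raise ValueError("not a CMOS port: %#x" % port)


def decode_reg_b(value):
    """Return the names of the register B bits set in value."""
    return [name for bit, name in REG_B_FLAGS.items() if value & bit]
```

Under this model, the 3.0.3-era write of 0x8d to port 0x70 selects register 0x0d with NMI masked, and the current write of 0x02 to register 0x0b sets only the 24-hour bit.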

Anyway, the path through svm_vmexit_handler -> svm_io_instruction -> RTC emulation etc. all appears to process OK. It's what happens next that's interesting: once it goes back to svm_asm_do_resume, it never gets back from svm_test_all_events/svm_process_softirqs. Initially, I thought it was a race, looping over svm_test_all_events -> svm_process_softirqs with a TIMER_SOFTIRQ getting raised fast enough that it never got out of the do_softirq loop. Now I'm not so sure, since it appears that the looping I'm seeing is not actually happening in the HVM domain's context, except for the very first time it enters do_softirq. After the first SCHEDULE_SOFTIRQ, it appears the HVM domain simply never gets scheduled again.
Another thing I can see is that do_softirq gets called three times without ever "returning". The first time is in the HVM's context, when it gets called from exits.S. Next, it's called twice with "current" pointing to the idle domain, and then finally it's called with current pointing to dom0, and that time it gets out of the loop and "returns" normally. The "non-returning" occasions are all when it calls the handler for SCHEDULE_SOFTIRQ, which never "returns" - and which may be the proper and as-designed behavior, for all I know.
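
To make the "enters but never returns" observation concrete, here is a toy sketch of a pending-bitmask softirq loop in which one handler switches away instead of returning. This is an assumption-laden illustration, not Xen's do_softirq: the softirq numbering, the handler table, and the use of an exception to stand in for a context switch are all hypothetical, and it does not claim that this is how the real scheduler handler behaves.

```python
# Toy softirq loop. The numbering, the handlers and the SwitchedAway
# exception are hypothetical stand-ins, not Xen code.

TIMER, SCHEDULE = 0, 1  # made-up softirq numbers


class SwitchedAway(Exception):
    """Stands in for a handler that switches context instead of returning."""


def timer_handler():
    pass  # an ordinary handler: does its work and returns


def schedule_handler():
    raise SwitchedAway()  # never returns to the loop that called it


HANDLERS = {TIMER: timer_handler, SCHEDULE: schedule_handler}


def do_softirq(pending, trace):
    trace.append("enter")
    while pending:
        nr = (pending & -pending).bit_length() - 1  # lowest pending bit
        pending &= ~(1 << nr)                       # clear it, then service it
        HANDLERS[nr]()
    trace.append("return")


def run(pending):
    """Run the loop once and report what entry/exit tracing would record."""
    trace = []
    try:
        do_softirq(pending, trace)
    except SwitchedAway:
        trace.append("switched")
    return trace
```

With only the timer bit pending, the trace records both "enter" and "return". With the schedule bit pending as well, it records "enter" followed by "switched", which is the same shape as the printk output described above: an entry with no matching return.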

After those initial four entries into do_softirq, subsequent entries into do_softirq have current->domain->domain_id being either 0 or 0x7FFF, with a ratio of about 75% idle domain to 25% dom0. It seems somewhat strange that the idle domain would get scheduled as much as dom0, given that due to qemu-dm spinning madly in dom0, the system is actually 100% "busy" after the HVM stalled.
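
A ratio like the one above can be tallied from the domain IDs collected by the trace code. The sketch below assumes the IDs have already been extracted into a list of integers; the function name, the constant name, and the input format are hypothetical, and only the values 0 and 0x7FFF come from the observations above.

```python
from collections import Counter

IDLE_DOMID = 0x7FFF  # the domain_id seen for the idle domain (name is made up)


def tally_domains(domain_ids):
    """Return the share of do_softirq entries per domain as fractions of 1."""
    counts = Counter(domain_ids)
    total = sum(counts.values())
    shares = {}
    for domid, count in counts.items():
        label = "idle" if domid == IDLE_DOMID else "dom%d" % domid
        shares[label] = count / total
    return shares
```

For example, a sample of three idle-domain entries and one dom0 entry gives `{"idle": 0.75, "dom0": 0.25}`.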

I'm continuing to dig into this on my own, but any pointers or corrections of wrongful assumptions on my part are welcome, since delving into Xen is my first serious excursion into the scary territories hiding beneath the safe and comfortable realm of Userland on a modern x86 architecture. :)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
