Re: [Xen-devel] [HVM] Corruption of buffered_io_page

To:	xen-devel@xxxxxxxxxxxxxxxxxxx
Subject:	Re: [Xen-devel] [HVM] Corruption of buffered_io_page
From:	"Trolle Selander" <trolle.selander@xxxxxxxxx>
Date:	Thu, 7 Dec 2006 11:56:39 +0100

Changeset 12622 was the one I thought was the "big patch", since it showed up one or two days after Keir said he was going to post it. In any case, it fixed both the first segment-base issue I ran into, which I originally mailed you about, and a second one with a non-zero stack segment base that I hit immediately afterwards, when I did a "quick and dirty" fix for the cs seg_base. I haven't seen anything "obviously" segment-base related since then.

As for seeing the bad value in any of the registers around the time the corruption occurs, I've looked, but I haven't spotted it yet.

On 12/7/06, Petersson, Mats <Mats.Petersson@xxxxxxx> wrote:

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> [mailto: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> Trolle Selander
> Sent: 07 December 2006 09:51
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [HVM] Corruption of buffered_io_page
>
> I thought I had replied to the list, but apparently Gmail's
> default reply action goes to the last poster, not the mailing
> list. Ian got these answers already, but I'll cut & paste to
> make sure they get to the general list as well:
>
>       Distro is FC6, compiler is gcc-4.1.1, guest OS is OS/2,
> workload is the boot process. :)
>       No drivers are loaded yet at this stage - it happens
> fairly early in the boot. However, after I added the
> corruption-catching "padding" struct member, the boot does in
> fact progress to the driver loading stage, although with
> severely corrupted boot-logo graphics.
>       Since it currently happens reproducibly after a
> specific "no op" vmexit (read from an unused port), Mats's
> suggestion of marking the iopage read-only sounds doable if I
> insert code to set the page readonly when this specific
> vmexit occurs. From what I saw when running qemu in the
> debugger, there's no "proper" use of the page about to occur,
> so the only thing that will write to it should be whatever is
> doing it erroneously. I'll try that tomorrow.
>
>       One correction: I managed to confuse myself a bit here.
> The very last vmexit_ioio at which the guest stalls is a read
> from 0x1f7, but when that io happens, the iopage is already
> corrupted, and that's why it stalls - qemu-dm is "stuck" and
> never performs the io. The port 0x23 is the io preceding
> that one - the last one that "gets through", which is why
> that was the one I've used to trace things.
>
>       I don't know if it's any clue to anyone, but the bad
> value that gets written into read_pointer is 0x1df1000.

One thing is for sure, it's not a page-table entry. But it could be the
value of a physical page-address. Is this value in any of the registers
around the time of the crash?

>
> Now to what you said - I thought Keir's patch fixed up all
> the segment base = 0 assumptions in x86_emulate? At least
> we're past the problem it was causing, which I posted about
> before. I must confess I never actually looked at the code,
> because Keir said the patch would fix all the segment base =
> 0 assumptions, and once the patch showed up in Mercurial and
> I built from that changeset, I didn't hit the seg.base != 0
> problem anymore.

There was at least one patch from Jan Beulich to fix the HVM side of
seg.base != 0, but as far as I've seen, Keir hasn't posted his "big
patch" yet; it's possible that Keir could send you a "private patch".
That patch fixes x86_emulate.c, which isn't in the HVM section: it's the
part that fixes up page-table writes (which you may not see many of
early in the boot, so it may not be an issue, of course).

Whilst marking the page read-only and trapping on the page fault is
indeed doable, I'd also add a check on every vmexit and right before
vmrun (in the C code just before calling svm_asm_do_resume() or some
such).

--
Mats
>
>
> On 12/7/06, Petersson, Mats < Mats.Petersson@xxxxxxx> wrote:
>
>       > -----Original Message-----
>       > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
>       > [mailto: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
>       > Ian Pratt
>       > Sent: 06 December 2006 22:05
>       > To: Trolle Selander; xen-devel@xxxxxxxxxxxxxxxxxxx
>       > Subject: RE: [Xen-devel] [HVM] Corruption of buffered_io_page
>       >
>       > > read_pointer is the first member of buffered_ioreq_t, so I had a
>       > > hunch that the corruption was caused by something other than a
>       > > wrong value actually being written into the structure member:
>       > > either an overflow from a preceding structure in memory or a
>       > > pointer variable mistake. I thus added a 64-bit dummy member to
>       > > "pad" the buffered_ioreq_t structure at the start, and, as I had
>       > > suspected, the bad value does get written into this dummy member
>       > > rather than into read_pointer. I haven't (yet) been able to track
>       > > down what actually writes the bad value, and any help finding it
>       > > would be welcome.
>       >
>       > What compiler are you using? What guest OS? Are you using PV
>       > or emulated drivers? Any idea if there are particular workloads
>       > that provoke the problem?
>
>       I'll answer for Trolle as best as I can:
>       Compiler: gcc 4.1 I believe.
>       Guest OS: OS/2
>       Drivers would be emulated ones.
>       I think it's failing during initial boot, as Trolle hasn't
>       told me "It works" yet... ;-)
>
>       By the way, I'm still a bit worried that this is caused by
>       segment base != 0 in x86_emulate.c - this can cause all sorts of
>       "interesting" interaction between the page-table updates and the
>       actual memory being affected.
>
>       --
>       Mats
>       >
>       > Best,
>       > Ian
>       >
>       >
>       > _______________________________________________
>       > Xen-devel mailing list
>       > Xen-devel@xxxxxxxxxxxxxxxxxxx
>       > http://lists.xensource.com/xen-devel
