On Fri, Dec 11, 2009 at 10:15:32AM +0000, Tim Deegan wrote:
> Hi,
>
> I seem to have missed this thread earlier; sorry about that.
>
> At 01:39 +0000 on 11 Dec (1260495591), Simon Horman wrote:
> > * The memory access is to a MMIO region - a control register for the NIC.
> > The access is made by the e1000 gPXE driver.
> >
> > * Interestingly, if there is a write to the page (or perhaps even anywhere
> > in the entire 20-page MMIO region?) before a read, the problem doesn't
> > occur. So a simple work-around in the gPXE driver is to just write to a
> > register before reading any.
> >
> > * In sh_page_fault() the call to gfn_to_mfn_guest_dbg(d, gfn, &p2mt) sets
> > p2mt to p2m_mmio_direct, which seems to be correct. I'm a bit
> > stuck in working out what goes wrong from there.
>
> So from the trace below it looks like the shadow fault handler is trying
> to emulate an access to a passthrough MMIO address. That's surprising.
> And if it only happens when the first access is a read, that's a bit
> suspect too.
Sorry, I wasn't as clear as I might have been. The problem occurs
if the first access is a write. The work-around is to make
the first access a read.
> > 106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
>
> Oh, but this is a write, according to the error code.
>
> > 106707 sh: sh_page_fault__guest_2(): 2954: A
> > 106708 sh: sh_page_fault__guest_2(): 2998: B
> > 106709 sh: sh_page_fault__guest_2(): 3077: page_fault_slow_path:
> > 106710 sh: sh_page_fault__guest_2(): 3081: C
> > 106711 sh: sh_page_fault__guest_2(): 3095: rewalk:
> > 106712 sh: sh_page_fault__guest_2(): 3097: D
> > 106713 sh: sh_page_fault__guest_2(): 3106: E
> > 106714 sh: sh_page_fault__guest_2(): 3122: F
> > (XEN) _gfn_to_mfn_type_dbg: current
> > 106715 sh: sh_page_fault__guest_2(): 3141: gfn_to_mfn_guest_dbg: p2mt=5
> > 106716 sh: sh_page_fault__guest_2(): 3143: G
> > 106717 sh: sh_page_fault__guest_2(): 3181: H
> > 106718 sh: sh_page_fault__guest_2(): 3193: I
> > 106719 sh: sh_page_fault__guest_2(): 3204: J
> > 106720 sh: sh_page_fault__guest_2(): 3216: K
> > 106721 shdebug: make_fl1_shadow(): (f3000)=>118644
>
> FL1 shadow means we're building a shadow L1 that's equivalent to a
> single PSE guest entry, pointing a guest frames f3000 to f31ff.
>
> > 106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
> > 106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow
> > 0000000118644067
> > 106724 sh: sh_page_fault__guest_2(): 3241: L
> > 106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow
> > 00000000f0500037
> > 106726 sh: sh_page_fault__guest_2(): 3263: M
> > 106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow
> > 00000000f0501037
> > 106728 shdebug: _sh_propagate(): prefetch level 1 guest f3002067 shadow
> > 00000000f0502037
>
> [...] and the shadow PTE flags look OK: A, PCD, U/S, R/W, P.
>
> > 106758 sh: sh_page_fault__guest_2(): 3285: N
> > 106759 sh: sh_page_fault__guest_2(): 3310: O
> > 106760 sh: sh_page_fault__guest_2(): 3319: P
> > 106761 sh: sh_page_fault__guest_2(): 3332: Q
> > 106762 sh: sh_page_fault__guest_2(): 3343: goto emulate;
>
> That's the interesting one. I guess your line numbers are different
> because of the extra printout but it must be this:
>
> if ( is_hvm_domain(d)
> && unlikely(!hvm_wp_enabled(v))
> && regs->error_code == (PFEC_write_access|PFEC_page_present) )
> {
> perfc_incr(shadow_fault_emulate_wp);
> goto emulate;
> }
Yes, it is that.
> Hah. That explains why we're emulating. The guest has CR0.WP clear,
> and this was a write fault that wouldn't have happened on real hardware,
> so we need to emulate it because we can't actually disable CR0.WP or the
> shadow pagetables stop working altogether.
>
> In fact in this case we don't need to emulate it because retrying would
> be good enough, but in the general case we might.
I'm not sure that I understand why that is true.
> > 106763 sh: sh_page_fault__guest_2(): 3361: emulate:
> > 106764 sh: sh_page_fault__guest_2(): 3367: R
> > 106765 sh: sh_page_fault__guest_2(): 3390: emulate_readonly:
> > 106766 sh: sh_page_fault__guest_2(): 3403: early_emulation:
> > 106767 sh: sh_page_fault__guest_2(): 3405: S
> > 106768 sh: sh_page_fault__guest_2(): emulate: eip=0x6f6 esp=0x3d264
> > 106769 sh: sh_page_fault__guest_2(): 3446: T
> > 106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn
> > 0xf0500
>
> The emulation fails (because the emulator can't/won't map MMIO space)
> and we're about to bail out, try unshadowing everything and retry (at
> which point everything will just work). Except:
>
> > 106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500
>
> sh_remove_shadows isn't ready to be called with an MMIO MFN. We used to
> get away with this before the m2p was made sparse, but now MMIO holes
> mean m2p holes.
>
> Two fixes suggest themselves. The first is to gate the CR0.WP emulation
> path on mfn_valid(). The emulator won't handle it so it only leads to
> confusion and delay if we try:
>
> --- a/xen/arch/x86/mm/shadow/multi.c Fri Dec 11 09:17:09 2009 +0000
a
> +++ b/xen/arch/x86/mm/shadow/multi.c Fri Dec 11 10:06:39 2009 +0000
> @@ -3305,7 +3305,8 @@
> * fault was a non-user write to a present page. */
> if ( is_hvm_domain(d)
> && unlikely(!hvm_wp_enabled(v))
> - && regs->error_code == (PFEC_write_access|PFEC_page_present) )
> + && regs->error_code == (PFEC_write_access|PFEC_page_present)
> + && mfn_valid(gmfn) )
> {
> perfc_incr(shadow_fault_emulate_wp);
> goto emulate;
That resolves the problem that I observed.
>
>
> The second is to make sh_remove_shadows() safe against being called with
> a bogus MFN. I'm not quite decided whether the right answer is to put
> "if (!mfn_valid(gmfn)) return;" at the top of it (which would stop a
> certain class of crashes) or just "ASSERT(gmfn_valid(gmfn));" to make it
> clearer what's gone wrong.
Adding "if (!mfn_valid(gmfn)) return;" near the top of sh_remove_shadows()
also resolves the problem that I was seeing.
Thanks (x2) !
Although I'm not really that familiar with the code in question,
I think that I have a preference for both guarding the goto emulate
and adding an ASSERT to sh_remove_shadows(). That is:
Index: xen-unstable.hg/xen/arch/x86/mm/shadow/common.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/common.c 2009-12-11
20:31:48.000000000 +0900
+++ xen-unstable.hg/xen/arch/x86/mm/shadow/common.c 2009-12-11
20:48:48.000000000 +0900
@@ -2752,6 +2752,7 @@ void sh_remove_shadows(struct vcpu *v, m
};
ASSERT(!(all && fast));
+ ASSERT(mfn_valid(gmfn));
/* Although this is an externally visible function, we do not know
* whether the shadow lock will be held when it is called (since it
Index: xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/multi.c 2009-12-11
20:47:17.000000000 +0900
+++ xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c 2009-12-11
20:48:17.000000000 +0900
@@ -3305,7 +3305,8 @@ static int sh_page_fault(struct vcpu *v,
* fault was a non-user write to a present page. */
if ( is_hvm_domain(d)
&& unlikely(!hvm_wp_enabled(v))
- && regs->error_code == (PFEC_write_access|PFEC_page_present) )
+ && regs->error_code == (PFEC_write_access|PFEC_page_present)
+ && mfn_valid(gmfn) )
{
perfc_incr(shadow_fault_emulate_wp);
goto emulate;
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|