Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range

On Fri, Dec 11, 2009 at 10:15:32AM +0000, Tim Deegan wrote:
> Hi,
> 
> I seem to have missed this thread earlier; sorry about that. 
> 
> At 01:39 +0000 on 11 Dec (1260495591), Simon Horman wrote:
> > * The memory access is to a MMIO region - a control register for the NIC.
> >   The access is made by the e1000 gPXE driver.
> > 
> > * Interestingly, if there is a write to the page (or perhaps even anywhere
> >   in the entire 20-page MMIO region?) before a read, the problem doesn't
> >   occur. So a simple work-around in the gPXE driver is to just write to a
> >   register before reading any.
> > 
> > * In sh_page_fault() the call to gfn_to_mfn_guest_dbg(d, gfn, &p2mt) sets
> >   p2mt to p2m_mmio_direct, which seems to be correct. I'm a bit
> >   stuck in working out what goes wrong from there.
> 
> So from the trace below it looks like the shadow fault handler is trying
> to emulate an access to a passthrough MMIO address.  That's surprising. 
> And if it only happens when the first access is a read, that's a bit 
> suspect too. 

Sorry, I wasn't as clear as I might have been. The problem occurs
if the first access is a write. The work-around is to make
the first access a read.

> > 106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
> 
> Oh, but this is a write, according to the error code. 
> 
> > 106707 sh: sh_page_fault__guest_2(): 2954: A
> > 106708 sh: sh_page_fault__guest_2(): 2998: B
> > 106709 sh: sh_page_fault__guest_2(): 3077: page_fault_slow_path:
> > 106710 sh: sh_page_fault__guest_2(): 3081: C
> > 106711 sh: sh_page_fault__guest_2(): 3095: rewalk:
> > 106712 sh: sh_page_fault__guest_2(): 3097: D
> > 106713 sh: sh_page_fault__guest_2(): 3106: E
> > 106714 sh: sh_page_fault__guest_2(): 3122: F
> > (XEN) _gfn_to_mfn_type_dbg: current
> > 106715 sh: sh_page_fault__guest_2(): 3141: gfn_to_mfn_guest_dbg: p2mt=5
> > 106716 sh: sh_page_fault__guest_2(): 3143: G
> > 106717 sh: sh_page_fault__guest_2(): 3181: H
> > 106718 sh: sh_page_fault__guest_2(): 3193: I
> > 106719 sh: sh_page_fault__guest_2(): 3204: J
> > 106720 sh: sh_page_fault__guest_2(): 3216: K
> > 106721 shdebug: make_fl1_shadow(): (f3000)=>118644
> 
> FL1 shadow means we're building a shadow L1 that's equivalent to a
> single PSE guest entry, pointing a guest frames f3000 to f31ff.
> 
> > 106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
> > 106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow 
> > 0000000118644067
> > 106724 sh: sh_page_fault__guest_2(): 3241: L
> > 106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow 
> > 00000000f0500037
> > 106726 sh: sh_page_fault__guest_2(): 3263: M
> > 106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow 
> > 00000000f0501037
> > 106728 shdebug: _sh_propagate(): prefetch level 1 guest f3002067 shadow 
> > 00000000f0502037
> 
> [...] and the shadow PTE flags look OK: A, PCD, U/S, R/W, P.
> 
> > 106758 sh: sh_page_fault__guest_2(): 3285: N
> > 106759 sh: sh_page_fault__guest_2(): 3310: O
> > 106760 sh: sh_page_fault__guest_2(): 3319: P
> > 106761 sh: sh_page_fault__guest_2(): 3332: Q
> > 106762 sh: sh_page_fault__guest_2(): 3343: goto emulate;
> 
> That's the interesting one.  I guess your line numbers are different
> because of the extra printout but it must be this:
> 
>     if ( is_hvm_domain(d) 
>          && unlikely(!hvm_wp_enabled(v)) 
>          && regs->error_code == (PFEC_write_access|PFEC_page_present) )
>     {
>         perfc_incr(shadow_fault_emulate_wp);
>         goto emulate;
>     }

Yes, it is that.

> Hah.  That explains why we're emulating.  The guest has CR0.WP clear,
> and this was a write fault that wouldn't have happened on real hardware,
> so we need to emulate it because we can't actually disable CR0.WP or the
> shadow pagetables stop working altogether.
> 
> In fact in this case we don't need to emulate it because retrying would
> be good enough, but in the general case we might.

I'm not sure that I understand why that is true.

> > 106763 sh: sh_page_fault__guest_2(): 3361: emulate:
> > 106764 sh: sh_page_fault__guest_2(): 3367: R
> > 106765 sh: sh_page_fault__guest_2(): 3390: emulate_readonly:
> > 106766 sh: sh_page_fault__guest_2(): 3403: early_emulation:
> > 106767 sh: sh_page_fault__guest_2(): 3405: S
> > 106768 sh: sh_page_fault__guest_2(): emulate: eip=0x6f6 esp=0x3d264
> > 106769 sh: sh_page_fault__guest_2(): 3446: T
> > 106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn 
> > 0xf0500
> 
> The emulation fails (because the emulator can't/won't map MMIO space)
> and we're about to bail out, try unshadowing everything and retry (at
> which point everything will just work).  Except:
> 
> > 106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500
> 
> sh_remove_shadows isn't ready to be called with an MMIO MFN.  We used to
> get away with this before the m2p was made sparse, but now MMIO holes
> mean m2p holes. 
> 
> Two fixes suggest themselves.  The first is to gate the CR0.WP emulation
> path on mfn_valid().  The emulator won't handle it so it only leads to
> confusion and delay if we try:
> 
> --- a/xen/arch/x86/mm/shadow/multi.c  Fri Dec 11 09:17:09 2009 +0000
a
> +++ b/xen/arch/x86/mm/shadow/multi.c  Fri Dec 11 10:06:39 2009 +0000
> @@ -3305,7 +3305,8 @@
>       * fault was a non-user write to a present page.  */
>      if ( is_hvm_domain(d) 
>           && unlikely(!hvm_wp_enabled(v)) 
> -         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
> +         && regs->error_code == (PFEC_write_access|PFEC_page_present) 
> +         && mfn_valid(gmfn) )
>      {
>          perfc_incr(shadow_fault_emulate_wp);
>          goto emulate;

That resolves the problem that I observed.
> 
> 
> The second is to make sh_remove_shadows() safe against being called with
> a bogus MFN.  I'm not quite decided whether the right answer is to put 
> "if (!mfn_valid(gmfn)) return;" at the top of it (which would stop a
> certain class of crashes) or just "ASSERT(gmfn_valid(gmfn));" to make it
> clearer what's gone wrong.  

Adding "if (!mfn_valid(gmfn)) return;" near the top of sh_remove_shadows()
also resolves the problem that I was seeing.

Thanks (x2) !

Although I'm not really that familiar with the code in question,
I think that I have a preference for both guarding the goto emulate
and adding an ASSERT to sh_remove_shadows(). That is:

Index: xen-unstable.hg/xen/arch/x86/mm/shadow/common.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/common.c        2009-12-11 
20:31:48.000000000 +0900
+++ xen-unstable.hg/xen/arch/x86/mm/shadow/common.c     2009-12-11 
20:48:48.000000000 +0900
@@ -2752,6 +2752,7 @@ void sh_remove_shadows(struct vcpu *v, m
     };
 
     ASSERT(!(all && fast));
+    ASSERT(mfn_valid(gmfn));
 
     /* Although this is an externally visible function, we do not know
      * whether the shadow lock will be held when it is called (since it
Index: xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/multi.c 2009-12-11 
20:47:17.000000000 +0900
+++ xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c      2009-12-11 
20:48:17.000000000 +0900
@@ -3305,7 +3305,8 @@ static int sh_page_fault(struct vcpu *v,
      * fault was a non-user write to a present page.  */
     if ( is_hvm_domain(d) 
          && unlikely(!hvm_wp_enabled(v)) 
-         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
+         && regs->error_code == (PFEC_write_access|PFEC_page_present)
+         && mfn_valid(gmfn) )
     {
         perfc_incr(shadow_fault_emulate_wp);
         goto emulate;


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
WARNING - OLD ARCHIVES

xen-devel

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned