[Xen-devel] Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm:

On Mon, 2 May 2011, Konrad Rzeszutek Wilk wrote:
> As a consequence of the commit:
> 
> commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> Author: Yinghai Lu <yinghai@xxxxxxxxxx>
> Date:   Fri Dec 17 16:58:28 2010 -0800
> 
>     x86-64, mm: Put early page table high
> 
> it causes the Linux kernel to crash under Xen:
> 
> mapping kernel into physical memory
> Xen: setup ISA identity maps
> about to get started...
> (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) 
> for mfn b1d89 (pfn bacf7)
> (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 
> [ec=0000]
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> ...
> 
> The reason is that at some point init_memory_mapping is going to reach
> the pagetable pages area and map those pages too (mapping them as normal
> memory that falls in the range of addresses passed to init_memory_mapping
> as argument). Some of those pages are already pagetable pages (they are
> in the range pgt_buf_start-pgt_buf_end) therefore they are going to be
> mapped RO and everything is fine.
> Some of these pages are not pagetable pages yet (they fall in the range
> pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> are going to be mapped RW.  When these pages become pagetable pages and
> are hooked into the pagetable, xen will find that the guest has already
> a RW mapping of them somewhere and fail the operation.
> The reason Xen requires pagetables to be RO is that the hypervisor needs
> to verify that the pagetables are valid before using them. The validation
> operations are called "pinning" (more details in arch/x86/xen/mmu.c).
> 
> In order to fix the issue we mark all the pages in the entire range
> pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> init_memory_mapping. Hence the kernel is going to crash as soon as one
> of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> ranges are RO).
> 
> For this reason, this function is introduced which is called _after_
> the init_memory_mapping has completed (in a perfect world we would
> call this function from init_memory_mapping, but lets ignore that).
> 
> Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> end,top] have all changed to new values (b/c another init_memory_mapping
> is called). Hence, the first time we enter this function, we save
> away the pgt_buf_start value and update the pgt_buf_[end,top].
> 
> When we detect that the "old" pgt_buf_start through pgt_buf_end
> PFNs have been reserved (so memblock_x86_reserve_range has been called),
> we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> 
> And then we update those "old" pgt_buf_[end|top] with the new ones
> so that we can redo this on the next pagetable.
> 
> Reviewed-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> [v1: Updated with Jeremy's comments]
> [v2: Added the crash output]
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  arch/x86/xen/mmu.c |  123 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 123 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index aef7af9..1bca25f 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1463,6 +1463,119 @@ static int xen_pgd_alloc(struct mm_struct *mm)
>       return ret;
>  }
>  
> +#ifdef CONFIG_X86_64
> +static __initdata u64 __last_pgt_set_rw = 0;
> +static __initdata u64 __pgt_buf_start = 0;
> +static __initdata u64 __pgt_buf_end = 0;
> +static __initdata u64 __pgt_buf_top = 0;
> +/*
> + * As a consequence of the commit:
> + * 
> + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> + * Author: Yinghai Lu <yinghai@xxxxxxxxxx>
> + * Date:   Fri Dec 17 16:58:28 2010 -0800
> + * 
> + *     x86-64, mm: Put early page table high
> + * 
> + * at some point init_memory_mapping is going to reach the pagetable pages
> + * area and map those pages too (mapping them as normal memory that falls
> + * in the range of addresses passed to init_memory_mapping as argument).
> + * Some of those pages are already pagetable pages (they are in the range
> + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> + * everything is fine.
> + * Some of these pages are not pagetable pages yet (they fall in the range
> + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> + * are going to be mapped RW.  When these pages become pagetable pages and
> + * are hooked into the pagetable, xen will find that the guest has already
> + * a RW mapping of them somewhere and fail the operation.
> + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> + * to verify that the pagetables are valid before using them. The validation
> + * operations are called "pinning".
> + * 
> + * In order to fix the issue we mark all the pages in the entire range
> + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> + * ranges are RO).
> + * 
> + * For this reason, 'mark_rw_past_pgt' is introduced which is called _after_
> + * the init_memory_mapping has completed (in a perfect world we would
> + * call this function from init_memory_mapping, but lets ignore that).
> + * 
> + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> + * end,top] have all changed to new values (b/c init_memory_mapping
> + * is called and setting up another new page-table). Hence, the first time
> + * we enter this function, we save away the pgt_buf_start value and update
> + * the pgt_buf_[end,top].
> + * 
> + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> + * 
> + * And then we update those "old" pgt_buf_[end|top] with the new ones
> + * so that we can redo this on the next pagetable.
> + */
> +static __init void mark_rw_past_pgt(void) {
> +
> +     if (pgt_buf_end > pgt_buf_start) {
> +             u64 addr, size;
> +
> +             /* Save it away. */
> +             if (!__pgt_buf_start) {
> +                     __pgt_buf_start = pgt_buf_start;
> +                     __pgt_buf_end = pgt_buf_end;
> +                     __pgt_buf_top = pgt_buf_top;
> +                     return;
> +             }
> +             /* If we get the range that starts at __pgt_buf_end that means
> +              * the range is reserved, and that in 'init_memory_mapping'
> +              * the 'memblock_x86_reserve_range' has been called with the
> +              * outdated __pgt_buf_start, __pgt_buf_end (the "new"
> +              * pgt_buf_[start|end|top] refer now to a new pagetable.
> +              * Note: we are called _after_ the pgt_buf_[..] have been
> +              * updated.*/
> +
> +             addr = 
> memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start),
> +                                                    &size, PAGE_SIZE);
> +
> +             /* Still not reserved, meaning 'memblock_x86_reserve_range'
> +              * hasn't been called yet. Update the _end and _top.*/
> +             if (addr == PFN_PHYS(__pgt_buf_start)) {
> +                     __pgt_buf_end = pgt_buf_end;
> +                     __pgt_buf_top = pgt_buf_top;
> +                     return;
> +             }
> +
> +             /* OK, the area is reserved, meaning it is time for us to
> +              * set RW for the old end->top PFNs. */
> +
> +             /* ..unless we had already done this. */
> +             if (__pgt_buf_end == __last_pgt_set_rw)
> +                     return;
> +
> +             addr = PFN_PHYS(__pgt_buf_end);
> +             
> +             /* set as RW the rest */
> +             printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
> +                     PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
> +             
> +             while (addr < PFN_PHYS(__pgt_buf_top)) {
> +                     make_lowmem_page_readwrite(__va(addr));
> +                     addr += PAGE_SIZE;
> +             }
> +             /* And update everything so that we are ready for the next
> +              * pagetable (the one created for regions past 4GB) */
> +             __last_pgt_set_rw = __pgt_buf_end;
> +             __pgt_buf_start = pgt_buf_start;
> +             __pgt_buf_end = pgt_buf_end;
> +             __pgt_buf_top = pgt_buf_top;
> +     }
> +     return;
> +}
> +#else
> +static __init void mark_rw_past_pgt(void) { }
> +#endif
>  static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
>  {
>  #ifdef CONFIG_X86_64
> @@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
>       unsigned long pfn = pte_pfn(pte);
>  
>       /*
> +      * A bit of optimization. We do not need to call the workaround
> +      * when xen_set_pte_init is called with a PTE with 0 as PFN.
> +      * That is b/c the pagetable at that point are just being populated
> +      * with empty values and we can save some cycles by not calling
> +      * the 'memblock' code.*/
> +     if (pfn)
> +             mark_rw_past_pgt();
> +     /*
>        * If the new pfn is within the range of the newly allocated
>        * kernel pagetable, and it isn't being mapped into an
>        * early_ioremap fixmap slot as a freshly allocated page, make sure
> @@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)
>  
>  static __init void xen_post_allocator_init(void)
>  {
> +     mark_rw_past_pgt();
> +
>  #ifdef CONFIG_XEN_DEBUG
>       pv_mmu_ops.make_pte = PV_CALLEE_SAVE(xen_make_pte_debug);
>  #endif



Unless I am missing something there is no guarantee that somebody else
won't use memory in the pgt_buf_end-pgt_buf_top range when the range is
still RO before mark_rw_past_pgt() is called again.  If so this code
works by coincidence, that is the reason why I didn't try to reuse the
pagetable_setup_done or the pagetable_setup_start hooks.

In any case this code looks very ugly and fragile, do we really want to
add a workaround as bad as this one rather than reverting the original
commit? I think it creates a bad precedent.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
WARNING - OLD ARCHIVES

xen-devel

[Xen-devel] Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put ear