Keir Fraser wrote:
On 28 Jul 2006, at 16:51, Ian Pratt wrote:
So, in summary, we know writable page tables are not broken; they just
don't help on typical workloads because the number of PTEs updated per
page is so low. However, they do hurt SMP guest performance. If we are
not seeing a benefit today, should we turn it off? Should we make it a
compile-time option, with the default off?
I wouldn't mind seeing wrpt removed altogether, or at least emulation
made the compile-time default for the moment. There's bound to be some
workload that bites us in the future, which is why batching updates on
the fork path mightn't be a bad thing if it can be done without too much
gratuitous hacking of Linux core code.
My only fear is that batched wrpt has some guest-visible effects. For
example, the guest has to be able to cope with seeing page directory
entries with the present bit cleared. Also, on SMP, it has to be able
to cope with spurious page faults anywhere in its address space (e.g.,
faults on an unhooked page table which some other VCPU has rehooked by
the time the Xen pagefault handler runs, hence the fault is bounced
back to the guest even though there is no work to be done). If we turn
off batched wrpt then guests will not be tested against it and we are
likely to hit problems if we ever want to turn it back on again --
we'll find that some guests are not able to correctly handle the weird
side effects.
On the other hand, perhaps we can find a neater, more explicit
alternative to batched wrpt in future.
This is a very nice win for shadow page tables on SMP. Basically, we
use the lazy state information to defer all the MMU hypercalls into a
single flush, which happens when leaving lazy MMU mode.
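For concreteness, here is a minimal sketch of what such a backend could
look like (hypothetical names throughout, including the hv_* calls; this
is not the actual VMI or Xen interface). PTE writes are queued per CPU
while lazy mode is active and pushed to the hypervisor in one batched
call on leave; the caller is assumed to hold the page table lock, so
preemption is off and the per-CPU state is stable.

#include <linux/percpu.h>
#include <linux/mm.h>
#include <asm/pgtable.h>

#define LAZY_MMU_BATCH	32

struct mmu_queue_entry {		/* one deferred page table write */
	unsigned long pte_machine_addr;	/* where the PTE lives */
	unsigned long new_val;		/* value to install */
};

struct lazy_mmu_state {
	int active;			/* between enter and leave? */
	int nr;				/* number of queued entries */
	struct mmu_queue_entry queue[LAZY_MMU_BATCH];
};

static DEFINE_PER_CPU(struct lazy_mmu_state, lazy_mmu);

static void lazy_mmu_flush(struct lazy_mmu_state *st)
{
	if (st->nr) {
		hv_mmu_update_batch(st->queue, st->nr);	/* one hypercall */
		st->nr = 0;
	}
}

void myarch_enter_lazy_mmu_mode(void)
{
	__get_cpu_var(lazy_mmu).active = 1;
}

void myarch_set_pte_at(struct mm_struct *mm, unsigned long addr,
		       pte_t *ptep, pte_t pte)
{
	struct lazy_mmu_state *st = &__get_cpu_var(lazy_mmu);

	if (!st->active) {
		hv_mmu_update_one(ptep, pte);	/* immediate, unbatched */
		return;
	}
	if (st->nr == LAZY_MMU_BATCH)		/* queue full: flush early */
		lazy_mmu_flush(st);
	st->queue[st->nr].pte_machine_addr = hv_pte_machine_addr(ptep);
	st->queue[st->nr].new_val = pte_val(pte);
	st->nr++;
}

void myarch_leave_lazy_mmu_mode(void)
{
	struct lazy_mmu_state *st = &__get_cpu_var(lazy_mmu);

	lazy_mmu_flush(st);		/* the single deferred flush */
	st->active = 0;
}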
At the PT level, this can be done without gratuitous hacking of Linux
core code. However, it cannot safely be extended to also cover setting
the parent page directory entry on SMP. It is a little unclear exactly
how this would work under a direct page table hypervisor - would you
still take the faults, or would you re-type and re-protect the pages
first? In the fork case, there can be two page tables being updated
because of COW, and re-typing both pages changes the crossover point at
which the batching becomes a win. But if the same hooks can be used for
direct mode, it makes sense to figure that out now so we don't have to
add four different sets of hooks to Linux (UP and SMP want slightly
different batching models, as might shadow and direct).
The PDE p-bit going missing is still a problem, and Linux can be changed
to deal with that - but it is messy.
One remaining issue with deferring direct page table updates is the
potential for read hazards. I believe there is only one read hazard in
the Linux mm code that could be exposed here, and the explicit, rather
than implicit, batching makes it quite a bit easier to reason about.
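As an illustration of the hazard (a made-up fragment, not taken from the
mm code; the names are just the generic hooks plus standard pte
helpers): with direct page tables and deferred updates, reading an entry
back before the batch is flushed can return the stale value.

arch_enter_lazy_mmu_mode();
set_pte_at(mm, addr, ptep, pte_mkwrite(pte));
if (!pte_write(*ptep)) {
	/* read hazard: the queued write is not yet visible in the
	 * live page table, so this branch can be taken even though
	 * the PTE was just made writable */
}
arch_leave_lazy_mmu_mode();	/* batch flushed; *ptep is now current */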
Zach
Implement lazy MMU update hooks that are SMP-safe for both direct and
shadow page tables. The idea is that PTE updates and page invalidations
performed while in lazy mode can be batched into a single hypercall. We
use this in VMI for shadow page table synchronization, and it is a win.
For SMP, the enter/leave must happen under the protection of the page
table locks for the page tables being modified. Otherwise, you end up
with stale state in the batched hypercall that other CPUs can race ahead
of. Doing this under the protection of the locks guarantees that the
synchronization is correct, and it also means that spurious faults
generated during this window by remote CPUs are handled properly, since
the page fault handler must re-check the PTE under the protection of the
same lock.
Signed-off-by: Zachary Amsden <zach@xxxxxxxxxx>
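(For reference, the spurious-fault handling the description relies on is
the revalidation the fault path already does under the PTE lock, roughly
as follows; this is condensed from handle_pte_fault() in mm/memory.c,
not verbatim.)

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry))) {
		/* the entry changed under us, e.g. another CPU's batch
		 * was flushed in the meantime: the fault is spurious
		 * and there is nothing left to do */
		pte_unmap_unlock(pte, ptl);
		return VM_FAULT_MINOR;
	}
	/* ... otherwise handle the fault normally, still under ptl ... */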
Index: linux-2.6.18-rc2/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.18-rc2.orig/include/asm-generic/pgtable.h 2006-07-28
14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/include/asm-generic/pgtable.h 2006-07-28
14:18:03.000000000 -0700
@@ -163,6 +163,11 @@ static inline void ptep_set_wrprotect(st
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif
+#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#define arch_enter_lazy_mmu_mode() do {} while (0)
+#define arch_leave_lazy_mmu_mode() do {} while (0)
+#endif
+
/*
* When walking page tables, get the address of the next boundary,
* or the end address of the range if that comes earlier. Although no
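(An architecture that actually batches would override the no-op defaults
above by defining the guard symbol and its own hooks before
asm-generic/pgtable.h is included, along these lines; the names below
are hypothetical and only illustrate the hook, the patch itself adds
just the fallbacks.)

/* in the architecture's pgtable.h */
#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
#define arch_enter_lazy_mmu_mode()	myarch_enter_lazy_mmu_mode()
#define arch_leave_lazy_mmu_mode()	myarch_leave_lazy_mmu_mode()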
Index: linux-2.6.18-rc2/mm/memory.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/memory.c 2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/memory.c 2006-07-28 14:18:44.000000000 -0700
@@ -506,6 +506,7 @@ again:
src_ptl = pte_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+ arch_enter_lazy_mmu_mode();
do {
/*
* We are holding two locks at this point - either of them
@@ -525,6 +526,7 @@ again:
copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(src_pte - 1);
@@ -627,6 +629,7 @@ static unsigned long zap_pte_range(struc
int anon_rss = 0;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
do {
pte_t ptent = *pte;
if (pte_none(ptent)) {
@@ -692,6 +695,7 @@ static unsigned long zap_pte_range(struc
pte_clear_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+ arch_leave_lazy_mmu_mode();
add_mm_rss(mm, file_rss, anon_rss);
pte_unmap_unlock(pte - 1, ptl);
@@ -1108,6 +1112,7 @@ static int zeromap_pte_range(struct mm_s
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
+ arch_enter_lazy_mmu_mode();
do {
struct page *page = ZERO_PAGE(addr);
pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
@@ -1117,6 +1122,7 @@ static int zeromap_pte_range(struct mm_s
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, zero_pte);
} while (pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
return 0;
}
@@ -1269,11 +1275,13 @@ static int remap_pte_range(struct mm_str
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
+ arch_enter_lazy_mmu_mode();
do {
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
return 0;
}
Index: linux-2.6.18-rc2/mm/mprotect.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/mprotect.c 2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mprotect.c 2006-07-28 14:17:25.000000000 -0700
@@ -33,6 +33,7 @@ static void change_pte_range(struct mm_s
spinlock_t *ptl;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
do {
oldpte = *pte;
if (pte_present(oldpte)) {
@@ -62,6 +63,7 @@ static void change_pte_range(struct mm_s
}
} while (pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
}
Index: linux-2.6.18-rc2/mm/mremap.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/mremap.c 2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mremap.c 2006-07-28 14:17:25.000000000 -0700
@@ -99,6 +99,7 @@ static void move_ptes(struct vm_area_str
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+ arch_enter_lazy_mmu_mode();
for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
new_pte++, new_addr += PAGE_SIZE) {
if (pte_none(*old_pte))
@@ -108,6 +109,7 @@ static void move_ptes(struct vm_area_str
pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
set_pte_at(mm, new_addr, new_pte, pte);
}
+ arch_leave_lazy_mmu_mode();
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
Index: linux-2.6.18-rc2/mm/msync.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/msync.c 2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/msync.c 2006-07-28 14:17:25.000000000 -0700
@@ -30,6 +30,7 @@ static unsigned long msync_pte_range(str
again:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
do {
struct page *page;
@@ -51,6 +52,7 @@ again:
ret += set_page_dirty(page);
progress += 3;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
if (addr != end)