Interesting. And non-intuitive. I think you are saying
that, at least theoretically (and using your ABCD, not
my ABC below), A is always faster than
(B | C), and (B | C) is always faster than D. Taking into
account the fact that the TLB size is fixed (I think),
C will always be faster than B and never slower than D.
So if the theory proves true, that does seem to eliminate
my objection.
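
For my own reference, here is that arithmetic written out as code -- a
rough sketch using the simplified guest-levels x p2m-levels model from
your mail, not a claim about exact hardware behaviour:

    /* With nested paging, every guest-physical address touched during
     * the guest walk must itself be resolved through the p2m, so a TLB
     * miss costs roughly guest_levels * p2m_levels memory reads. */
    static unsigned int walk_reads(unsigned int guest_levels,
                                   unsigned int p2m_levels)
    {
        return guest_levels * p2m_levels;
    }

    /* A: walk_reads(3, 3) ==  9   2MB guest pages, 2MB p2m
     * B: walk_reads(3, 4) == 12   2MB guest pages, 4KB p2m
     * C: walk_reads(4, 3) == 12   4KB guest pages, 2MB p2m
     * D: walk_reads(4, 4) == 16   4KB guest pages, 4KB p2m
     * With 1GB p2m entries the second factor drops to 2.    */
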
Thanks,
Dan
> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxxxxx]
> Sent: Friday, March 20, 2009 3:46 AM
> To: Dan Magenheimer
> Cc: Wei Huang; xen-devel@xxxxxxxxxxxxxxxxxxx; Keir Fraser; Tim Deegan
> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>
>
> Dan,
>
> Don't forget that this is about the p2m table, which is (if I
> understand
> correctly) orthogonal to what the guest pagetables are doing. So the
> scenario, if HAP is used, would be:
>
> A) DB code uses 2MB pages, OS uses 2MB pages, guest PTs use 2MB
> pages, P2M uses 2MB pages
> - A tlb miss requires 3 * 3 = 9 reads (Assuming 64-bit guest)
> B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4K pages
> - A tlb miss requires 3 * 4 = 12 reads
> C) DB code uses 4k pages, OS uses 4k pages, p2m uses 2MB pages
> - A tlb miss requires 4 * 3 = 12 reads
> D) DB code uses 4k pages, OS uses 4k pages, p2m uses 4k pages
> - A tlb miss requires 4 * 4 = 16 reads
>
> And adding the 1G p2m entries will change the multiplier from 3 to 2
> (i.e., 3*2 = 6 reads for superpages, 4*2 = 8 reads for 4k
> guest pages).
>
> (Those who are more familiar with the hardware, please correct me if
> I've made some mistakes or oversimplified things.)
>
> So adding 1G pages to the p2m table shouldn't change
> expectations of the
> guest OS in any case. Using it will benefit the guest to the same
> degree whether the guest is using 4k, 2MB, or 1G pages. (If I
> understand
> correctly.)
>
> -George
>
> Dan Magenheimer wrote:
> > Hi Wei --
> >
> > I'm not worried about the overhead of the splintering itself; I'm
> > worried about the "hidden overhead" every time a "silent
> > splinter" is used.
> >
> > Let's assume three scenarios (and for now use 2MB pages though
> > the same concerns can be extended to 1GB and/or mixed 2MB/1GB):
> >
> > A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
> > only 2MB pages (no splintering occurs)
> > B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
> > only 4KB pages (because of fragmentation, all 2MB pages have
> > been splintered)
> > C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides
> > 4KB pages
> >
> > Now run some benchmarks. Clearly one would assume that A is
> > faster than both B and C. The question is: Is B faster or slower
> > than C?
> >
> > If B is always faster than C, then I have less objection to
> > "silent splintering". But if B is sometimes (or maybe always?)
> > slower than C, that's a big issue because a user has gone through
> > the effort of choosing a better-performing system configuration
> > for their software (2MB DB on 2MB OS), but it actually performs
> > worse than if they had chosen the "lower performing" configuration.
> > And, worse, it will likely degrade over time, so performance
> > might be fine when the 2MB-DB-on-2MB-OS guest is launched
> > but get much worse when it is paused, save/restored, migrated,
> > or hot-failed. So even if B is only slightly faster than C,
> > if B is much slower than A, this is a problem.
> >
> > Does that make sense?
> >
> > Some suggestions:
> > 1) If it is possible for an administrator to determine how many
> > large pages (both 2MB and 1GB) were requested by each domain
> > and how many are currently whole-vs-splintered, that would help.
> > 2) We may need some form of memory defragmenter
> >
> >
> >> -----Original Message-----
> >> From: Wei Huang [mailto:wei.huang2@xxxxxxx]
> >> Sent: Thursday, March 19, 2009 12:52 PM
> >> To: Dan Magenheimer
> >> Cc: George Dunlap; xen-devel@xxxxxxxxxxxxxxxxxxx;
> >> keir.fraser@xxxxxxxxxxxxx; Tim Deegan
> >> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>
> >>
> >> Dan,
> >>
> >> Thanks for your comments. I am not sure which splintering overhead
> >> you are referring to. I can think of three areas:
> >>
> >> 1. splintering in page allocation
> >> In this case, Xen fails to allocate the requested page order, so it
> >> falls back to smaller pages to set up the p2m table. The overhead is
> >> O(guest_mem_size), which is a one-time cost.
> >>
> >> 2. P2M splits large page into smaller pages
> >> This is one-directional because we don't merge smaller pages back
> >> into large ones. The worst case is splitting all guest large pages,
> >> so the overhead is O(total_large_page_mem). In the long run the
> >> overhead converges to 0 because splitting is one-directional. Note
> >> this overhead also covers the case when the PoD feature is enabled.
> >>
> >> 3. CPU splintering
> >> If the CPU does not support 1GB pages, it automatically splinters
> >> them into smaller ones (such as 2MB). In this case the overhead is
> >> always there. But 1) this only happens on a small number of old
> >> chips; 2) I believe it is still faster than 4KB pages. CPUID (the
> >> 1GB page feature bit and the 1GB TLB entry counts) can be used to
> >> detect and avoid this problem if we really don't like it.
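> >>
> >> Detection would be along these lines (a rough userspace sketch using
> >> the GCC cpuid.h helper, not the actual Xen code):
> >>
> >>     #include <cpuid.h>
> >>
> >>     /* CPUID leaf 0x80000001, EDX bit 26 advertises 1GB page support;
> >>      * AMD leaf 0x80000019 reports the 1GB TLB sizes.  If the bit is
> >>      * clear we could simply refuse to install 1GB p2m entries. */
> >>     static int cpu_has_1gb_pages(void)
> >>     {
> >>         unsigned int eax, ebx, ecx, edx;
> >>
> >>         if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
> >>             return 0;
> >>         return (edx >> 26) & 1;
> >>     }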
> >>
> >> I agree with your concerns. Customers should have the right to make
> >> their own decision, but that requires the new feature to be enabled
> >> in the first place. For a lot of benchmarks the splintering overhead
> >> can be offset by the benefits of huge pages; SPECjbb is a good
> >> example of using large pages (see Ben Serebrin's presentation at Xen
> >> Summit). With that said, I agree with the idea of adding a new
> >> option to the guest configuration file.
> >>
> >> -Wei
> >>
> >>
> >> Dan Magenheimer wrote:
> >>
> >>> I'd like to reiterate my argument raised in a previous
> >>> discussion of hugepages: Just because this CAN be made
> >>> to work doesn't imply that it SHOULD be made to work.
> >>> Real users use larger pages in their OS for the sole
> >>> reason that they expect a performance improvement.
> >>> If it magically works, but works slowly (and possibly
> >>> slower than if the OS had just used small pages to
> >>> start with), this is likely to lead to unsatisfied
> >>> customers, and perhaps allegations such as "Xen sucks
> >>> when running databases".
> >>>
> >>> So, please, let's think this through before implementing
> >>> it just because we can. At a minimum, an administrator
> >>> should be somehow warned if large pages are getting splintered.
> >>>
> >>> And if it's going in over my objection, please tie it to
> >>> a boot option that defaults off so administrator action
> >>> is required to allow silent splintering.
> >>>
> >>> My two cents...
> >>> Dan
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Huang2, Wei [mailto:Wei.Huang2@xxxxxxx]
> >>>> Sent: Thursday, March 19, 2009 2:07 AM
> >>>> To: George Dunlap
> >>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx;
> >>>> Tim Deegan
> >>>> Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>>>
> >>>>
> >>>> Here are patches using the middle approach. They handle 1GB pages
> >>>> in PoD by remapping the 1GB entry with 2MB pages and retrying. I
> >>>> also added code for 1GB detection. Please comment.
> >>>>
> >>>> Thanks a lot,
> >>>>
> >>>> -Wei
> >>>>
> >>>> -----Original Message-----
> >>>> From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On Behalf Of
> >>>> George Dunlap
> >>>> Sent: Wednesday, March 18, 2009 12:20 PM
> >>>> To: Huang2, Wei
> >>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx;
> >>>> Tim Deegan
> >>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>>>
> >>>> Thanks for doing this work, Wei -- especially all the extra
> >>>> effort for the PoD integration.
> >>>>
> >>>> One question: How well would you say you've tested the PoD
> >>>> functionality? Or to put it the other way, how much do I need to
> >>>> prioritize testing this before the 3.4 release?
> >>>>
> >>>> It wouldn't be a bad idea to do as you suggested, and break
> >>>> things into 2MB pages for the PoD case. In order to take the best
> >>>> advantage of this in a PoD scenario, you'd need to have a balloon
> >>>> driver that could allocate 1G of contiguous *guest* p2m space,
> >>>> which seems a bit optimistic at this point...
> >>>>
> >>>> -George
> >>>>
> >>>> 2009/3/18 Huang2, Wei <Wei.Huang2@xxxxxxx>:
> >>>>
> >>>>> Current Xen supports 2MB super pages for NPT/EPT. The attached
> >>>>> patches extend this feature to support 1GB pages. The PoD
> >>>>> (populate-on-demand) code introduced by George Dunlap made P2M
> >>>>> modification harder. I tried to preserve the existing PoD design
> >>>>> by introducing a 1GB PoD cache list.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Note that 1GB PoD can be dropped if we don't care about 1GB when
> >>>>> PoD is enabled. In this case, we can just split a 1GB PDPE into
> >>>>> 512 2MB PDE entries and grab pages from the PoD super list. That
> >>>>> can pretty much make 1gb_p2m_pod.patch go away.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Any comments/suggestions on the design will be appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>>
> >>>>>
> >>>>> -Wei
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> The following is the description:
> >>>>>
> >>>>> === 1gb_tools.patch ===
> >>>>>
> >>>>> Extend the existing setup_guest() function. Basically, it tries
> >>>>> to allocate 1GB pages whenever available. If this request fails,
> >>>>> it falls back to 2MB. If both fail, then 4KB pages will be used.
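> >>>>>
> >>>>> Roughly, the fallback looks like this (a hypothetical sketch, not
> >>>>> the patch itself; try_populate() stands in for the real
> >>>>> populate_physmap call, and the orders are 18/9/0 for 1GB/2MB/4KB):
> >>>>>
> >>>>>     /* Hypothetical wrapper around the populate_physmap memory op. */
> >>>>>     extern int try_populate(int domid, unsigned long base_gfn,
> >>>>>                             unsigned int order);
> >>>>>
> >>>>>     /* Try the largest page order first and fall back on failure. */
> >>>>>     static int populate_chunk(int domid, unsigned long base_gfn)
> >>>>>     {
> >>>>>         static const unsigned int orders[] = { 18, 9, 0 };
> >>>>>         unsigned int i;
> >>>>>
> >>>>>         for (i = 0; i < 3; i++)
> >>>>>             if (try_populate(domid, base_gfn, orders[i]) == 0)
> >>>>>                 return 0;
> >>>>>         return -1;   /* even 4KB failed: out of memory */
> >>>>>     }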
> >>>>>
> >>>>>
> >>>>>
> >>>>> === 1gb_p2m.patch ===
> >>>>>
> >>>>> * p2m_next_level()
> >>>>>
> >>>>> Check the PSE bit of the L3 page table entry. If a 1GB page is
> >>>>> found (PSE=1), we split it into 512 2MB pages.
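> >>>>>
> >>>>> The split itself is conceptually simple (a hypothetical sketch
> >>>>> with simplified entry handling, not the patch code): copy the 1GB
> >>>>> entry's frame and flags into 512 2MB entries, then point the L3
> >>>>> entry at the new L2 table with PSE cleared:
> >>>>>
> >>>>>     typedef unsigned long pte_t;
> >>>>>     #define PSE_FLAG 0x80UL
> >>>>>
> >>>>>     /* Hypothetical: returns the frame number of a table page. */
> >>>>>     extern unsigned long table_mfn(pte_t *table);
> >>>>>
> >>>>>     static void split_1gb_entry(pte_t *l3e, pte_t *l2 /* 512 slots */)
> >>>>>     {
> >>>>>         unsigned long mfn   = *l3e >> 12;     /* 1GB-aligned frame   */
> >>>>>         unsigned long flags = *l3e & 0xfffUL; /* PSE kept: 2MB pages */
> >>>>>         unsigned int i;
> >>>>>
> >>>>>         for (i = 0; i < 512; i++)
> >>>>>             l2[i] = ((mfn + i * 512UL) << 12) | flags; /* +2MB each */
> >>>>>
> >>>>>         *l3e = (table_mfn(l2) << 12) | (flags & ~PSE_FLAG);
> >>>>>     }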
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_set_entry()
> >>>>>
> >>>>> Configure the PSE bit of L3 P2M table if page order == 18 (1GB).
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_gfn_to_mfn()
> >>>>>
> >>>>> Add support for the 1GB case when doing gfn to mfn translation.
> >>>>> When the L3 entry is marked as POPULATE_ON_DEMAND, we call
> >>>>> p2m_pod_demand_populate(). Otherwise, we do the regular address
> >>>>> translation (gfn ==> mfn).
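> >>>>>
> >>>>> For a 1GB leaf the translation itself is just an index into the
> >>>>> superpage (a hypothetical sketch with a simplified entry format):
> >>>>>
> >>>>>     typedef unsigned long pte_t;
> >>>>>     #define PSE_FLAG 0x80UL
> >>>>>
> >>>>>     /* A 1GB page covers 2^18 4KB frames, so the low 18 bits of
> >>>>>      * the gfn select the frame inside the superpage. */
> >>>>>     static unsigned long l3_superpage_to_mfn(pte_t l3e,
> >>>>>                                              unsigned long gfn)
> >>>>>     {
> >>>>>         unsigned long base_mfn = l3e >> 12;
> >>>>>
> >>>>>         if (!(l3e & PSE_FLAG))
> >>>>>             return 0;   /* not a 1GB mapping */
> >>>>>         return (base_mfn & ~0x3ffffUL) | (gfn & 0x3ffffUL);
> >>>>>     }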
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_gfn_to_mfn_current()
> >>>>>
> >>>>> This is similar to p2m_gfn_to_mfn(). When the L3 entry is marked
> >>>>> as POPULATE_ON_DEMAND, it demands a populate using
> >>>>> p2m_pod_demand_populate(). Otherwise, it does a normal
> >>>>> translation. The 1GB page case is taken into consideration.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * set_p2m_entry()
> >>>>>
> >>>>> Request 1GB page
> >>>>>
> >>>>>
> >>>>>
> >>>>> * audit_p2m()
> >>>>>
> >>>>> Support 1GB while auditing p2m table.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_change_type_global()
> >>>>>
> >>>>> Deal with 1GB page when changing global page type.
> >>>>>
> >>>>>
> >>>>>
> >>>>> === 1gb_p2m_pod.patch ===
> >>>>>
> >>>>> * xen/include/asm-x86/p2m.h
> >>>>>
> >>>>> Minor change to deal with PoD. It separates the super page cache
> >>>>> list into 2MB and 1GB lists. Similarly, we record the last gpfn of
> >>>>> sweeping for both 2MB and 1GB.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_cache_add()
> >>>>>
> >>>>> Check page order and add 1GB super page into PoD 1GB cache list.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_cache_get()
> >>>>>
> >>>>> Grab a page from the cache list. It tries to break a 1GB page
> >>>>> into 512 2MB pages if the 2MB PoD list is empty. Similarly, 4KB
> >>>>> pages can be requested from super pages. The breaking order is
> >>>>> 2MB then 1GB.
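> >>>>>
> >>>>> The break-down order can be sketched like this (hypothetical,
> >>>>> counters only, not the patch code):
> >>>>>
> >>>>>     struct pod_cache {
> >>>>>         int count_4k, count_2m, count_1g;  /* simplified lists */
> >>>>>     };
> >>>>>
> >>>>>     /* Serve a request from its own list; otherwise split the next
> >>>>>      * larger page -- 2MB before 1GB.  Orders: 18 = 1GB, 9 = 2MB. */
> >>>>>     static int pod_cache_get(struct pod_cache *c, unsigned int order)
> >>>>>     {
> >>>>>         if (order == 18) {                     /* 1GB request */
> >>>>>             if (c->count_1g == 0) return -1;
> >>>>>             c->count_1g--; return 0;
> >>>>>         }
> >>>>>         if (order == 9) {                      /* 2MB request */
> >>>>>             if (c->count_2m == 0) {
> >>>>>                 if (c->count_1g == 0) return -1;
> >>>>>                 c->count_1g--; c->count_2m += 512; /* break 1GB */
> >>>>>             }
> >>>>>             c->count_2m--; return 0;
> >>>>>         }
> >>>>>         /* 4KB request: break a 2MB page (which may itself break
> >>>>>          * a 1GB page) when the 4KB list is empty. */
> >>>>>         if (c->count_4k == 0 && pod_cache_get(c, 9) == 0)
> >>>>>             c->count_4k += 512;
> >>>>>         if (c->count_4k == 0) return -1;
> >>>>>         c->count_4k--; return 0;
> >>>>>     }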
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_cache_target()
> >>>>>
> >>>>> This function is used to set the PoD cache size. To increase the
> >>>>> PoD target, we try to allocate 1GB pages from the Xen domheap. If
> >>>>> this fails, we try 2MB. If both fail, we try 4KB, which is
> >>>>> guaranteed to work.
> >>>>>
> >>>>>
> >>>>>
> >>>>> To decrease the target, we use a similar approach. We first try
> >>>>> to free 1GB pages from the 1GB PoD cache list. If that fails, we
> >>>>> try the 2MB PoD cache list. If both fail, we try the 4KB list.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_zero_check_superpage_1gb()
> >>>>>
> >>>>> This adds a new function to check for a 1GB page. It is similar
> >>>>> to p2m_pod_zero_check_superpage_2mb().
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_zero_check_superpage_1gb()
> >>>>>
> >>>>> We add a new function to sweep 1GB pages from guest memory. This
> >>>>> is the same as p2m_pod_zero_check_superpage_2mb().
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_demand_populate()
> >>>>>
> >>>>> The trick of this function is to do a remap-and-retry if
> >>>>> p2m_pod_cache_get() fails. In that case, this function splits the
> >>>>> p2m table entry into smaller ones (e.g. 1GB ==> 2MB or
> >>>>> 2MB ==> 4KB). That guarantees populate demands always work.
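> >>>>>
> >>>>> The retry loop is roughly (a hypothetical sketch reusing the
> >>>>> cache-get sketch above, not the patch code):
> >>>>>
> >>>>>     struct pod_cache;                  /* from the sketch above */
> >>>>>     extern int pod_cache_get(struct pod_cache *c,
> >>>>>                              unsigned int order);
> >>>>>
> >>>>>     /* If no page of the faulting order is available, rewrite the
> >>>>>      * p2m entry at the next smaller order and retry there. */
> >>>>>     static int demand_populate(struct pod_cache *c, unsigned int order)
> >>>>>     {
> >>>>>         while (pod_cache_get(c, order) != 0) {
> >>>>>             if (order == 0)
> >>>>>                 return -1;             /* truly out of memory */
> >>>>>             order -= 9;                /* 1GB -> 2MB, 2MB -> 4KB */
> >>>>>             /* ...remap the faulting entry as 512 PoD entries of
> >>>>>              * the smaller order here, then retry at that order. */
> >>>>>         }
> >>>>>         return 0;
> >>>>>     }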
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel