
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support



Interesting.  And non-intuitive.  I think you are saying
that, at least theoretically (and using your ABCD, not
my ABC below), A is always faster than
(B | C), and (B | C) is always faster than D.  Taking into
account the fact that the TLB size is fixed (I think),
C will always be faster than B and never slower than D.

So if the theory proves true, that does seem to eliminate
my objection.

Thanks,
Dan

> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxxxxx]
> Sent: Friday, March 20, 2009 3:46 AM
> To: Dan Magenheimer
> Cc: Wei Huang; xen-devel@xxxxxxxxxxxxxxxxxxx; Keir Fraser; Tim Deegan
> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> 
> 
> Dan,
> 
> Don't forget that this is about the p2m table, which is (if I understand
> correctly) orthogonal to what the guest pagetables are doing.  So the
> scenario, if HAP is used, would be:
> 
> A) DB code uses 2MB pages in guest PTs, OS assumes 2MB pages,
>    guest PTs use 2MB pages, P2M uses 2MB pages
>  - A tlb miss requires 3 * 3 = 9 reads (Assuming 64-bit guest)
> B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4K pages
>  - A tlb miss requires 3 * 4 = 12 reads
> C) DB code uses 4k pages, OS uses 4k pages, p2m uses 2MB pages
>  - A tlb miss requires 4 * 3 = 12 reads
> D) DB code uses 4k pages, OS uses 4k pages, p2m uses 4k pages
>  - A tlb miss requires 4 * 4 = 16 reads
> 
> And adding the 1G p2m entries will change the multiplier from 3 to 2
> (i.e., 3*2 = 6 reads for superpages, 4*2 = 8 reads for 4k guest pages).
> 
> (Those who are more familiar with the hardware, please correct me if 
> I've made some mistakes or oversimplified things.)
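> 
> A minimal sketch of that arithmetic (this only encodes the simplified
> "guest walk levels x p2m walk levels" model above, not the exact hardware
> 2-D walk):
> 
>     #include <stdio.h>
> 
>     /* Walk depth of a 64-bit (4-level) table for a given leaf page size. */
>     static int walk_levels(unsigned int page_shift)
>     {
>         switch (page_shift) {
>         case 12: return 4;  /* 4KB leaf: L4+L3+L2+L1 */
>         case 21: return 3;  /* 2MB leaf: L4+L3+L2 */
>         case 30: return 2;  /* 1GB leaf: L4+L3 */
>         }
>         return -1;
>     }
> 
>     int main(void)
>     {
>         /* Scenarios A-D: guest page shift and p2m page shift. */
>         const unsigned int guest[] = { 21, 21, 12, 12 };
>         const unsigned int p2m[]   = { 21, 12, 21, 12 };
> 
>         for (int i = 0; i < 4; i++)
>             printf("%c) tlb miss ~ %d * %d = %d reads\n", 'A' + i,
>                    walk_levels(guest[i]), walk_levels(p2m[i]),
>                    walk_levels(guest[i]) * walk_levels(p2m[i]));
> 
>         /* A 1GB p2m entry drops the second factor to 2 (6 or 8 reads). */
>         return 0;
>     }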
> 
> So adding 1G pages to the p2m table shouldn't change expectations of the
> guest OS in any case.  Using it will benefit the guest to the same degree
> whether the guest is using 4k, 2MB, or 1G pages. (If I understand
> correctly.)
> 
>  -George
> 
> Dan Magenheimer wrote:
> > Hi Wei --
> >
> > I'm not worried about the overhead of the splintering, I'm
> > worried about the "hidden overhead" every time a "silent
> > splinter" is used.
> >
> > Let's assume three scenarios (and for now use 2MB pages though
> > the same concerns can be extended to 1GB and/or mixed 2MB/1GB):
> >
> > A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
> >    only 2MB pages (no splintering occurs)
> > B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
> >    only 4KB pages (because of fragmentation, all 2MB pages have
> >    been splintered)
> > C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides
> >    4KB pages
> >
> > Now run some benchmarks.  Clearly one would assume that A is
> > faster than both B and C.  The question is: Is B faster or slower
> > than C?
> >
> > If B is always faster than C, then I have less objection to
> > "silent splintering".  But if B is sometimes (or maybe always?)
> > slower than C, that's a big issue because a user has gone through
> > the effort of choosing a better-performing system configuration
> > for their software (2MB DB on 2MB OS), but it actually performs
> > worse than if they had chosen the "lower performing" configuration.
> > And, worse, it will likely degrade over time, so performance
> > might be fine when the 2MB-DB-on-2MB-OS guest is launched
> > but become much worse after it is paused, save/restored, migrated,
> > or hot-failed.  So even if B is only slightly faster than C,
> > if B is much slower than A, this is a problem.
> >
> > Does that make sense?
> >
> > Some suggestions:
> > 1) If it is possible for an administrator to determine how many
> >    large pages (both 2MB and 1GB) were requested by each domain
> >    and how many are currently whole-vs-splintered, that would help.
> > 2) We may need some form of memory defragmenter
> >
> >   
> >> -----Original Message-----
> >> From: Wei Huang [mailto:wei.huang2@xxxxxxx]
> >> Sent: Thursday, March 19, 2009 12:52 PM
> >> To: Dan Magenheimer
> >> Cc: George Dunlap; xen-devel@xxxxxxxxxxxxxxxxxxx;
> >> keir.fraser@xxxxxxxxxxxxx; Tim Deegan
> >> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>
> >>
> >> Dan,
> >>
> >> Thanks for your comments. I am not sure which splintering overhead
> >> you are referring to. I can think of three areas:
> >>
> >> 1. Splintering in page allocation
> >> In this case, Xen fails to allocate the requested page order, so it
> >> falls back to smaller pages to set up the p2m table. The overhead is
> >> O(guest_mem_size), which is a one-time cost.
> >>
> >> 2. P2M splits a large page into smaller pages
> >> This is one-directional because we don't merge smaller pages back into
> >> large ones. The worst case is splitting all guest large pages, so the
> >> overhead is O(total_large_page_mem). In the long run the overhead
> >> converges to 0 because it is one-directional. Note this overhead also
> >> covers the case when the PoD feature is enabled.
> >>
> >> 3. CPU splintering
> >> If the CPU does not support 1GB pages, it automatically splinters them
> >> into smaller ones (such as 2MB). In this case the overhead is always
> >> there. But 1) this only happens on a small number of older chips, and
> >> 2) I believe it is still faster than 4KB pages. CPUID (the 1GB page
> >> feature and 1GB TLB entries) can be used to detect and avoid this
> >> problem if we don't want it.
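> >>
> >> For what it's worth, a minimal user-space sketch of that CPUID check
> >> (leaf 0x80000001, EDX bit 26 is the 1GB-page feature flag; the separate
> >> 1GB TLB reporting is left out here):
> >>
> >>     #include <stdio.h>
> >>     #include <cpuid.h>
> >>
> >>     int main(void)
> >>     {
> >>         unsigned int eax, ebx, ecx, edx;
> >>
> >>         /* Extended leaf 0x80000001: EDX bit 26 == 1GB pages supported. */
> >>         if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
> >>             puts("extended CPUID leaf not available");
> >>             return 1;
> >>         }
> >>         printf("1GB pages %ssupported\n",
> >>                (edx & (1u << 26)) ? "" : "not ");
> >>         return 0;
> >>     }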
> >>
> >> I agree with your concerns. Customers should have the right to make
> >> their own decision, but that requires the new feature to be enabled in
> >> the first place. For a lot of benchmarks, the splintering overhead can
> >> be offset by the benefits of huge pages. SPECjbb is a good example of
> >> using large pages (see Ben Serebrin's presentation at Xen Summit). With
> >> that said, I agree with the idea of adding a new option in the guest
> >> config file.
> >>
> >> -Wei
> >>
> >>
> >> Dan Magenheimer wrote:
> >>     
> >>> I'd like to reiterate my argument raised in a previous
> >>> discussion of hugepages:  Just because this CAN be made
> >>> to work, doesn't imply that it SHOULD be made to work.
> >>> Real users use larger pages in their OS for the sole
> >>> reason that they expect a performance improvement.
> >>> If it magically works, but works slowly (and possibly
> >>> slower than if the OS had just used small pages to
> >>> start with), this is likely to lead to unsatisfied
> >>> customers, and perhaps allegations such as "Xen sucks
> >>> when running databases".
> >>>
> >>> So, please, let's think this through before implementing
> >>> it just because we can.  At a minimum, an administrator
> >>> should be somehow warned if large pages are getting splintered.
> >>>
> >>> And if it's going in over my objection, please tie it to
> >>> a boot option that defaults off so administrator action
> >>> is required to allow silent splintering.
> >>>
> >>> My two cents...
> >>> Dan
> >>>
> >>>       
> >>>> -----Original Message-----
> >>>> From: Huang2, Wei [mailto:Wei.Huang2@xxxxxxx]
> >>>> Sent: Thursday, March 19, 2009 2:07 AM
> >>>> To: George Dunlap
> >>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; 
> >>>> Tim Deegan
> >>>> Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>>>
> >>>>
> >>>> Here are patches using the middle approach. It handles 1GB pages in
> >>>> PoD by remapping 1GB with 2MB pages and retrying. I also added code
> >>>> for 1GB detection. Please comment.
> >>>>
> >>>> Thanks a lot,
> >>>>
> >>>> -Wei
> >>>>
> >>>> -----Original Message-----
> >>>> From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On Behalf Of
> >>>> George Dunlap
> >>>> Sent: Wednesday, March 18, 2009 12:20 PM
> >>>> To: Huang2, Wei
> >>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
> >>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>>>
> >>>> Thanks for doing this work, Wei -- especially all the extra effort
> >>>> for the PoD integration.
> >>>>
> >>>> One question: How well would you say you've tested the PoD
> >>>> functionality?  Or to put it the other way, how much do I need to
> >>>> prioritize testing this before the 3.4 release?
> >>>>
> >>>> It wouldn't be a bad idea to do as you suggested, and break things
> >>>> into 2 meg pages for the PoD case.  In order to take the best
> >>>> advantage of this in a PoD scenario, you'd need to have a balloon
> >>>> driver that could allocate 1G of contiguous *guest* p2m space, which
> >>>> seems a bit optimistic at this point...
> >>>>
> >>>>  -George
> >>>>
> >>>> 2009/3/18 Huang2, Wei <Wei.Huang2@xxxxxxx>:
> >>>>         
> >>>>> Current Xen supports 2MB super pages for NPT/EPT. The attached
> >>>>> patches extend this feature to support 1GB pages. The PoD
> >>>>> (populate-on-demand) introduced by George Dunlap made P2M
> >>>>> modification harder. I tried to preserve the existing PoD design by
> >>>>> introducing a 1GB PoD cache list.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Note that 1GB PoD can be dropped if we don't care about 1GB when
> >>>>> PoD is enabled. In this case, we can just split a 1GB PDPE into
> >>>>> 512x2MB PDE entries and grab pages from the PoD super list. That can
> >>>>> pretty much make 1gb_p2m_pod.patch go away.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Any comment/suggestion on design idea will be appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>>
> >>>>>
> >>>>> -Wei
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> The following is the description:
> >>>>>
> >>>>> === 1gb_tools.patch ===
> >>>>>
> >>>>> Extend existing setup_guest() function. Basically, it tries to
> >>>>> allocate 1GB pages whenever available. If this request fails, it
> >>>>> falls back to 2MB. If both fail, then 4KB pages will be used.
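> >>>>>
> >>>>> Roughly, the fallback ladder looks like this (xc_populate() below is
> >>>>> a hypothetical stand-in for the real allocation call, not the actual
> >>>>> libxc interface):
> >>>>>
> >>>>>     #include <stdbool.h>
> >>>>>     #include <stdio.h>
> >>>>>
> >>>>>     #define ORDER_4KB  0
> >>>>>     #define ORDER_2MB  9
> >>>>>     #define ORDER_1GB  18
> >>>>>
> >>>>>     /* Hypothetical allocator: pretend only 2MB and 4KB succeed. */
> >>>>>     static bool xc_populate(unsigned int order)
> >>>>>     {
> >>>>>         return order <= ORDER_2MB;
> >>>>>     }
> >>>>>
> >>>>>     int main(void)
> >>>>>     {
> >>>>>         const unsigned int orders[] = { ORDER_1GB, ORDER_2MB, ORDER_4KB };
> >>>>>
> >>>>>         for (int i = 0; i < 3; i++) {
> >>>>>             if (xc_populate(orders[i])) {
> >>>>>                 printf("populated with order-%u pages\n", orders[i]);
> >>>>>                 return 0;
> >>>>>             }
> >>>>>         }
> >>>>>         return 1;  /* order 0 should never fail in practice */
> >>>>>     }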
> >>>>>
> >>>>>
> >>>>>
> >>>>> === 1gb_p2m.patch ===
> >>>>>
> >>>>> * p2m_next_level()
> >>>>>
> >>>>> Check the PSE bit of the L3 page table entry. If 1GB is found
> >>>>> (PSE=1), we split the 1GB page into 512 2MB pages.
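> >>>>>
> >>>>> As a rough illustration of that split (the entry layout and flag
> >>>>> names below are simplified placeholders, not the real Xen
> >>>>> definitions):
> >>>>>
> >>>>>     #include <stdint.h>
> >>>>>     #include <stdio.h>
> >>>>>
> >>>>>     #define _PAGE_PRESENT  (1ULL << 0)
> >>>>>     #define _PAGE_PSE      (1ULL << 7)   /* superpage bit */
> >>>>>     #define PFN_SHIFT      12
> >>>>>
> >>>>>     /* Turn one 1GB L3 entry into 512 2MB L2 entries covering the
> >>>>>      * same machine range, keeping the low flag bits. */
> >>>>>     static void split_1gb_l3e(uint64_t l3e, uint64_t l2[512])
> >>>>>     {
> >>>>>         uint64_t mfn   = l3e >> PFN_SHIFT;
> >>>>>         uint64_t flags = l3e & 0xFFF;
> >>>>>
> >>>>>         for (int i = 0; i < 512; i++)
> >>>>>             l2[i] = ((mfn + (uint64_t)i * 512) << PFN_SHIFT)
> >>>>>                     | flags | _PAGE_PSE;
> >>>>>     }
> >>>>>
> >>>>>     int main(void)
> >>>>>     {
> >>>>>         uint64_t l2[512];
> >>>>>         uint64_t l3e = (0x100000ULL << PFN_SHIFT)
> >>>>>                        | _PAGE_PSE | _PAGE_PRESENT;
> >>>>>
> >>>>>         split_1gb_l3e(l3e, l2);
> >>>>>         printf("l2[511] maps mfn 0x%llx\n",
> >>>>>                (unsigned long long)(l2[511] >> PFN_SHIFT));
> >>>>>         return 0;
> >>>>>     }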
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_set_entry()
> >>>>>
> >>>>> Configure the PSE bit of L3 P2M table if page order == 18 (1GB).
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_gfn_to_mfn()
> >>>>>
> >>>>> Add support for the 1GB case when doing gfn to mfn translation.
> >>>>> When the L3 entry is marked as POPULATE_ON_DEMAND, we call
> >>>>> p2m_pod_demand_populate(). Otherwise, we do the regular address
> >>>>> translation (gfn ==> mfn).
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_gfn_to_mfn_current()
> >>>>>
> >>>>> This is similar to p2m_gfn_to_mfn(). When the L3 entry is marked as
> >>>>> POPULATE_ON_DEMAND, it demands a populate using
> >>>>> p2m_pod_demand_populate(). Otherwise, it does a normal translation.
> >>>>> The 1GB page case is taken into consideration.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * set_p2m_entry()
> >>>>>
> >>>>> Request 1GB page
> >>>>>
> >>>>>
> >>>>>
> >>>>> * audit_p2m()
> >>>>>
> >>>>> Support 1GB while auditing p2m table.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_change_type_global()
> >>>>>
> >>>>> Deal with 1GB page when changing global page type.
> >>>>>
> >>>>>
> >>>>>
> >>>>> === 1gb_p2m_pod.patch ===
> >>>>>
> >>>>> * xen/include/asm-x86/p2m.h
> >>>>>
> >>>>> Minor change to deal with PoD. It separates the super page cache
> >>>>> list into 2MB and 1GB lists. Similarly, we record the last gpfn of
> >>>>> sweeping for both 2MB and 1GB.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_cache_add()
> >>>>>
> >>>>> Check page order and add 1GB super page into PoD 1GB cache list.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_cache_get()
> >>>>>
> >>>>> Grab a page from the cache list. It tries to break a 1GB page into
> >>>>> 512 2MB pages if the 2MB PoD list is empty. Similarly, 4KB pages can
> >>>>> be requested from super pages. The breaking order is 2MB then 1GB.
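> >>>>>
> >>>>> Sketching that breaking logic with plain counters instead of the
> >>>>> real page lists (the names below are illustrative only, not the
> >>>>> actual Xen code):
> >>>>>
> >>>>>     #include <stdio.h>
> >>>>>
> >>>>>     static unsigned long nr_1gb = 1, nr_2mb = 0, nr_4kb = 0;
> >>>>>
> >>>>>     /* Break one cached 1GB page into 512 cached 2MB pages. */
> >>>>>     static int break_1gb(void)
> >>>>>     {
> >>>>>         if (!nr_1gb) return 0;
> >>>>>         nr_1gb--; nr_2mb += 512;
> >>>>>         return 1;
> >>>>>     }
> >>>>>
> >>>>>     /* Break one cached 2MB page into 512 cached 4KB pages,
> >>>>>      * breaking a 1GB page first if the 2MB list is empty. */
> >>>>>     static int break_2mb(void)
> >>>>>     {
> >>>>>         if (!nr_2mb && !break_1gb()) return 0;
> >>>>>         nr_2mb--; nr_4kb += 512;
> >>>>>         return 1;
> >>>>>     }
> >>>>>
> >>>>>     static int pod_cache_get(unsigned int order)
> >>>>>     {
> >>>>>         switch (order) {
> >>>>>         case 18: return nr_1gb ? (nr_1gb--, 1) : 0;
> >>>>>         case 9:  return (nr_2mb || break_1gb()) ? (nr_2mb--, 1) : 0;
> >>>>>         case 0:  return (nr_4kb || break_2mb()) ? (nr_4kb--, 1) : 0;
> >>>>>         }
> >>>>>         return 0;
> >>>>>     }
> >>>>>
> >>>>>     int main(void)
> >>>>>     {
> >>>>>         /* A 4KB request with only a 1GB page cached: the 2MB list
> >>>>>          * is tried first, then the 1GB page is broken down. */
> >>>>>         int got = pod_cache_get(0);
> >>>>>
> >>>>>         printf("got=%d  left: %lu x 1GB, %lu x 2MB, %lu x 4KB\n",
> >>>>>                got, nr_1gb, nr_2mb, nr_4kb);
> >>>>>         return 0;
> >>>>>     }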
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_cache_target()
> >>>>>
> >>>>> This function is used to set the PoD cache size. To increase the
> >>>>> PoD target, we try to allocate 1GB pages from the Xen domheap. If
> >>>>> this fails, we try 2MB. If both fail, we try 4KB, which is
> >>>>> guaranteed to work.
> >>>>>
> >>>>>
> >>>>>
> >>>>> To decrease the target, we use a similar approach. We first try to
> >>>>> free 1GB pages from the 1GB PoD cache list. If that fails, we try
> >>>>> the 2MB PoD cache list. If both fail, we try the 4KB list.
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_zero_check_superpage_1gb()
> >>>>>
> >>>>> This adds a new function to check for a 1GB page. This function is
> >>>>> similar to p2m_pod_zero_check_superpage_2mb().
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_zero_check_superpage_1gb()
> >>>>>
> >>>>> We add a new function to sweep 1GB pages from guest memory. This is
> >>>>> the same as p2m_pod_zero_check_superpage_2mb().
> >>>>>
> >>>>>
> >>>>>
> >>>>> * p2m_pod_demand_populate()
> >>>>>
> >>>>> The trick of this function is to do remap_and_retry if
> >>>>> p2m_pod_cache_get() fails. When p2m_pod_cache_get() fails, this
> >>>>> function splits the p2m table entry into smaller ones (e.g. 1GB ==>
> >>>>> 2MB or 2MB ==> 4KB). That guarantees populate demands always work.
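> >>>>>
> >>>>> Roughly, the remap-and-retry flow (the helpers below are hypothetical
> >>>>> stand-ins; the real code returns so the guest refaults on the smaller
> >>>>> entry rather than looping in place like this):
> >>>>>
> >>>>>     #include <stdio.h>
> >>>>>
> >>>>>     /* Pretend the PoD cache can only satisfy 4KB requests. */
> >>>>>     static int pod_cache_get(unsigned int order) { return order == 0; }
> >>>>>
> >>>>>     /* Hypothetical: rewrite the faulting p2m entry at the next
> >>>>>      * smaller order (1GB -> 2MB -> 4KB). */
> >>>>>     static unsigned int remap_at_smaller_order(unsigned int order)
> >>>>>     {
> >>>>>         return (order == 18) ? 9 : 0;
> >>>>>     }
> >>>>>
> >>>>>     static int demand_populate(unsigned int order)
> >>>>>     {
> >>>>>         for (;;) {
> >>>>>             if (pod_cache_get(order)) {
> >>>>>                 printf("populated at order %u\n", order);
> >>>>>                 return 0;
> >>>>>             }
> >>>>>             if (order == 0)
> >>>>>                 return -1;  /* truly out of memory */
> >>>>>             order = remap_at_smaller_order(order);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     int main(void)
> >>>>>     {
> >>>>>         return demand_populate(18) ? 1 : 0;
> >>>>>     }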
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Xen-devel mailing list
> >>>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>>> http://lists.xensource.com/xen-devel
> >>>>>
> >>>>>
> >>>>>           
> >>     
> 
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

