Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Dan,

Thanks for your comments. I am not sure which splintering overhead you are referring to. I can think of three areas:


1. splintering in page allocation

In this case, Xen fails to allocate the requested page order, so it falls back to smaller pages to set up the p2m table. The overhead is O(guest_mem_size), which is a one-time cost.


2. P2M splits large page into smaller pages

This is one-directional because we don't merge smaller pages into large ones. The worst case is to split all guest large pages, so the overhead is O(total_large_page_mem). In the long run, the overhead will converge to 0 because it is one-directional. Note this overhead also covers the case when the PoD feature is enabled.


3. CPU splintering

If the CPU does not support 1GB pages, it automatically does splintering using smaller ones (such as 2MB). In this case, the overhead is always there. But 1) this only happens on a small number of old chips; 2) I believe that it is still faster than 4K pages. CPUID (1GB feature and 1GB TLB entries) can be used to detect and stop this problem, if we don't really like it.
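As a rough illustration of the CPUID check mentioned in point 3, here is a minimal Python sketch. It assumes the caller has already read EDX from CPUID leaf 0x80000001, where bit 26 reports 1GB page support (the flag Linux shows as "pdpe1gb"). The function names and the order-selection policy are hypothetical and are not taken from the patches; the 1GB TLB entry check is not modelled.

```python
PAGE1GB_BIT = 26  # CPUID leaf 0x80000001, EDX bit 26 (1GB page support)


def has_1gb_pages(edx_80000001: int) -> bool:
    """Return True if the given EDX value advertises 1GB pages."""
    return bool((edx_80000001 >> PAGE1GB_BIT) & 1)


def max_p2m_order(edx_80000001: int) -> int:
    """Hypothetical policy: allow order-18 (1GB) mappings only when the
    hardware reports support, otherwise cap at order 9 (2MB)."""
    return 18 if has_1gb_pages(edx_80000001) else 9
```

With a policy like this, a host without the feature bit would never be handed 1GB mappings, so the always-present CPU splintering case would not arise.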

I agree with your concerns. Customers should have the right to make their own decision. But that requires the new feature to be enabled in the first place. For a lot of benchmarks, splintering overhead can be offset by the benefits of huge pages. SPECJBB is a good example of using large pages (see Ben Serebrin's presentation at Xen Summit). With that said, I agree with the idea of adding a new option in the guest configuration file.


-Wei


Dan Magenheimer wrote:

I'd like to reiterate my argument raised in a previous
discussion of hugepages:  Just because this CAN be made
to work, doesn't imply that it SHOULD be made to work.
Real users use larger pages in their OS for the sole
reason that they expect a performance improvement.
If it magically works, but works slow (and possibly
slower than if the OS had just used small pages to
start with), this is likely to lead to unsatisfied
customers, and perhaps allegations such as "Xen sucks
when running databases".

So, please, let's think this through before implementing
it just because we can.  At a minimum, an administrator
should be somehow warned if large pages are getting splintered.

And if it's going in over my objection, please tie it to
a boot option that defaults off so administrator action
is required to allow silent splintering.

My two cents...
Dan

-----Original Message-----
From: Huang2, Wei [mailto:Wei.Huang2@xxxxxxx]
Sent: Thursday, March 19, 2009 2:07 AM
To: George Dunlap
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Here are patches using the middle approach. It handles 1GB pages in PoD
by remapping 1GB with 2MB pages & retry. I also added code for 1GB
detection. Please comment.

Thanks a lot,

-Wei

-----Original Message-----
From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On Behalf Of George Dunlap
Sent: Wednesday, March 18, 2009 12:20 PM
To: Huang2, Wei
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Thanks for doing this work, Wei -- especially all the extra effort for
the PoD integration.

One question: How well would you say you've tested the PoD
functionality?  Or to put it the other way, how much do I need to
prioritize testing this before the 3.4 release?

It wouldn't be a bad idea to do as you suggested, and break things
into 2 meg pages for the PoD case.  In order to take the best
advantage of this in a PoD scenario, you'd need to have a balloon
driver that could allocate 1G of contiguous *guest* p2m space, which
seems a bit optimistic at this point...

 -George

2009/3/18 Huang2, Wei <Wei.Huang2@xxxxxxx>:

Current Xen supports 2MB super pages for NPT/EPT. The attached patches
extend this feature to support 1GB pages. The PoD (populate-on-demand)
introduced by George Dunlap made P2M modification harder. I tried to
preserve existing PoD design by introducing a 1GB PoD cache list.

Note that 1GB PoD can be dropped if we don't care about 1GB when PoD is
enabled. In this case, we can just split 1GB PDPE into 512x2MB PDE entries
and grab pages from PoD super list. That can pretty much make
1gb_p2m_pod.patch go away.

Any comment/suggestion on design idea will be appreciated.

Thanks,

-Wei





The following is the description:

=== 1gb_tools.patch ===

Extend existing setup_guest() function. Basically, it tries to allocate 1GB
pages whenever available. If this request fails, it falls back to 2MB. If
both fail, then 4KB pages will be used.
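The fallback order described above can be modelled roughly as follows. This is a Python sketch, not the setup_guest() code: try_alloc() is a hypothetical stand-in for the real allocation call, and alignment constraints are ignored.

```python
ORDERS = (18, 9, 0)  # 1GB, 2MB, 4KB, expressed as powers of two of 4KB frames


def populate(total_frames, try_alloc):
    """Cover total_frames 4KB frames, preferring the largest order that works.

    try_alloc(order) -> bool is a hypothetical allocator hook.
    Returns the list of orders that were allocated, in sequence.
    """
    allocated = []
    remaining = total_frames
    while remaining > 0:
        for order in ORDERS:
            # Only attempt an order that still fits in what is left.
            if (1 << order) <= remaining and try_alloc(order):
                allocated.append(order)
                remaining -= 1 << order
                break
        else:
            raise MemoryError("allocation failed even for 4KB pages")
    return allocated
```

For example, if 1GB requests always fail, a guest of 1GB plus three frames ends up with 512 2MB pages and three 4KB pages.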



=== 1gb_p2m.patch ===

* p2m_next_level()

Check PSE bit of L3 page table entry. If 1GB is found (PSE=1), we split 1GB
into 512 2MB pages.
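A small Python model of the split may help: one 1GB entry becomes 512 2MB entries that cover exactly the same machine frames. The tuple layout and the function name are illustrative assumptions, not the p2m code.

```python
ENTRIES_PER_TABLE = 512
ORDER_2MB = 9  # a 2MB page spans 2**9 4KB frames


def split_1gb_entry(base_mfn):
    """Return the 512 (mfn, order) entries replacing one 1GB mapping
    whose first machine frame is base_mfn."""
    step = 1 << ORDER_2MB
    return [(base_mfn + i * step, ORDER_2MB) for i in range(ENTRIES_PER_TABLE)]
```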



* p2m_set_entry()

Configure the PSE bit of L3 P2M table if page order == 18 (1GB).
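For reference, the order value counts 4KB frames as a power of two, so order 18 corresponds to 1GB and order 9 to 2MB. A quick Python check of that arithmetic (the helper name is made up for illustration):

```python
PAGE_SHIFT = 12  # 4KB base pages on x86


def order_to_bytes(order):
    """Size in bytes of a page of the given order (2**order 4KB frames)."""
    return 1 << (PAGE_SHIFT + order)
```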



* p2m_gfn_to_mfn()

Add support for 1GB case when doing gfn to mfn translation. When L3 entry is
marked as POPULATE_ON_DEMAND, we call p2m_pod_demand_populate(). Otherwise,
we do the regular address translation (gfn ==> mfn).
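The 1GB lookup can be sketched in Python as below. The dictionary layout, the POD marker value and the demand_populate callback are assumptions made for illustration; only the offset arithmetic (the low 18 bits of the gfn select the 4KB frame inside the 1GB page) follows from the page order.

```python
ORDER_1GB = 18
POD = "populate_on_demand"  # hypothetical stand-in for the PoD entry type


def translate_l3(gfn, entry, demand_populate):
    """Translate gfn through a 1GB L3 entry.

    entry is a hypothetical dict {"type": ..., "mfn": first frame of the page}.
    demand_populate(gfn) must return a populated entry of the same shape.
    """
    if entry["type"] == POD:
        entry = demand_populate(gfn)
    offset = gfn & ((1 << ORDER_1GB) - 1)  # frame index within the 1GB page
    return entry["mfn"] + offset
```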



* p2m_gfn_to_mfn_current()

This is similar to p2m_gfn_to_mfn(). When L3 entry is marked as
POPULATE_ON_DEMAND, it demands a populate using p2m_pod_demand_populate().
Otherwise, it does a normal translation. 1GB page is taken into
consideration.



* set_p2m_entry()

Request 1GB page



* audit_p2m()

Support 1GB while auditing p2m table.



* p2m_change_type_global()

Deal with 1GB page when changing global page type.



=== 1gb_p2m_pod.patch ===

* xen/include/asm-x86/p2m.h

Minor change to deal with PoD. It separates super page cache list into 2MB
and 1GB lists. Similarly, we record last gpfn of sweeping for both 2MB and
1GB.



* p2m_pod_cache_add()

Check page order and add 1GB super page into PoD 1GB cache list.



* p2m_pod_cache_get()

Grab a page from cache list. It tries to break 1GB page into 512 2MB pages
if 2MB PoD list is empty. Similarly, 4KB can be requested from super pages.
The breaking order is 2MB then 1GB.
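The breaking order can be modelled with three lists, as in the Python sketch below. The class and its fields are hypothetical and only mirror the behaviour described here: a 4KB request first tries to break a 2MB page, and only if the 2MB list is empty is a 1GB page broken.

```python
class PodCache:
    """Toy model of a PoD cache with separate 1GB, 2MB and 4KB lists."""

    NEXT_LARGER = {0: 9, 9: 18}

    def __init__(self):
        self.lists = {18: [], 9: [], 0: []}  # order -> list of base MFNs

    def add(self, mfn, order):
        self.lists[order].append(mfn)

    def get(self, order):
        """Return the base MFN of a page of this order, or None if empty."""
        if self.lists[order]:
            return self.lists[order].pop()
        larger = self.NEXT_LARGER.get(order)
        if larger is None:
            return None
        base = self.get(larger)  # may in turn break an even larger page
        if base is None:
            return None
        step = 1 << order
        # Hand out the first piece and cache the remaining 511.
        for i in range(1, 512):
            self.lists[order].append(base + i * step)
        return base
```

Requesting a 4KB page from a cache that holds a single 1GB page leaves 511 entries on the 2MB list and 511 entries on the 4KB list.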



* p2m_pod_cache_target()

This function is used to set PoD cache size. To increase PoD target, we try
to allocate 1GB from xen domheap. If this fails, we try 2MB. If both fail,
we try 4KB which is guaranteed to work.

To decrease the target, we use a similar approach. We first try to free 1GB
pages from 1GB PoD cache list. If such request fails, we try 2MB PoD cache
list. If both fail, we try 4KB list.
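The shrink direction might look like the following Python sketch, which frees whole pages starting from the largest order. The cache is modelled as a simple count per order, and splitting a cached superpage in order to free only part of it is deliberately left out; both are simplifying assumptions.

```python
ORDERS = (18, 9, 0)  # 1GB, 2MB, 4KB


def shrink_cache(cache, frames_to_free):
    """Free up to frames_to_free 4KB frames from the cache.

    cache maps order -> number of cached pages of that order and is
    updated in place. Returns the number of frames actually freed.
    """
    freed = 0
    for order in ORDERS:
        size = 1 << order
        # Free pages of this order while doing so does not overshoot.
        while cache[order] > 0 and freed + size <= frames_to_free:
            cache[order] -= 1
            freed += size
    return freed
```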



* p2m_pod_zero_check_superpage_1gb()

This adds a new function to check for 1GB page. This function is similar to
p2m_pod_zero_check_superpage_2mb().



* p2m_pod_zero_check_superpage_1gb()
We add a new function to sweep 1GB page from guest memory. This is the same
as p2m_pod_zero_check_superpage_2mb().



* p2m_pod_demand_populate()

The trick of this function is to do remap_and_retry if p2m_pod_cache_get()
fails. When p2m_pod_cache_get() fails, this function splits the p2m table
entry into smaller ones (e.g. 1GB ==> 2MB or 2MB ==> 4KB). That guarantees
that populate demands always work.
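A minimal Python model of the remap-and-retry idea follows. cache_get() is a hypothetical callback standing in for the PoD cache lookup, and the model assumes the cache holds at least one 4KB page when it is not completely exhausted.

```python
SMALLER = {18: 9, 9: 0}  # 1GB ==> 2MB, 2MB ==> 4KB


def demand_populate(order, cache_get):
    """Populate an entry of the given order, retrying with smaller pages.

    cache_get(order) returns a page or None (hypothetical hook).
    Returns (page, order_used).
    """
    while True:
        page = cache_get(order)
        if page is not None:
            return page, order
        if order not in SMALLER:
            raise MemoryError("PoD cache exhausted")
        # Remap the entry with the next smaller page size, then retry.
        order = SMALLER[order]
```

A 1GB demand against a cache that only holds 4KB pages therefore succeeds at order 0 after two remaps.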





_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


