
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special



On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
> On 17.07.25 20:29, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> > > The huge zero folio is refcounted (+mapcounted -- is that a word?)
> > > differently than "normal" folios, similarly (but different) to the
> > > ordinary shared zeropage.
> >
> > Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> > pages?
>
> I wish we could get rid of the weird refcounting of the huge zero folio and
> get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
> weird per-process refcounting must stay.

Does this change of yours cause any issue with it? I mean, now nothing can grab
this page using vm_normal_page_pmd(), so it won't be able to manipulate
refcounts.

Does this impact the shrinker at all? I haven't traced it all through.

>
> >
> > But for some reason the huge zero page wants to exist or not exist based on
> > usage for one. Which is stupid to me.
>
> Yes, I will try at some point (once we have the static huge zero folio) to
> remove the shrinker part and make it always static. Well, at least for
> reasonable architectures.

Yes, this seems the correct way to move forward.

And we need to get rid of (or neuter) the 'unreasonable' architectures...

>
> >
> > >
> > > For this reason, we special-case these pages in
> > > vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> > > still use them (e.g., GUP can still take a reference on them).
> > >
> > > vm_normal_page_pmd() already filters out the huge zero folio. However,
> > > so far we are not marking it as special like we do with the ordinary
> > > shared zeropage. Let's mark it as special, so we can further refactor
> > > vm_normal_page_pmd() and vm_normal_page().
> > >
> > > While at it, update the doc regarding the shared zero folios.
> >
> > Hmm I wonder how this will interact with the static PMD series at [0]?
>
> No, it shouldn't.

I'm always nervous about these kinds of things :)

I'm assuming the reference/map counting will still work properly with the static
page?

I think I raised that in that series :P

>
> >
> > I wonder if more use of that might result in some weirdness with refcounting
> > etc.?
>
> I don't think so.

Good :>)

>
> >
> > Also, that series was (though I reviewed against it) moving stuff that
> > references the huge zero folio out of there, but also generally allows
> > access and mapping of this folio via largest_zero_folio() so not only via
> > insert_pmd().
> >
> > So we're going to end up with mappings of this that are not marked special
> > that are potentially going to have refcount/mapcount manipulation that
> > contradict what you're doing here perhaps?
>
> I don't think so. It's just like having the existing static (small) shared
> zeropage where the same rules about refcounting+mapcounting apply.

It feels like all this is a mess... am I missing something that would make it
all make sense?

It's not sane to make zero pages disappear based on lack of use in 2025. Sorry
if that little bit of RAM hurts you -- use a microcontroller... and once that's
gone, it doesn't really mean anything to have ref/map counts...

>
> >
> > [0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@xxxxxxxxxxxxxxxx/
> >
> > >
> > > Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>
> > > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> >
> > I looked through the places that use vm_normal_page_pmd() (other than the
> > decl of the function):
> >
> > fs/proc/task_mmu.c - seems to handle a NULL page correctly + still
> > understands the zero page
> > mm/pagewalk.c - correctly handles a NULL page + the huge zero page
> > mm/huge_memory.c - can_change_pmd_writable() correctly returns false.
> >
> > And all seems to work with this change.
> >
> > Overall, other than the concerns above + nits below, LGTM -- we should
> > treat all the zero folios the same in this regard, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
>
> Thanks!

No problem! Thanks for the cleanups, these are great...

>
> >
> > > ---
> > >   mm/huge_memory.c |  5 ++++-
> > >   mm/memory.c      | 14 +++++++++-----
> > >   2 files changed, 13 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index db08c37b87077..3f9a27812a590 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> > >   {
> > >           pmd_t entry;
> > >           entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> > > +         entry = pmd_mkspecial(entry);
> > >           pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > >           set_pmd_at(mm, haddr, pmd, entry);
> > >           mm_inc_nr_ptes(mm);
> > > @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >           if (fop.is_folio) {
> > >                   entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
> > >
> > > -                 if (!is_huge_zero_folio(fop.folio)) {
> > > +                 if (is_huge_zero_folio(fop.folio)) {
> > > +                         entry = pmd_mkspecial(entry);
> > > +                 } else {
> > >                           folio_get(fop.folio);
> > >                           folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> > >                           add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 92fd18a5d8d1f..173eb6267e0ac 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >    *
> > >    * "Special" mappings do not wish to be associated with a "struct page" (either
> > >    * it doesn't exist, or it exists but they don't want to touch it). In this
> > > - * case, NULL is returned here. "Normal" mappings do have a struct page.
> > > + * case, NULL is returned here. "Normal" mappings do have a struct page and
> > > + * are ordinarily refcounted.
> > > + *
> > > + * Page mappings of the shared zero folios are always considered "special", as
> > > + * they are not ordinarily refcounted. However, selected page table walkers
> > > + * (such as GUP) can still identify these mappings and work with the
> > > + * underlying "struct page".
> >
> > I feel like we need more detail or something more explicit about what 'not
> > ordinary' refcounting constitutes. This is a bit vague.
>
> Hm, I am not sure this is the correct place to document that. But let me see
> if I can come up with something reasonable
>
> (like, the refcount and mapcount of these folios is never adjusted when
> mapping them into page tables)

I think having _something_ is good even if perhaps not ideally situated... :)

>
> >
> > >    *
> > >    * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> > >    * pte bit, in which case this function is trivial. Secondly, an architecture
> > > @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >    *
> > >    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
> > >    * page" backing, however the difference is that _all_ pages with a struct
> > > - * page (that is, those where pfn_valid is true) are refcounted and considered
> > > - * normal pages by the VM. The only exception are zeropages, which are
> > > - * *never* refcounted.
> > > + * page (that is, those where pfn_valid is true, except the shared zero
> > > + * folios) are refcounted and considered normal pages by the VM.
> > >    *
> > >    * The disadvantage is that pages are refcounted (which can be slower and
> > >    * simply not an option for some PFNMAP users). The advantage is that we
> > > @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> >
> > You know, I'm semi-ashamed to admit I didn't even know this function
> > existed. But yikes that we have a separate function like this just for PMDs.
>
> It's a bit new-ish :)

OK I feel less bad about it then... -ish ;)

>
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
