
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()



On Thu, Jul 17, 2025 at 01:52:09PM +0200, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed a light).
>
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
>
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function -- we'll rename it to what
> we actually print: bad (page) mappings. Maybe it should be called
> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
> because the assumption is that we are dealing with some (previously)
> present page table entry that got corrupted in weird ways.
>
> Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future,
> we could pass the entry level type similar to "enum rmap_level". For now,
> let's keep it simple.
>
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the dumping of page
> table content into __dump_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
>
> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
> for vm_normal_page(), now that we have a function that can handle it.
>
> The report will now look something like (dumping pgd to pmd values):
>
> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Not using pgdp_get(), because that does not work properly on some arm
> configs where pgd_t is an array. Note that we are dumping all levels
> even when levels are folded for simplicity.

Oh god. I reviewed this below. BUT OH GOD. What. Why???

>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
>  mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 94 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 173eb6267e0ac..08d16ed7b4cc7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>                       add_mm_counter(mm, i, rss[i]);
>  }
>
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> -                       pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
>  {
> -     pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> -     p4d_t *p4d = p4d_offset(pgd, addr);
> -     pud_t *pud = pud_offset(p4d, addr);
> -     pmd_t *pmd = pmd_offset(pud, addr);
> -     struct address_space *mapping;
> -     pgoff_t index;
>       static unsigned long resume;
>       static unsigned long nr_shown;
>       static unsigned long nr_unshown;
> @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>       if (nr_shown == 60) {
>               if (time_before(jiffies, resume)) {
>                       nr_unshown++;
> -                     return;
> +                     return true;
>               }
>               if (nr_unshown) {
>                       pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>       }
>       if (nr_shown++ == 0)
>               resume = jiffies + 60 * HZ;
> +     return false;
> +}
> +
> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> +     unsigned long long pgdv, p4dv, pudv, pmdv;
> +     p4d_t p4d, *p4dp;
> +     pud_t pud, *pudp;
> +     pmd_t pmd, *pmdp;
> +     pgd_t *pgdp;
> +
> +     /*
> +      * This looks like a fully lockless walk, however, the caller is
> +      * expected to hold the leaf page table lock in addition to other
> +      * rmap/mm/vma locks. So this is just a re-walk to dump page table
> +      * content while any concurrent modifications should be completely
> +      * prevented.
> +      */

Hmmm :)

Why aren't we trying to lock at leaf level?

We need to:

- Keep VMA stable which prevents unmap page table teardown and khugepaged
  collapse.
- (not relevant here, as we don't traverse the PTE table, but) RCU lock for
  PTE entries to avoid MADV_DONTNEED page table withdrawal.

Buuut if we're not locking at leaf level, we leave ourselves open to racing
faults, zaps, etc. etc.

So perhaps this is why you require such strict conditions...

But can you truly be sure these conditions hold? And shouldn't we then assert
them here, along the lines of the sketch below? For rmap, though, we'd need
the folio/vma.
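
Not a concrete proposal, just to illustrate the kind of assert I mean (the
"ptl" parameter is made up here -- the caller would have to pass in whatever
leaf page table lock it actually holds, and rmap-only callers wouldn't be
covered by the mmap assert at all):

	static void __dump_bad_page_map_pgtable(struct mm_struct *mm,
			unsigned long addr, spinlock_t *ptl)
	{
		/* Caller promises the re-walk cannot race with modifications. */
		mmap_assert_locked(mm);
		lockdep_assert_held(ptl);
		...
	}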

> +     pgdp = pgd_offset(mm, addr);
> +     pgdv = pgd_val(*pgdp);

Before I went and looked again at the commit msg I said:

        "Shoudln't we strictly speaking use pgdp_get()? I see you use this
         helper for other levels."

But obviously yeah. You explained the insane reason why not.

> +
> +     if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
> +             pr_alert("pgd:%08llx\n", pgdv);
> +             return;
> +     }
> +
> +     p4dp = p4d_offset(pgdp, addr);
> +     p4d = p4dp_get(p4dp);
> +     p4dv = p4d_val(p4d);
> +
> +     if (!p4d_present(p4d) || p4d_leaf(p4d)) {
> +             pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
> +             return;
> +     }
> +
> +     pudp = pud_offset(p4dp, addr);
> +     pud = pudp_get(pudp);
> +     pudv = pud_val(pud);
> +
> +     if (!pud_present(pud) || pud_leaf(pud)) {
> +             pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, 
> pudv);
> +             return;
> +     }
> +
> +     pmdp = pmd_offset(pudp, addr);
> +     pmd = pmdp_get(pmdp);
> +     pmdv = pmd_val(pmd);
> +
> +     /*
> +      * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
> +      * because the table should already be mapped by the caller and
> +      * doing another map would be bad. print_bad_page_map() should
> +      * already take care of printing the PTE.
> +      */

I hate 32-bit kernels.

> +     pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
> +              p4dv, pudv, pmdv);
> +}
> +
> +/*
> + * This function is called to print an error when a bad page table entry (e.g.,
> + * corrupted page table entry) is found. For example, we might have a
> + * PFN-mapped pte in a region that doesn't allow it.
> + *
> + * The calling function must still handle the error.
> + */

We have extremely strict locking conditions for the page table traversal... but
no mention of them here?
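
Even just spelling them out in the comment would help; something along these
lines (wording approximate, of course):

	/*
	 * This function is called to print an error when a bad page table
	 * entry (e.g., a corrupted page table entry) is found. For example,
	 * we might have a PFN-mapped pte in a region that doesn't allow it.
	 *
	 * The caller must hold the leaf page table lock for the corrupted
	 * entry and keep the VMA stable (mmap/VMA lock or rmap lock), so the
	 * page table re-walk in __dump_bad_page_map_pgtable() cannot race
	 * with concurrent modifications.
	 *
	 * The calling function must still handle the error.
	 */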

> +static void print_bad_page_map(struct vm_area_struct *vma,
> +             unsigned long addr, unsigned long long entry, struct page *page)
> +{
> +     struct address_space *mapping;
> +     pgoff_t index;
> +
> +     if (is_bad_page_map_ratelimited())
> +             return;
>
>       mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>       index = linear_page_index(vma, addr);
>
> -     pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> -              current->comm,
> -              (long long)pte_val(pte), (long long)pmd_val(*pmd));
> +     pr_alert("BUG: Bad page map in process %s  entry:%08llx", 
> current->comm, entry);

Sort of wonder if this is even useful if you don't know what the 'entry'
is? But I guess the dump below will tell you.

Though maybe it's actually useful to see the flags etc. in case some horrid
corruption happened and the dump isn't valid? But then the dump assumes
strict conditions to work, so... can that happen?

> +     __dump_bad_page_map_pgtable(vma->vm_mm, addr);
>       if (page)
> -             dump_page(page, "bad pte");
> +             dump_page(page, "bad page map");
>       pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
>                (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>       pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps 
> read_folio:%ps\n",
> @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>               if (is_zero_pfn(pfn))
>                       return NULL;
>
> -             print_bad_pte(vma, addr, pte, NULL);
> +             print_bad_page_map(vma, addr, pte_val(pte), NULL);
>               return NULL;
>       }
>
> @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>
>  check_pfn:
>       if (unlikely(pfn > highest_memmap_pfn)) {
> -             print_bad_pte(vma, addr, pte, NULL);
> +             print_bad_page_map(vma, addr, pte_val(pte), NULL);

This is unrelated to your series, but I guess this is for cases where
you're e.g. iomapping or such? So it's not something in the memmap, but it's
a PFN that might reference IO memory or such?

>               return NULL;
>       }
>
> @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  {
>       unsigned long pfn = pmd_pfn(pmd);
>
> -     if (unlikely(pmd_special(pmd)))
> +     if (unlikely(pmd_special(pmd))) {
> +             if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> +                     return NULL;

I guess we'll bring this all together with vm_normal_page() in a later patch,
as it's getting a little duplicative :P

Makes me think that VM_SPECIAL is kind of badly named (beyond the fact that
'special' is nebulous and overloaded in general), in that it contains stuff
that is *VMA*-special, but only VM_PFNMAP | VM_MIXEDMAP really indicates
specialness w.r.t. the underlying folio.
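
For reference, this is roughly how VM_SPECIAL is defined today (quoting
include/linux/mm.h from memory, so double-check me):

	#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

i.e. VM_DONTEXPAND and VM_IO are in there too, even though they say nothing
about whether a normal folio backs the mapping.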

Then we have VM_IO, which strictly must not have an associated page, right?
It's the odd one out, and I wonder if it's somehow redundant.

Anyway stuff to think about...

> +             if (is_huge_zero_pfn(pfn))
> +                     return NULL;
> +
> +             print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>               return NULL;
> +     }
>
>       if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>               if (vma->vm_flags & VM_MIXEDMAP) {
> @@ -674,8 +739,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>
>       if (is_huge_zero_pfn(pfn))
>               return NULL;
> -     if (unlikely(pfn > highest_memmap_pfn))
> +     if (unlikely(pfn > highest_memmap_pfn)) {
> +             print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>               return NULL;
> +     }
>
>       /*
>        * NOTE! We still have PageReserved() pages in the page tables.
> @@ -1509,7 +1576,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>               folio_remove_rmap_ptes(folio, page, nr, vma);
>
>               if (unlikely(folio_mapcount(folio) < 0))
> -                     print_bad_pte(vma, addr, ptent, page);
> +                     print_bad_page_map(vma, addr, pte_val(ptent), page);
>       }
>       if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>               *force_flush = true;
> @@ -4507,7 +4574,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>               } else if (is_pte_marker_entry(entry)) {
>                       ret = handle_pte_marker(vmf);
>               } else {
> -                     print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
> +                     print_bad_page_map(vma, vmf->address,
> +                                        pte_val(vmf->orig_pte), NULL);
>                       ret = VM_FAULT_SIGBUS;
>               }
>               goto out;
> --
> 2.50.1
>