All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oscar Salvador <osalvador@suse.de>
To: David Hildenbrand <david@redhat.com>
Cc: lsf-pc@lists.linux-foundation.org, Peter Xu <peterx@redhat.com>,
	Muchun Song <muchun.song@linux.dev>,
	linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
Date: Mon, 3 Feb 2025 11:10:56 +0100	[thread overview]
Message-ID: <Z6CWMBKlN0-RHI2P@localhost.localdomain> (raw)
In-Reply-To: <4c50a439-e2b8-4f54-ba3d-366d0e2961b2@redhat.com>

On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
> On 30.01.25 22:36, Oscar Salvador wrote:
> Hi Oscar,

Hi David

> > 
> > HugeTLB has its own way of dealing with things.
> > E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> > huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
> > 
> > One of the challenges that this raises is that if we want pmd/pud walkers to
> > be able to make sense of hugetlb stuff, we need to implement pud/pmd
> > (maybe some pmd we already have because of THP) variants of those.
> 
> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
> abstracted by hugetlb to be a single unit ("hugetlb pte").

Yes, that is a PITA to be honest.
E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
assumption.
I was checking how we could make cont-{pmd,pte} work for hugetlb in that
area (just because it happens to be the first are I am trying to convert).
E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.

I came up with something like the following (completely untested, just a
PoC) to make smaps_pte_range work for cont-ptes.
We would also need to change all the references of PAGE_SIZE to the
actual size of the folio (if there are).

But looking closer, folio_pte_batch is only declared in mm/internal.h,
so I am not sure I can make use of that outside mm/ realm?
If not, how could we work that around?

 diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
 index 193dd5f91fbf..21fb122ebcac 100644
 --- a/fs/proc/task_mmu.c
 +++ b/fs/proc/task_mmu.c
 @@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
  #endif
 
  static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 -		struct mm_walk *walk)
 +		struct mm_walk *walk, int nr_batch_entries)
  {
  	struct mem_size_stats *mss = walk->private;
  	struct vm_area_struct *vma = walk->vma;
 @@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
  	if (pte_present(ptent)) {
  		page = vm_normal_page(vma, addr, ptent);
 -		young = pte_young(ptent);
 -		dirty = pte_dirty(ptent);
 +		if (nr_batch_entries) {
 +			int nr;
 +			nr = folio_pte_batch(folio_page(page, 0), addr, pte,
 +					     ptent, max_nr, 0, NULL, &young,
 +					     &dirty);
 +			if (nr != nr_batch_entries) {
 +				walk->action = ACTION_AGAIN;
 +				return;
 +			}
 +		} else {
 +			young = pte_young(ptent);
 +			dirty = pte_dirty(ptent);
 +		}
  		present = true;
  	} else if (is_swap_pte(ptent)) {
  		swp_entry_t swpent = pte_to_swp_entry(ptent);
 @@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
  			   struct mm_walk *walk)
  {
  	struct vm_area_struct *vma = walk->vma;
 -	pte_t *pte;
 +	pte_t *pte, ptent;
  	spinlock_t *ptl;
 
  	ptl = pmd_huge_lock(pmd, vma);
 @@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
  		walk->action = ACTION_AGAIN;
  		return 0;
  	}
 +
 +	ptent = ptep_get(pte);
 +	if (pte_present(ptent)) {
 +		struct folio *folio = vm_normal_folio(vma, addr, ptent);
 +		if (folio_test_large(folio)) {
 +			/* Let us use pte batching */
 +			smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
 +			pte_unmap_unlock(pte - 1, ptl);
 +
 +			return 0;
 +		}
 +	}
 +
  	for (; addr != end; pte++, addr += PAGE_SIZE)
 -		smaps_pte_entry(pte, addr, walk);
 +		smaps_pte_entry(pte, addr, walk, 0);
  	pte_unmap_unlock(pte - 1, ptl);
  out:
  	cond_resched();



> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
> managed: you can modify any PTE part of the cont-gang and it will just
> work as expected, transparently.

But that is because AFAIU, on arm64 it knows if it's dealing with a
cont-pte and it queries the status of the whole "gang", right?

I see that huge_ptep_get/set_huge_pte_at checks whether it's dealing
with cont-ptes and modifies/queries all sub-ptes.
How is that done for non-hugetlb pages?

> Not so with hugetlb, where you have to modify (or even query) the whole thing.

I see.

> 
> For GUP it was easier, because it was able to grab all information it needed
> from the sub-ptes fairly easily, and it doesn't modify any page tabls.
> 
> 
> I ran into this problem with folio_walk, and had to document it rather nastily:
> 
>  * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
>  * For example, PMD page table sharing might require prior unsharing. Also,
>  * logical hugetlb entries might span multiple physical page table entries,
>  * which *must* be modified in a single operation (set_huge_pte_at(),
>  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
>  * not correspond to the first physical entry of a logical hugetlb entry.
> 
> I wanted to use it to rewrite the uprobe code to also handle hugetlb with
> less special casing, but that work stalled so far. I think my next attempt would rule
> out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.
> 
> It all gets weird with things like:
> 
> commit 0549e76663730235a10395a7af7ad3d3ce6e2402
> Author: Christophe Leroy <christophe.leroy@csgroup.eu>
> Date:   Tue Jul 2 15:51:25 2024 +0200
> 
>     powerpc/8xx: rework support for 8M pages using contiguous PTE entries
>     In order to fit better with standard Linux page tables layout, add support
>     for 8M pages using contiguous PTE entries in a standard page table.  Page
>     tables will then be populated with 1024 similar entries and two PMD
>     entries will point to that page table.
>     The PMD entries also get a flag to tell it is addressing an 8M page, this
>     is required for the HW tablewalk assistance.
> 
> Where we are walking a PTE table, but actually there is another PTE table we
> have to modify in the same go.
> 
> 
> Very hard to make that non-hugetlb aware, as it's simply completely different compared
> to ordinary page table walking/modifications today.
> 
> Maybe there are ideas to tackle that, and I'd be very interested in them.

Yes, that one is very nasty, and as I recall from other conversations
with Christophe, we ran out of bits (powerpc-8xx) to flag it somehow
that that entry is special, so not really sure either.
Although it would be a pitty that that is the thing that ends up
stalling this


-- 
Oscar Salvador
SUSE Labs


  parent reply	other threads:[~2025-02-03 10:11 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-30 21:36 [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk Oscar Salvador
2025-01-30 22:45 ` Peter Xu
2025-01-30 22:46 ` Matthew Wilcox
2025-01-30 23:19 ` David Hildenbrand
2025-01-31 15:42   ` Christophe Leroy
2025-02-04 20:19     ` David Hildenbrand
2025-02-03 10:10   ` Oscar Salvador [this message]
2025-02-04 20:40     ` David Hildenbrand
2025-02-05  9:33       ` Oscar Salvador
2025-02-11 13:31         ` David Hildenbrand
2025-02-12  9:13           ` Oscar Salvador

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z6CWMBKlN0-RHI2P@localhost.localdomain \
    --to=osalvador@suse.de \
    --cc=david@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=muchun.song@linux.dev \
    --cc=peterx@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.