linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	hannes@cmpxchg.org, usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net,
	bpf@vger.kernel.org, linux-mm@kvack.org,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
Date: Fri, 29 Aug 2025 11:42:16 +0100	[thread overview]
Message-ID: <95a32a87-5fa8-4919-8166-e9958d6d4e38@lucifer.local> (raw)
In-Reply-To: <CALOAHbCQucvD968pgmMzv0dcg1j5cJ+Nxz4FKaiGXajXXBcs0Q@mail.gmail.com>

On Fri, Aug 29, 2025 at 11:01:59AM +0800, Yafang Shao wrote:
> On Thu, Aug 28, 2025 at 6:50 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> > > > Also will mm ever != vma->vm_mm?
> > >
> > > No it can't. It can be guaranteed by the caller.
> >
> > In this case we don't need to pass mm separately then right?
>
> Right, we need to pass either @mm or @vma. However, there are cases
> where vma information is not available at certain call sites, such as
> in khugepaged. In those cases, we need to pass @mm instead.

Yeah... this is weird to me though, are you checking in _general_ what
khugepaged should use, or otherwise surely it's per-VMA?

Otherwise this bpf hook seems ill-suited for that, and we should have a
separate one for khugepaged surely?

I also hate that we're passing mm _just because of this one edge case_,
otherwise always passing vma->vm_mm, it's a confusing interface.

>
> >
> > >
> > > >
> > > > Are we hacking this for the sake of overloading what this does?
> > >
> > > The @vma is actually unneeded. I will remove it.
> >
> > Ah OK.
> >
> > I am still a little concerned about passing around a value reference to the VMA
> > flags though, esp as this type can + will change in future (not sure what that
> > means for BPF).
> >
> > We may go to e.g. a 128 bit bitmap there etc.
>
> As mentioned in another thread, we only need to determine whether the
> flag is VM_HUGEPAGE or VM_NOHUGEPAGE, so it can be simplified.

OK cool thanks. Maybe missed.

>
> >
> >
> > >
> > > >
> > > > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > > > like that tbh - I feel like we shoudl simply provide one order but open for
> > > > disucssion) - shouldn't it return an unsigned long?
> > >
> > > We are indifferent to whether a single order or a bitmask is returned,
> > > as we only use order-0 and order-9. We have no use cases for
> > > middle-order pages, though this feature might be useful for other
> > > architectures or for some special use cases.
> >
> > Well surely we want to potentially specify a mTHP under certain circumstances
> > no?
>
> Perhaps there are use cases, but I haven’t found any use cases for
> this in our production environment. On the other hand, I can clearly
> see a risk that it could lead to more costly high-order allocations.

So why are we returning a bitmap then? Seems like we should just return a
single order in this case... I think you say below that you are open to
this?

>
> >
> > In any case I feel it's worth making any bitfield a system word size.

Also :>)

If we do move to returning a single order, should be unsigned int.

> >
> > >
> > > >
> > > > > +#else
> > > > > +static inline int
> > > > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > > > > +{
> > > > > +     return orders;
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  static inline int highest_order(unsigned long orders)
> > > > >  {
> > > > >       return fls_long(orders) - 1;
> > > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > > > index eb1946a70cff..d81c1228a21f 100644
> > > > > --- a/include/linux/khugepaged.h
> > > > > +++ b/include/linux/khugepaged.h
> > > > > @@ -4,6 +4,8 @@
> > > > >
> > > > >  #include <linux/mm.h>
> > > > >
> > > > > +#include <linux/huge_mm.h>
> > > > > +
> > > >
> > > > Hm this is iffy too, There's probably a reason we didn't include this before,
> > > > the headers can be so so fragile. Let's be cautious...
> > >
> > > I will check.
> >
> > Thanks!
> >
> > >
> > > >
> > > > >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > >  extern struct attribute_group khugepaged_attr_group;
> > > > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > > >
> > > > >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > > > >  {
> > > > > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > > > > +     /*
> > > > > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > > > > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > > > > +      * child can allocate THPs.
> > > > > +      *
> > > > > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > > > > +      */
> > > > > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > > > > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > > >
> > > > Hmmm so there seems to be some kind of additional functionality you're providing
> > > > here kinda quietly, which is to allow the exact same interface to determine
> > > > whether we kick off khugepaged or not.
> > > >
> > > > Don't love that, I think we should be hugely specific about that.
> > > >
> > > > This bpf interface should literally be 'ok we're deciding what order we
> > > > want'. It feels like a bit of a gross overloading?
> > >
> > > This makes sense. I have no objection to reverting to returning a single order.
> >
> > OK but key point here is - we're now determining if a forked child can _not_
> > allocate THPs using this function.
> >
> > To me this should be a separate function rather than some _weird_ usage of this
> > same function.
>
> Perhaps a separate function is better.

Thanks!

>
> >
> > And generally at this point I think we should just drop this bit of code
> > honestly.
>
> MMF_VM_HUGEPAGE is set when the THP mode is "always" or "madvise". If
> it’s set, any forked child processes will inherit this flag. It is
> only cleared when the mm_struct is destroyed (please correct me if I’m
> wrong).

__mmput()
-> khugepaged_exit()
-> (if MMF_VM_HUGEPAGE set) __khugepaged_exit()
-> Clear flag once mm fully done with (afaict), dropping associated mm refcount.

^--- this does seem to be accurate indeed.

>
> However, when you switch the THP mode to "never", tasks that still
> have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> issue under the current global mode because khugepaged doesn’t run
> when THP is set to "never".
>
> The problem arises when we move from a global mode to a per-task mode.
> In that case, khugepaged may end up doing unnecessary work. For
> example, if the THP mode is "always", but some tasks are not allowed
> to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> will continue scanning them unnecessarily.

But this can change right?

I really don't like the idea _at all_ of overriding this hook to do things
other than what it says it does.

It's 'set which order to use' except when it's this case then it's 'will we
do any work'.

This should be a separate callback or we should drop this and live with the
possible additional work.

>
> To avoid this, we should prevent setting this flag for child processes
> if they are not allowed to allocate THP in the first place. This way,
> khugepaged won’t waste cycles scanning them. While an alternative
> approach would be to set the flag at fork and later clear it for
> khugepaged, it’s clearly more efficient to avoid setting it from the
> start.

We also obviously should have a comment with all this context here.


> >
> > >
> > > >
> > > > > +     if (highest_order(suggested_orders) > highest_order(orders))
> > > > > +             suggested_orders = orders;
> > > >
> > > > Hmmm so the semantics are - whichever is the highest order wins?
> > >
> > > The maximum requested order is determined by the callsite. For example:
> > > - PMD-mapped THP uses PMD_ORDER
> > > - mTHP uses (PMD_ORDER - 1)
> > >
> > > We must respect this upper bound to avoid undefined behavior. So the
> > > highest suggested order can't exceed the highest requested order.
> >
> > OK, please document this in a comment here.
>
> will doc it.

Thanks!

>
> >
> > >
> > > >
> > > > I thought the idea was we'd hand control over to bpf if provided in effect?
> > > >
> > > > Definitely worth going over these semantics in the cover letter (and do forgive
> > > > me if you have and I've missed! :)
> > >
> > > It has already in the cover letter:
> > >
> > >  * Return: Bitmask of suggested THP orders for allocation. The highest
> > >  *         suggested order will not exceed the highest requested order
> > >  *         in @orders.
> >
> > OK cool thanks, a comment here would be useful also.
>
> will add it.

Thanks!

> > > >
> > > > > +
> > > > > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > > > > +     .verifier_ops = &thp_bpf_verifier_ops,
> > > > > +     .init = bpf_thp_init,
> > > > > +     .init_member = bpf_thp_init_member,
> > > > > +     .reg = bpf_thp_reg,
> > > > > +     .unreg = bpf_thp_unreg,
> > > > > +     .update = bpf_thp_update,
> > > > > +     .validate = bpf_thp_validate,
> > > > > +     .cfi_stubs = &__bpf_thp_ops,
> > > > > +     .owner = THIS_MODULE,
> > > > > +     .name = "bpf_thp_ops",
> > > > > +};
> > > > > +
> > > > > +static int __init bpf_thp_ops_init(void)
> > > > > +{
> > > > > +     int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> > > > > +
> > > > > +     if (err)
> > > > > +             pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> > > > > +     return err;
> > > > > +}
> > > > > +late_initcall(bpf_thp_ops_init);
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index d89992b65acc..bd8f8f34ab3c 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > > > >               return ret;
> > > > >       khugepaged_enter_vma(vma, vma->vm_flags);
> > > > >
> > > > > +     /*
> > > > > +      * This check must occur after khugepaged_enter_vma() because:
> > > > > +      * 1. We may permit THP allocation via khugepaged
> > > > > +      * 2. While simultaneously disallowing THP allocation
> > > > > +      *    during page fault handling
> > > > > +      */
> > > > > +     if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> > > > > +                             BIT(PMD_ORDER))
> > > >
> > > > Hmmm so you return a bitmask of orders, but then you only allow this fault if
> > > > the only order provided is PMD order? That seems strange. Can you explain?
> > >
> > > This is in the do_huge_pmd_anonymous_page() that can only accept a PMD
> > > order, otherwise it might result in unexpected behavior.
> >
> > OK please document this in the comment.
>
> will doc it.

Thanks!

>
>
> --
> Regards
> Yafang

Cheers, Lorenzo


  reply	other threads:[~2025-08-29 10:43 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
2025-08-27  2:57   ` kernel test robot
2025-08-27 11:39     ` Yafang Shao
2025-08-27 15:04       ` Lorenzo Stoakes
2025-08-27 15:03   ` Lorenzo Stoakes
2025-08-28  5:54     ` Yafang Shao
2025-08-28 10:50       ` Lorenzo Stoakes
2025-08-29  3:01         ` Yafang Shao
2025-08-29 10:42           ` Lorenzo Stoakes [this message]
2025-08-31  3:11             ` Yafang Shao
2025-09-01 11:39               ` Lorenzo Stoakes
2025-09-02  2:48                 ` Yafang Shao
2025-09-02  7:50                   ` Lorenzo Stoakes
2025-09-03  2:10                     ` Yafang Shao
2025-08-29  4:56   ` Barry Song
2025-08-29  5:36     ` Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
2025-08-27 15:34   ` Lorenzo Stoakes
2025-08-27 20:50     ` Shakeel Butt
2025-08-28 10:40       ` Lorenzo Stoakes
2025-08-28 16:00         ` Shakeel Butt
2025-08-29 10:45           ` Lorenzo Stoakes
2025-08-28  6:57     ` Yafang Shao
2025-08-28 10:42       ` Lorenzo Stoakes
2025-08-29  3:09         ` Yafang Shao
2025-08-27 20:45   ` Shakeel Butt
2025-08-28  6:58     ` Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
2025-08-27 15:42   ` Lorenzo Stoakes
2025-08-27 21:50     ` Andrii Nakryiko
2025-08-28  6:50       ` Yafang Shao
2025-08-28 10:51       ` Lorenzo Stoakes
2025-08-29  3:15         ` Yafang Shao
2025-08-29 10:42           ` Lorenzo Stoakes
2025-08-28  6:47     ` Yafang Shao
2025-08-29 10:43       ` Lorenzo Stoakes
2025-08-26  7:19 ` [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted Yafang Shao
2025-08-27 15:45   ` Lorenzo Stoakes
2025-08-28  6:12     ` Yafang Shao
2025-08-28 11:11       ` Lorenzo Stoakes
2025-08-29  3:05         ` Yafang Shao
2025-08-29 10:49           ` Lorenzo Stoakes
2025-08-31  3:16             ` Yafang Shao
2025-09-01 10:36               ` Lorenzo Stoakes
2025-08-26  7:19 ` [PATCH v6 mm-new 05/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 06/10] selftests/bpf: add test case for khugepaged fork Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 07/10] selftests/bpf: add test case to update thp policy Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 08/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 09/10] Documentation: add BPF-based THP adjustment documentation Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment Yafang Shao
2025-08-27 15:47   ` Lorenzo Stoakes
2025-08-28  6:08     ` Yafang Shao
2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
2025-08-26  8:33   ` Lorenzo Stoakes
2025-08-26 12:06     ` Yafang Shao
2025-08-26  9:52   ` Usama Arif
2025-08-26 12:10     ` Yafang Shao
2025-08-26 12:03   ` Yafang Shao
2025-08-27 13:14 ` Lorenzo Stoakes
2025-08-28  2:58   ` Yafang Shao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=95a32a87-5fa8-4919-8166-e9958d6d4e38@lucifer.local \
    --to=lorenzo.stoakes@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=ameryhung@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bpf@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=daniel@iogearbox.net \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=gutierrez.asier@huawei-partners.com \
    --cc=hannes@cmpxchg.org \
    --cc=laoar.shao@gmail.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npache@redhat.com \
    --cc=rientjes@google.com \
    --cc=ryan.roberts@arm.com \
    --cc=usamaarif642@gmail.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).