From: Yafang Shao <laoar.shao@gmail.com>
To: David Hildenbrand <david@redhat.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>,
akpm@linux-foundation.org, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
bpf@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
Date: Sun, 20 Jul 2025 10:32:50 +0800
Message-ID: <CALOAHbBoZpAartkb-HEwxJZ90Zgn+u6G4fCC0_Wq-shKqnb6iQ@mail.gmail.com>
In-Reply-To: <9bc57721-5287-416c-aa30-46932d605f63@redhat.com>
On Thu, Jul 17, 2025 at 4:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.07.25 05:09, Yafang Shao wrote:
> > On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.06.25 09:35, Yafang Shao wrote:
> >>
> >> Sorry for not replying earlier, I was caught up with all other stuff.
> >>
> >> I still consider this a very interesting approach, although I think we
> >> should think more about what a reasonable policy would look like
> >> medium-term (in particular, multiple THP sizes, not always falling back
> >> to small pages if it means splitting excessively in the buddy etc.)
> >
> > I find it difficult to understand why we introduced the mTHP sysfs
> > knobs instead of implementing automatic THP size switching within the
> > kernel. I'm skeptical about its practical utility in real-world
> > workloads.
> >
> > In contrast, XFS large folio (AKA. File THP) can automatically select
> > orders between 0 and 9. Based on our verification, this feature has
> > proven genuinely useful for certain specific workloads—though it's not
> > yet perfect.
>
> I suggest you do some digging about the history of these toggles and the
> plans for the future (automatic), there has been plenty of talk about
> all that.
>
> [...]
>
> >>>
> >>> - THP allocator
> >>>
> >>> int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >>>
> >>> The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >>> indicating whether THP allocation should be performed synchronously
> >>> (current task) or asynchronously (khugepaged).
> >>>
> >>> The decision is based on the current task context, VMA flags, and TVA
> >>> flags.
> >>
> >> I think we should go one step further and actually get advice about the
> >> orders (THP sizes) to use. It might be helpful if the program would have
> >> access to system stats, to make an educated decision.
> >>
> >> Given page fault information and system information, the program could
> >> then decide which orders to try to allocate.
> >
> > Yes, that aligns with my thoughts as well. For instance, we could
> > automate the decision-making process based on factors like PSI, memory
> > fragmentation, and other metrics. However, this logic could be
> > implemented within BPF programs—all we’d need is to extend the feature
> > by introducing a few kfuncs (kernel functions callable from BPF programs).
>
> We discussed this yesterday at a THP upstream meeting, and what we
> should look into is:
>
> (1) Having a callback like
>
> unsigned int (*get_suggested_order)(.., bool in_pagefault);
This interface meets our needs precisely, enabling allocation orders
of either 0 or 9 as required by our workloads.
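To make the 0-or-9 case concrete, here is a minimal userspace sketch of
what such a policy might look like. Only get_suggested_order() and
in_pagefault come from the proposal; struct thp_fault_info, its fields,
and the ORDER_* constants are hypothetical stand-ins, since the remaining
callback arguments are not decided yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constants: order 0 is a base page; order 9 is a
 * PMD-sized THP, assuming 4 KiB base pages (2 MiB total). */
#define ORDER_BASE_PAGE 0u
#define ORDER_PMD_THP   9u

/* Hypothetical fault description; the real arguments are undecided. */
struct thp_fault_info {
	unsigned long vma_size;  /* bytes covered by the faulting VMA */
	bool vma_madvised;       /* VMA was marked with MADV_HUGEPAGE */
};

/* Policy sketch: only ever suggest order 0 or order 9. */
static unsigned int get_suggested_order(const struct thp_fault_info *info,
					bool in_pagefault)
{
	/* 12 is the page shift for 4 KiB base pages (assumption). */
	const unsigned long pmd_size = 1UL << (ORDER_PMD_THP + 12);

	assert(info != NULL);

	/* A VMA smaller than one PMD-sized page cannot hold an order-9 folio. */
	if (info->vma_size < pmd_size)
		return ORDER_BASE_PAGE;

	/* In the fault path, pay the allocation cost only when asked to. */
	if (in_pagefault && !info->vma_madvised)
		return ORDER_BASE_PAGE;

	/* khugepaged, or an madvised VMA: suggest the PMD order. */
	return ORDER_PMD_THP;
}
```

With this shape, a latency-sensitive workload gets base pages at fault
time and leaves collapsing to khugepaged, unless the VMA was madvised.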
>
> Where we can provide some information about the fault (vma
> size/flags/anon_name), and whether we are in the page fault (or in
> khugepaged).
>
> Maybe we want a bitmap of orders to try (fallback), not sure yet.
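If we do end up with a bitmap, the fallback could simply walk the set
bits from the highest order down. A rough userspace sketch, assuming bit N
means "order N may be tried"; try_alloc_fn, alloc_from_order_bitmap() and
the stub allocator are all hypothetical names for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ORDER_BITS ((int)(8 * sizeof(unsigned long)))

/* Hypothetical hook: returns true if an allocation of this order succeeds. */
typedef bool (*try_alloc_fn)(unsigned int order);

/* Try each order set in the bitmap, highest first.
 * Returns the first order that could be allocated, or -1 if none could. */
static int alloc_from_order_bitmap(unsigned long orders, try_alloc_fn try_alloc)
{
	assert(try_alloc != NULL);

	for (int order = NR_ORDER_BITS - 1; order >= 0; order--) {
		if (!(orders & (1UL << order)))
			continue;	/* order not suggested, skip it */
		if (try_alloc((unsigned int)order))
			return order;
	}
	return -1;
}

/* Stub allocator for the example: pretend memory is fragmented so that
 * only orders up to 4 can be satisfied. */
static bool alloc_up_to_order_4(unsigned int order)
{
	return order <= 4;
}
```

A program that suggests orders {9, 4, 0} would then fall back from 9 to 4
under fragmentation instead of dropping straight to base pages.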
>
> (2) Having some way to tag these callbacks as "this is absolutely
> unstable for now and can be changed as we please.".
BPF already gives us this, so we do not need to implement the
restriction ourselves: all BPF kfuncs (including struct_ops) are
currently unstable and may change in the future.
Alexei, could you confirm this understanding?
>
> One idea will be to use this mechanism as a way to easily prototype
> policies, and once we know that a policy works, start moving it into the
> core.
>
> In general, the core, without a BPF program, should be able to continue
> providing a sane default behavior.
Makes sense.
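A sketch of how I picture that, in plain userspace C. When no program is
attached, or its suggestion is not usable, the core keeps today's
behavior of starting with the largest enabled order. struct
thp_policy_ops, thp_pick_order() and the stub policy are hypothetical
names, and enabled_orders is assumed to be already filtered to the orders
that fit the VMA:

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_ORDER_BITS (8 * sizeof(unsigned long))

/* Hypothetical ops table; a BPF struct_ops program would fill this in. */
struct thp_policy_ops {
	unsigned int (*get_suggested_order)(unsigned long enabled_orders,
					    bool in_pagefault);
};

/* Default policy: the largest order set in the bitmap (0 if none). */
static unsigned int highest_enabled_order(unsigned long enabled_orders)
{
	unsigned int order = 0;

	while (enabled_orders >>= 1)
		order++;
	return order;
}

/* Use the program's suggestion when present and valid, else the default. */
static unsigned int thp_pick_order(const struct thp_policy_ops *ops,
				   unsigned long enabled_orders,
				   bool in_pagefault)
{
	unsigned int dflt = highest_enabled_order(enabled_orders);
	unsigned int suggested;

	if (!ops || !ops->get_suggested_order)
		return dflt;	/* no program attached */

	suggested = ops->get_suggested_order(enabled_orders, in_pagefault);

	/* Order 0 is always acceptable; any other order must be enabled. */
	if (suggested != 0 &&
	    (suggested >= NR_ORDER_BITS ||
	     !(enabled_orders & (1UL << suggested))))
		return dflt;

	return suggested;
}

/* Stub policy for the example: always suggest order 4. */
static unsigned int suggest_order_4(unsigned long enabled_orders,
				    bool in_pagefault)
{
	(void)enabled_orders;
	(void)in_pagefault;
	return 4;
}
```

The point is only that the hook is advisory: detaching the program, or a
program returning nonsense, leaves the default path untouched.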
>
> >
> >>
> >> That means, one would query during page faults and during khugepaged,
> >> which order one should try -- compared to our current approach of "start
> >> with the largest order that is enabled and fits".
> >>
> >>>
> >>> - THP reclaimer
> >>>
> >>> int (*reclaimer)(bool vma_madvised);
> >>>
> >>> The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >>> determining whether memory reclamation is handled by the current task or
> >>> kswapd.
> >>
> >> Not sure about that, will have to look into the details.
> >
> > Some workloads allocate all their memory during initialization and do
> > not require THP at runtime. For such cases, aggressively attempting
> > THP allocation is beneficial. However, other workloads may dynamically
> > allocate THP during execution—if these are latency-sensitive, we must
> > avoid introducing long allocation delays.
> >
> > Given these differing requirements, the global
> > /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> > Instead, we should implement per-workload defrag policies to better
> > optimize performance based on individual application behavior.
>
> We'll be very careful about the callbacks we will offer. Maybe the
> get_suggested_order() callback could itself make a decision and not
> suggest a high order if allocation would require compaction.
>
> Initially, we should keep it simple and see what other callbacks to add
> / how to extend get_suggested_order(), to cover these cases.
Yes, we can proceed by adding a simple get_suggested_order() and
address any remaining details in follow-up work.
--
Regards
Yafang