From: Yafang Shao <laoar.shao@gmail.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	 baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com,  ryan.roberts@arm.com, dev.jain@arm.com,
	hannes@cmpxchg.org,  usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com,  willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	 ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net,
	bpf@vger.kernel.org,  linux-mm@kvack.org,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
Date: Thu, 28 Aug 2025 10:58:57 +0800
Message-ID: <CALOAHbA7wT_LF0Sr2jGWxKU52d-tmHt1sBBjM1koja64t1vi2Q@mail.gmail.com>
In-Reply-To: <06d7bde9-e3f8-45fd-9674-2451b980ef13@lucifer.local>

On Wed, Aug 27, 2025 at 9:14 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:38PM +0800, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >   THP significantly raises overall memory usage, reducing available memory
> >   for workloads.
> >
> > - Latency Spikes
> >   Random latency spikes occur due to frequent memory compaction triggered
> >   by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning is globally configured, making it unsuitable for containerized
> >   environments. When multiple workloads share a host, enabling THP without
> >   per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes—unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
> > policy prototyping tool (test policies via BPF before upstreaming them).
>

Thank you for providing so many comments. I'll take some time to go
through them carefully and reply afterward.

> I think it's important to highlight here that we are exploring an _experimental_
> implementation.

I will add it.

>
> >
> > Proposed Solution
> > =================
> >
> > As suggested by David [0], we introduce a new BPF interface:
>
> I do agree, to be clear, with this broad approach - that is, to provide the
> minimum information that a reasonable decision can be made upon and to keep
> things as simple as we can.
>
> As per the THP cabal (I think? :) the general consensus was in line with
> this.

My testing in both test and production environments indicates that the
following parameters are essential:
- mm_struct (associated with the THP allocation)
- vma_flags (VM_HUGEPAGE, VM_NOHUGEPAGE, or N/A)
- tva_type
- The requested THP orders bitmask

I will retain these four and remove @vma__nullable.

>
>
> >
> > /**
> >  * @get_suggested_order: Get the suggested THP orders for allocation
> >  * @mm: mm_struct associated with the THP allocation
> >  * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> >  *                 When NULL, the decision should be based on @mm (i.e., when
> >  *                 triggered from an mm-scope hook rather than a VMA-specific
> >  *                 context).
>
> I'm a little wary of handing a VMA to BPF, under what locking would it be
> provided?

We cannot arbitrarily access members of struct vm_area_struct from the
BPF program because the VMA pointer is untrusted. The only trusted
pointer is vma->vm_mm, which can be accessed without holding any
additional locks. For the VMA itself, the caller has already taken the
necessary locks at the callsite, so we do not need to acquire them
again.

My testing shows the @vma parameter is not needed. I will remove it in
the next update.
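
For illustration, a minimal BPF-side program against the quoted
prototype might look like the sketch below. The struct_ops type name
(bpf_thp_ops) and the section names are my assumptions, not
necessarily what the series defines:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("struct_ops/get_suggested_order")
  int BPF_PROG(suggested_order, struct mm_struct *mm,
               struct vm_area_struct *vma__nullable, u64 vma_flags,
               enum tva_type tva_flags, int orders)
  {
          /* Only vma__nullable->vm_mm may be dereferenced here; any
           * other vm_area_struct member is untrusted and would be
           * rejected by the verifier.
           */
          if (vma__nullable && vma__nullable->vm_mm != mm)
                  return 0;  /* cannot happen: the caller guarantees @mm */
          return orders;     /* no policy yet: pass the orders through */
  }

  SEC(".struct_ops.link")
  struct bpf_thp_ops thp_ops = {
          .get_suggested_order = (void *)suggested_order,
  };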

>
> >  *                 Must belong to @mm (guaranteed by the caller).
> >  * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>
> Hmm this one is also a bit odd - why would these flags differ? Note that I will
> be changing the VMA flags to a bitmap relatively soon which may be larger than
> the system word size.
>
> So 'handing around all the flags' is something we probably want to avoid.

Good suggestion. Since we specifically need to identify VM_HUGEPAGE or
VM_NOHUGEPAGE, I will add a new enum, bpf_thp_vma_type, for clarity:

+enum bpf_thp_vma_type {
+       BPF_VM_NONE = 0,
+       BPF_VM_HUGEPAGE,        /* VM_HUGEPAGE */
+       BPF_VM_NOHUGEPAGE,      /* VM_NOHUGEPAGE */
+};

The enum can be extended in the future to support file-backed THP by
adding new types.
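
On the kernel side, deriving the enum from a VMA's flags would be a
trivial mapping, for example (a sketch; the helper name is mine, and
VM_NOHUGEPAGE is checked first to mirror existing THP semantics):

  static enum bpf_thp_vma_type bpf_thp_vma_type_of(unsigned long vm_flags)
  {
          if (vm_flags & VM_NOHUGEPAGE)
                  return BPF_VM_NOHUGEPAGE;
          if (vm_flags & VM_HUGEPAGE)
                  return BPF_VM_HUGEPAGE;
          return BPF_VM_NONE;
  }

Once vm_flags becomes a bitmap, as you mentioned, this helper would
switch to the new accessors; the BPF-visible enum stays unchanged.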

>
> For the f_op->mmap_prepare stuff I provided an abstraction
>
> >  * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Rerurn: Bitmask of suggested THP orders for allocation. The highest
>
> Obv. a cover letter thing but typo her :P rerurn -> return.

Will change it.

>
> >  *         suggested order will not exceed the highest requested order
> >  *         in @orders.
>
> In what sense are they 'suggested'? Is this a product of sysfs settings or? I
> think this needs to be clearer.

The order is suggested by a BPF program. I will clarify it in the next version.

>
> >  */
> >  int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> >                             u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>
> Also here in what sense is this suggested? :)

Agreed. I'll rename it to bpf_hook_thp_get_order() as suggested for clarity.
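
Pulling these changes together, the revised hook might look roughly
like this (a sketch, not the final interface):

  int (*bpf_hook_thp_get_order)(struct mm_struct *mm,
                                enum bpf_thp_vma_type vma_type,
                                enum tva_type tva_type,
                                int orders);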

>
> >
> > This interface:
> > - Supports both use cases (per-workload tuning + policy prototyping).
> > - Can be extended with BPF helpers (e.g., for memory pressure awareness).
>
> Hm how would extensions like this work?

To optimize THP allocation, we should consult the PSI data
beforehand. If memory pressure is already high, indicating that
high-order pages would be difficult to allocate, the system should
default to 4K pages instead. This could be implemented by checking the
PSI data of the relevant cgroup:

  struct cgroup *cgrp = task_dfl_cgroup(mm->owner);  /* requires CONFIG_MEMCG */
  struct psi_group *psi = cgroup_psi(cgrp);          /* or psi_system */
  u64 psi_data = psi->total[PSI_AVGS][PSI_MEM_SOME]; /* total[] is indexed by psi_states */

The allocation strategy would then branch based on the value of
psi_data. This may require new BPF helpers to access PSI data
efficiently.
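
The branch itself could be as simple as the following (THRESHOLD is a
hypothetical cutoff, not a recommended value; returning 0 suggests no
THP orders, i.e. fall back to 4K pages):

  if (psi_data > THRESHOLD)
          return 0;       /* high memory pressure: suggest no THP orders */
  return orders;          /* otherwise allow the requested orders */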

>
> >
> > This is an experimental feature. To use it, you must enable
> > CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
>
> Yes! Thanks. I am glad we are putting this behind a config flag.
>
> >
> > Warning:
> > - The interface may change
> > - Behavior may differ in future kernel versions
> > - We might remove it in the future
> >
> >
> > Selftests
> > =========
> >
> > BPF selftests
> > -------------
> >
> > Patch #5: Implements a basic BPF THP policy that restricts THP allocation
> >           via khugepaged to tasks within a specified memory cgroup.
> > Patch #6: Contains test cases validating the khugepaged fork behavior.
> > Patch #7: Provides tests for dynamic BPF program updates and replacement.
> > Patch #8: Includes negative tests for invalid BPF helper usage, verifying
> >           proper verification by the BPF verifier.
> >
> > Currently, several dependency patches reside in mm-new but haven't been
> > merged into bpf-next:
> >   mm: add bitmap mm->flags field
> >   mm/huge_memory: convert "tva_flags" to "enum tva_type"
> >   mm: convert core mm to mm_flags_*() accessors
> >
> > To enable BPF CI testing, these dependencies were manually applied to
> > bpf-next [1]. All selftests in this series pass successfully. The observed
> > CI failures are unrelated to these changes.
>
> Cool, glad at least my mm changes were ok :)
>
> >
> > Performance Evaluation
> > ----------------------
> >
> > As suggested by Usama [2], performance impact was measured given the page
> > fault handler modifications. The standard `perf bench mem memset` benchmark
> > was employed to assess page fault performance.
> >
> > Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
> > node). Due to variance between individual test runs, a script executed
> > 10000 iterations to calculate meaningful averages and standard deviations.
> >
> > The results across three configurations show negligible performance impact:
> > - Baseline (without this patch series)
> > - With patch series but no BPF program attached
> > - With patch series and BPF program attached
> >
> > The results are as follows:
> >
> >   Number of runs: 10,000
> >   Average throughput: 40-41 GB/sec
> >   Standard deviation: 7-8 GB/sec
>
> You're not giving data comparing the 3? Could you do so? Thanks.

I tested all three configurations. Their results were similar, so I
aggregated the data.

>
> >
> > Production verification
> > -----------------------
> >
> > We have successfully deployed a variant of this approach across numerous
> > Kubernetes production servers. The implementation enables THP for specific
> > workloads (such as applications utilizing ZGC [3]) while disabling it for
> > others. This selective deployment has operated flawlessly, with no
> > regression reports to date.
> >
> > For ZGC-based applications, our verification demonstrates that shmem THP
> > delivers significant improvements:
> > - Reduced CPU utilization
> > - Lower average latencies
>
> Obviously it's _really key_ to point out that this feature is intended to
> be _absolutely_ ephemeral - we may or may not implement something like this
> - it's really about both exploring how such an interface might look and
> also helping to determine how an 'automagic' future might look.

Our users can benefit from this feature, which is why we have already
deployed it on our production servers. We are now extending it to more
workloads, such as RDMA applications, where THP provides significant
performance gains. Given the complexity of our production environment,
we have found that manual control is a necessary practice. I am
presenting this case solely to demonstrate the feature's stability and
that it does not introduce regressions. However, I understand this use
case is not recommended by the maintainers and will clarify this in
the next version.

>
> >
> > Future work
> > ===========
> >
> > Based on our validation with production workloads, we observed mixed
> > results with XFS large folios (also known as File THP):
> >
> > - Performance Benefits
> >   Some workloads demonstrated significant improvements with XFS large
> >   folios enabled
> > - Performance Regression
> >   Some workloads experienced degradation when using XFS large folios
> >
> > These results demonstrate that File THP, similar to anonymous THP, requires
> > a more granular approach instead of a uniform implementation.
> >
> > We will extend the BPF-based order selection mechanism to support File THP
> > allocation policies.
> >
> > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > Link: https://github.com/kernel-patches/bpf/pull/9561 [1]
> > Link: https://lwn.net/ml/all/a24d632d-4b11-4c88-9ed0-26fa12a0fce4@gmail.com/ [2]
> > Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [3]
> >
> > Changes:
> > =======
> >
> > RFC v5 -> v6:
> > - Code improvement around the RCU usage (Usama)
> > - Add selftests for khugepaged fork (Usama)
> > - Add performance data for page fault (Usama)
> > - Remove the RFC tag
> >
>
> Sorry I haven't been involved in the RFC reviews, always intended to but
> workload etc.
>
> Will be looking through this series as very interested in exploring this
> approach.

Thanks a lot for your reviews.

-- 
Regards
Yafang
