From: "Zi Yan" <ziy@nvidia.com>
To: "Yafang Shao" <laoar.shao@gmail.com>, <akpm@linux-foundation.org>,
	<ast@kernel.org>, <daniel@iogearbox.net>, <andrii@kernel.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	"Nico Pache" <npache@redhat.com>,
	"Ryan Roberts" <ryan.roberts@arm.com>,
	"Dev Jain" <dev.jain@arm.com>
Cc: <bpf@vger.kernel.org>, <linux-mm@kvack.org>
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
Date: Tue, 29 Apr 2025 11:09:44 -0400
Message-ID: <D9J7UWF1S5WH.285Y0GXSUD30W@nvidia.com>
In-Reply-To: <20250429024139.34365-1-laoar.shao@gmail.com>

Hi Yafang,

We recently added a new THP entry to the MAINTAINERS file [1]. Would you
mind CCing the people listed there in your next version? (I have added
them here.)

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589

On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> In our container environment, we aim to enable THP selectively—allowing
> specific services to use it while restricting others. This approach is
> driven by the following considerations:
>
> 1. Memory Fragmentation
>    THP can lead to increased memory fragmentation, so we want to limit its
>    use across services.
> 2. Performance Impact
>    Some services see no benefit from THP, making its usage unnecessary.
> 3. Performance Gains
>    Certain workloads, such as machine learning services, experience
>    significant performance improvements with THP, so we enable it for them
>    specifically. 
>
> Since multiple services run on a single host in a containerized environment,
> enabling THP globally is not ideal. Previously, we set THP to madvise,
> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> approach had a limitation:
>
> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
>   third-party libraries, bypassing our restrictions.

Basically, you want more precise control over THP enablement and the
ability to override madvise() from userspace.

In terms of overriding madvise(), do you have any concrete example of
these third-party libraries? madvise() users are supposed to know what
they are doing, so I wonder why they are causing trouble in your
environment.
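
As a point of reference, the opt-in that such a library performs boils
down to a single madvise() call on a mapping. A minimal userspace sketch
follows; the helper name alloc_thp_hint() is hypothetical and is not taken
from the patch series or from any particular library:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous region and hint that it should be
 * backed by transparent huge pages.
 *
 * Returns the mapping, or NULL if mmap() fails. *advised is set to 1 if
 * madvise(MADV_HUGEPAGE) succeeded and 0 otherwise (for example, on a
 * kernel built without THP support).
 */
void *alloc_thp_hint(size_t len, int *advised)
{
    assert(advised != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The hint only matters when the system-wide mode honours madvise. */
    *advised = (madvise(p, len, MADV_HUGEPAGE) == 0);
    return p;
}
```

A caller that never asked for THP still gets this behaviour if a library
it links against makes the same call on its own allocations.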

>
> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> which is error-injectable, to blacklist unwanted services. While this
> worked, it was error-prone and ineffective for services needing always mode,
> as modifying their code to use madvise was impractical.
>
> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> Now, we dynamically adjust THP settings per service by hooking
> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> enable or disable on a per-service basis without global impact.

hugepage_global_*() are system-wide knobs. How did you use them to
achieve per-service control? By per-service, do you mean you need a
per-memcg THP configuration (I assume each service has its own memcg)?
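
For context, the system-wide mode is visible from userspace in
/sys/kernel/mm/transparent_hugepage/enabled, where the active value is
shown in brackets, e.g. "always [madvise] never". A small sketch for
reading it; the function names thp_selected_mode() and thp_global_mode()
are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed token from a line such as "always [madvise] never"
 * into out. Returns 0 on success, or -1 if there is no bracketed token
 * or it does not fit in outlen bytes (including the terminating NUL).
 */
int thp_selected_mode(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (!close)
        return -1;

    size_t n = (size_t)(close - open - 1);
    if (n + 1 > outlen)
        return -1;

    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the system-wide mode from sysfs; returns -1 if it is unavailable. */
int thp_global_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f)
        return -1;

    char *ok = fgets(line, sizeof(line), f);
    fclose(f);
    return ok ? thp_selected_mode(line, out, outlen) : -1;
}
```

This file holds a single value for the whole host, which is why a
per-service policy needs some additional input beyond it.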

>
> The hugepage_global_{enabled,always}() functions currently share the same
> BPF hook, which limits THP configuration to either always or never. While
> this suffices for our specific use cases, full support for all three modes
> (always, madvise, and never) would require splitting them into separate
> hooks.
>
> This is the initial RFC patch—feedback is welcome!
>
> Yafang Shao (4):
>   mm: move hugepage_global_{enabled,always}() to internal.h
>   mm: pass VMA parameter to hugepage_global_{enabled,always}()
>   mm: add BPF hook for THP adjustment
>   selftests/bpf: Add selftest for THP adjustment
>
>  include/linux/huge_mm.h                       |  54 +-----
>  mm/Makefile                                   |   3 +
>  mm/bpf.c                                      |  36 ++++
>  mm/bpf.h                                      |  21 +++
>  mm/huge_memory.c                              |  50 ++++-
>  mm/internal.h                                 |  21 +++
>  mm/khugepaged.c                               |  18 +-
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 176 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  32 ++++
>  10 files changed, 344 insertions(+), 68 deletions(-)
>  create mode 100644 mm/bpf.c
>  create mode 100644 mm/bpf.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c




-- 
Best Regards,
Yan, Zi
