From: Yafang Shao <laoar.shao@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>,
Gutierrez Asier <gutierrez.asier@huawei-partners.com>,
Johannes Weiner <hannes@cmpxchg.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
akpm@linux-foundation.org, ast@kernel.org, daniel@iogearbox.net,
andrii@kernel.org, Baolin Wang <baolin.wang@linux.alibaba.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Nico Pache <npache@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
bpf@vger.kernel.org, linux-mm@kvack.org,
Michal Hocko <mhocko@suse.com>
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
Date: Mon, 5 May 2025 10:35:04 +0800 [thread overview]
Message-ID: <CALOAHbA-PvLmzuv4oWenBRjvRM0KQiDa2nV0OsdTFr5czuLMUw@mail.gmail.com> (raw)
In-Reply-To: <4883bdec-f7f2-4350-bf72-f0fa75c9ddd5@redhat.com>
On Fri, May 2, 2025 at 9:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.05.25 14:18, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 2 May 2025, at 1:48, Yafang Shao wrote:
> >>
> >>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> >>> <gutierrez.asier@huawei-partners.com> wrote:
> >>>>
> >>>>
> >>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>>>
> >>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>>>> If it isn't, can you state why?
> >>>>>>>>>>
> >>>>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>>>> people might feel about you messing with workloads,
> >>>>>>>>>
> >>>>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>>>> This is a two-way decision.
> >>>>>>>>
> >>>>>>>> This implies you want a container-level way of controlling the setting
> >>>>>>>> and not a system service-level?
> >>>>>>>
> >>>>>>> Right. We want to control the THP per container.
> >>>>>>
> >>>>>> This does strike me as a reasonable usecase.
> >>>>>>
> >>>>>> I think there is consensus that in the long-term we want this stuff to
> >>>>>> just work and truly be transparent to userspace.
> >>>>>>
> >>>>>> In the short-to-medium term, however, there are still quite a few
> >>>>>> caveats. thp=always can significantly increase the memory footprint of
> >>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>>>> as we would like them to be, which for real production systems means
> >>>>>> having to make workload-specific choices and tradeoffs.
> >>>>>>
> >>>>>> There is ongoing work in these areas, but we do have a bit of a
> >>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>>>> due to limitations in how they can be deployed. For example, we can't
> >>>>>> do thp=always on a DC node that runs arbitrary combinations of jobs
> >>>>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>>>
> >>>>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>>>> production experience and data from real-world usecases. We can't
> >>>>>> improve the THP shrinker if we can't run THP.
> >>>>>>
> >>>>>> So I don't see it as overriding whoever wrote the software running
> >>>>>> inside the container. They don't know, and they shouldn't have to care
> >>>>>> about page sizes. It's about letting admins and kernel teams get
> >>>>>> started on using and experimenting with this stuff, given the very
> >>>>>> real constraints right now, so we can get the feedback necessary to
> >>>>>> improve the situation.
> >>>>>
> >>>>> Since you think it is reasonable to control THP at container-level,
> >>>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>>>> (Asier cc'd)
> >>>>>
> >>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>>>> on VMA, which does not look a good approach to me. WDYT?
> >>>>>
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >>>>>
> >>>>> --
> >>>>> Best Regards,
> >>>>> Yan, Zi
> >>>>
> >>>> Hi,
> >>>>
> >>>> I believe cgroup is a better approach for containers, since this
> >>>> approach can be easily integrated with the user space stack like
> >>>> containerd and Kubernetes, which use cgroups to control system resources.
> >>>
> >>> The integration of BPF with containerd and Kubernetes is emerging as a
> >>> clear trend.
> >>>
> >>>>
> >>>> However, I pointed out earlier, the approach I suggested has some
> >>>> flaws:
> >>>> 1. Potential pollution of cgroup with a large number of knobs
> >>>
> >>> Right, the memcg maintainers once told me that introducing a new
> >>> cgroup file means committing to maintaining it indefinitely, as these
> >>> interface files are treated as part of the ABI.
> >>> In contrast, BPF kfuncs are considered an unstable API, giving you the
> >>> flexibility to modify them later if needed.
> >>>
> >>>> 2. Requires configuration by the admin
> >>>>
> >>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >>>
> >>> Take Matthew’s XFS large folio feature as an example—it was enabled
> >>> automatically. A few years ago, when we upgraded to the 6.1.y stable
> >>> kernel, we noticed this new feature. Since it was enabled by default,
> >>> we assumed the author was confident in its stability. Unfortunately,
> >>> it led to severe issues in our production environment: servers crashed
> >>> randomly, and in some cases, we experienced data loss without
> >>> understanding the root cause.
> >>>
> >>> We began disabling various kernel configurations in an attempt to
> >>> isolate the issue, and eventually, the problem disappeared after
> >>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> >>> kernel version with THP disabled and had to restart hundreds of
> >>> thousands of production servers. It was a nightmare for both us and
> >>> our sysadmins.
> >>>
> >>> Last year, we discovered that the initial issue had been resolved by this patch:
> >>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
> >>> We backported the fix and re-enabled XFS large folios—only to face a
> >>> new nightmare. One of our services began crashing sporadically with
> >>> core dumps. It took us several months to trace the issue back to the
> >>> re-enabled XFS large folio feature. Fortunately, we were able to
> >>> disable it using livepatch, avoiding another round of mass server
> >>> restarts. To this day, the root cause remains unknown. The good news
> >>> is that the issue appears to be resolved in the 6.12.y stable kernel.
> >>> We're still trying to bisect which commit fixed it, though progress is
> >>> slow because the issue is not reliably reproducible.
> >>
> >> This is a very wrong attitude towards open source projects. You sounded
> >> like, whether intended or not, the Linux community should provide issue-free
> >> kernels and is responsible for fixing all issues. But that is wrong.
> >> Since you are using the kernel, you could help improve it like Kairong
> >> is doing instead of waiting for others to fix the issue.
> >>
> >>>
> >>> In theory, new features should be enabled automatically. But in
> >>> practice, every new feature should come with a tunable knob. That’s a
> >>> lesson we learned the hard way from this experience—and perhaps
> >>> Matthew did too.
> >>
> >> That means new features will not get enough testing. People like you
> >> will simply disable all new features and wait until they are stable.
> >> That stability would never come without testing and bug fixes.
>
Hello David,
Thank you for your reply.
> We do have the concept of EXPERIMENTAL kernel configs, which are either
> expected to get removed completely ("always enabled") or get turned into
> actual long-term kernel options. But yeah, it's always tricky to decide
> what we actually want to put behind such options.
>
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...
READ_ONLY_THP_FOR_FS is not enabled in our 6.1 kernel, as we are
cautious about enabling any EXPERIMENTAL feature. XFS large folio
support operates independently of READ_ONLY_THP_FOR_FS. However, it is
automatically enabled whenever CONFIG_TRANSPARENT_HUGEPAGE is set, as can
be seen in mapping_large_folio_support() in the 6.1.y stable kernel.
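For reference, that helper in the 6.1.y tree looks roughly like the
following (paraphrased from memory, not a verbatim quote, so please
double-check against the actual tree):

/* include/linux/pagemap.h, 6.1.y (paraphrased) */
static inline bool mapping_large_folio_support(struct address_space *mapping)
{
	/* Large folios are used only when THP is compiled in and the
	 * filesystem has opted in via mapping_set_large_folios(), which
	 * XFS does at inode setup time. */
	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}

So once CONFIG_TRANSPARENT_HUGEPAGE=y, XFS large folios come in
automatically, independent of READ_ONLY_THP_FOR_FS.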
>
> Distro kernels are usually very careful about what to backport and what
> to support. Once we (working for a distro) do backport + test, we
> usually find some additional things that upstream hasn't spotted yet: in
> particular, because some workloads are only run in that form on distro
> kernels. We also ran into some issues with large folios (e.g., me
> personally with s390x KVM guests) and are trying our best to fix them.
We also worked on this. As you may recall, I previously fixed a large
folio bug, which was merged into the 6.1.y stable kernel [0].
[0]. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.1.y&id=a3f8ee15228c89ce3713ee7e8e82f6d8a13fdb4b
>
> It can be quite time consuming, so I can understand that not everybody
> has the time to invest into heavy debugging, especially if it's
> extremely hard to reproduce (or even corrupts data :( ).
Correct. If the vmcore is incomplete, it is nearly impossible to
reliably determine the root cause. The best approach is to isolate the
issue as quickly as possible.
>
> I agree that adding a toggle after the fact to work around issues is
> not the right approach. Introducing an EXPERIMENTAL toggle early because
> one suspects complicated interactions is a different story. It's
> absolutely not trivial to make that decision.
In this patchset, we are not introducing a toggle as a workaround.
Rather, the change reflects the fact that some workloads benefit from
THP while others are negatively impacted. Therefore, it makes sense to
enable THP selectively based on workload characteristics.
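To make that concrete, below is a minimal sketch of the kind of
per-workload policy we have in mind. Note that the attach point
("thp_adjust"), struct thp_adjust_ctx and the return convention are
purely illustrative placeholders, not the interface added by this
series:

/* Illustrative sketch only; names and semantics are hypothetical. */
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct thp_adjust_ctx {
	__u64 cgroup_id;	/* cgroup of the faulting task */
	__u64 vma_flags;	/* flags of the VMA being faulted */
};

/* Set by the loader to the cgroup id of the one workload that is
 * known to benefit from THP. */
const volatile __u64 thp_allowed_cgid;

SEC("thp_adjust")			/* hypothetical attach point */
int select_thp(struct thp_adjust_ctx *ctx)
{
	/* Behave like "always" for the allowed workload, and leave every
	 * other container on the global default. */
	return ctx->cgroup_id == thp_allowed_cgid ? 1 : 0;
}

char LICENSE[] SEC("license") = "GPL";

The point is that the policy lives in a loadable program rather than in
a new stable ABI, so it can be tuned per container by the orchestration
layer.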
>
> >
> > Pardon me?
> > This discussion has taken such an unexpected turn that I don’t feel
> > the need to explain what I’ve contributed to the Linux community over
> > the past few years.
>
> I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
>
> "It's difficult to decide which toggles make sense. There is a fine line
> between adding a toggle and not getting people actually testing it to
> stabilize it vs. not adding a toggle and forcing people to test it and
> fix it/report issues."
>
> Ideally, we'd find most issues in the RC phase or at least shortly after.
>
> You've been active in the kernel for a long time, please don't feel like
> the community is not appreciating that.
Thank you for the clarification. I truly appreciate your patience and
thoroughness.
--
Regards
Yafang
Thread overview: 41+ messages
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h Yafang Shao
2025-04-29 15:13 ` Zi Yan
2025-04-30 2:40 ` Yafang Shao
2025-04-30 12:11 ` Zi Yan
2025-04-30 14:43 ` Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}() Yafang Shao
2025-04-29 15:31 ` Zi Yan
2025-04-30 2:46 ` Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 3/4] mm: add BPF hook for THP adjustment Yafang Shao
2025-04-29 15:19 ` Alexei Starovoitov
2025-04-30 2:48 ` Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 4/4] selftests/bpf: Add selftest " Yafang Shao
2025-04-29 3:11 ` [RFC PATCH 0/4] mm, bpf: BPF based " Matthew Wilcox
2025-04-29 4:53 ` Yafang Shao
2025-04-29 15:09 ` Zi Yan
2025-04-30 2:33 ` Yafang Shao
2025-04-30 13:19 ` Zi Yan
2025-04-30 14:38 ` Yafang Shao
2025-04-30 15:00 ` Zi Yan
2025-04-30 15:16 ` Yafang Shao
2025-04-30 15:21 ` Liam R. Howlett
2025-04-30 15:37 ` Yafang Shao
2025-04-30 15:53 ` Liam R. Howlett
2025-04-30 16:06 ` Yafang Shao
2025-04-30 17:45 ` Johannes Weiner
2025-04-30 17:53 ` Zi Yan
2025-05-01 19:36 ` Gutierrez Asier
2025-05-02 5:48 ` Yafang Shao
2025-05-02 12:00 ` Zi Yan
2025-05-02 12:18 ` Yafang Shao
2025-05-02 13:04 ` David Hildenbrand
2025-05-02 13:06 ` Matthew Wilcox
2025-05-02 13:34 ` Zi Yan
2025-05-05 2:35 ` Yafang Shao [this message]
2025-05-05 9:11 ` Gutierrez Asier
2025-05-05 9:38 ` Yafang Shao
2025-04-30 17:59 ` Johannes Weiner
2025-05-01 0:40 ` Yafang Shao
2025-04-30 14:40 ` Liam R. Howlett
2025-04-30 14:49 ` Yafang Shao