* [RFC PATCH v1.1 2/2] Docs/admin-guide/mm/damon/stat: document kdamond_pid parameter
From: SeongJae Park @ 2026-04-14 23:59 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260414235912.98174-1-sj@kernel.org>
Update DAMON_STAT usage document for newly added kdamond_pid parameter.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Documentation/admin-guide/mm/damon/stat.rst | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst
index c4b14daeb2dd6..46c5dd96aa2ed 100644
--- a/Documentation/admin-guide/mm/damon/stat.rst
+++ b/Documentation/admin-guide/mm/damon/stat.rst
@@ -89,3 +89,10 @@ percentiles of the idle time values via this read-only parameter. Reading the
parameter returns 101 idle time values in milliseconds, separated by comma.
Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle
times.
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_STAT is enabled, this becomes the PID of the worker thread. Else, -1.
--
2.47.3
^ permalink raw reply related
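
For illustration only, a user-space reader of the new parameter might look like the sketch below. It assumes the parameter is exposed at /sys/module/damon_stat/parameters/kdamond_pid, following the usual module parameter layout; the path and both helper names are assumptions of this sketch, not something the patch states. Per the documentation text above, a value of -1 means DAMON_STAT is not enabled.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed sysfs location of the parameter; not stated in the patch. */
#define KDAMOND_PID_PATH "/sys/module/damon_stat/parameters/kdamond_pid"

/*
 * Parse the text read from the parameter file (hypothetical helper).
 * Returns 0 and stores the value in *pid on success, -1 on malformed input.
 * A stored value of -1 means DAMON_STAT is not enabled.
 */
int parse_kdamond_pid(const char *text, long *pid)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (errno || end == text)
		return -1;
	/* sysfs parameter reads normally end with a newline */
	if (*end != '\0' && *end != '\n')
		return -1;
	*pid = val;
	return 0;
}

/* Read and parse the parameter file at 'path' (hypothetical helper). */
int read_kdamond_pid(const char *path, long *pid)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_kdamond_pid(buf, pid);
}
```

A management tool would call read_kdamond_pid(KDAMOND_PID_PATH, &pid) and treat pid == -1 as "DAMON_STAT is off" rather than as an error.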
* [RFC PATCH v1.1 0/2] mm/damon/stat: add kdamond_pid parameter
From: SeongJae Park @ 2026-04-14 23:59 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
DAMON_STAT doesn't provide the pid of its kdamond, unlike DAMON_RECLAIM
and DAMON_LRU_SORT. This makes user-space management of DAMON_STAT
unnecessarily complicated. Provide the information via a new parameter,
namely kdamond_pid, and document it.
Changes from RFC
- rfc: https://lore.kernel.org/20260414053742.90296-1-sj@kernel.org
- Fix damon_kdamond_pid() failure handling.
SeongJae Park (2):
mm/damon/stat: add a parameter for reading kdamond pid
Docs/admin-guide/mm/damon/stat: document kdamond_pid parameter
Documentation/admin-guide/mm/damon/stat.rst | 7 +++++++
mm/damon/stat.c | 17 +++++++++++++++++
2 files changed, 24 insertions(+)
base-commit: 02784c37a710fa3c8c3e7be4f27a5cfa3356dc00
--
2.47.3
^ permalink raw reply
* Re: [PATCH V10 00/10] famfs: port into fuse
From: John Groves @ 2026-04-14 23:53 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, John Groves,
Dan Williams, Bernd Schubert, Alison Schofield, John Groves,
Jonathan Corbet, Shuah Khan, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Randy Dunlap, Jeff Layton, Amir Goldstein,
Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <20260414185740.GA604658@frogsfrogsfrogs>
On 26/04/14 11:57AM, Darrick J. Wong wrote:
> On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> > On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > Overall, my intention with bringing this up is just to make sure we're
> > > > at least aware of this alternative before anything is merged and
> > > > permanent. If Miklos and you think we should land this series, then
> > > > I'm on board with that.
> > >
> > > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > > absolutely necessary. This was the main sticking point originally,
> > > but there seemed to be no better alternative.
> > >
> > > However with the bpf approach this would be gone, which is great.
>
> Well... you can't get away with having *no* mapping interface at all.
> You still have to define a UABI that BPF programs can use to convey
> mapping data into fsdax/iomap. BTF is a nice piece of work that smooths
> over minor fluctuations in struct layout between a running kernel and
> a precompiled BPF program, but fundamentally we still need a fuse-native
> representation.
A couple of points here, that are really top level observations.
The call path from fuse into famfs largely looks like:
    if (passthrough)
        return passthrough_call()
    else if (virtiofs)
        return virtiofs_call()
    else if (famfs)
        return famfs_call()
So from a hooking-in standpoint I was trying to be compliant.
Second point: iomap is an overloaded term. The famfs iomap usage is stolen
from xfs' fs-dax iomap call patterns. I *think* that is distinct from the
stuff called iomap that handles block I/O. I mention this because maybe not
everybody who reads this will understand that famfs is, uh, kinda like
hugetlbfs except that the memory is from devdax (in 'famfs' mode, because
the old mode stopped working for file-backed maps). Famfs files are never
sparse, and they never use the page cache - which is super, super different
from a conventional file system.
The famfs_filemap_fault() path calls the dax_iomap_fault() path (which I
added to devdax in the new famfs mode, because it was in pmem but not
devdax), which always just updates a page table because the page is always
present. That means that the fault path is SUPER PERFORMANCE CRITICAL because
in heavy use there can be millions of these faults per second - and with
famfs there is NEVER EVER a read from storage to amortize the call overhead
over.
This is a super-important point. famfs_filemap_fault() is a fault handler in
the vm_operations_struct. It is called to remind the CPU where an address maps
to, because the TLB and PTE had been purged (which happens ALL THE TIME).
The ask here is to insert a BPF program as a vma fault handler. Can it work?
Probably. Will it perform? I HAVE NO IDEA, BUT THERE ARE REASONS TO WORRY
THAT IT MIGHT NOT.
I don't think this suggestion was made from a full understanding of the
performance requirements of this code path.
This is why we need a discussion with fs/mm/bpf experts. We should be able
to assemble an understanding of what the overhead of calling the BPF program
is and how many nanoseconds (or microseconds) that could possibly add.
Anything longer than the current famfs_filemap_fault() path is potentially
disastrous because the whole point of famfs is to expose memory via files,
and avoid sabotaging the performance.
An L3 cache miss costs 100ns in round numbers on fast local DRAM, and
3-5x as long on switched disaggregated memory. We cannot afford an expensive
code path resolving these mappings.
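
To put rough numbers on that, here is a back-of-the-envelope sketch. The 100ns miss cost and the "millions of faults per second" rate come from the text above; the count of extra serialized misses per fault is purely a hypothetical input, since nobody in this thread has measured it, and the function names are made up for the example. It also assumes, pessimistically, that all the faults land on one core.

```c
#include <stdint.h>

/*
 * Added stall time, in nanoseconds per second of wall time, if every
 * mapping fault pays 'extra_misses_per_fault' additional serialized
 * cache misses of 'miss_latency_ns' each. All inputs are assumptions.
 */
uint64_t fault_overhead_ns_per_sec(uint64_t faults_per_sec,
				   uint64_t extra_misses_per_fault,
				   uint64_t miss_latency_ns)
{
	return faults_per_sec * extra_misses_per_fault * miss_latency_ns;
}

/* The same figure as a percentage of one core (1 s = 1,000,000,000 ns). */
uint64_t fault_overhead_core_percent(uint64_t faults_per_sec,
				     uint64_t extra_misses_per_fault,
				     uint64_t miss_latency_ns)
{
	return fault_overhead_ns_per_sec(faults_per_sec,
					 extra_misses_per_fault,
					 miss_latency_ns) / 10000000ULL;
}
```

With 1,000,000 faults per second and 2 extra misses at 100ns, that is 200,000,000 ns of added stall per second, or 20% of a core. At 500ns per miss (the 5x end of the disaggregated range) the same inputs consume the entire core.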
This is why, at the last two LSFMMs and in the famfs documentation, I said
things like "we're exposing memory, and it must run at memory speeds".
Famfs also registers with the memory provider (devdax in famfs mode) to
receive notifications of memory failures, and uses a 'holder_operations'
pattern copied from pmem. This stuff is not in generic iomap (correct me
if that's wrong).
And finally since I've core dumped quite a bit here, I'll go ahead and add
a thought experiment that *might* rule out using a BPF program as a vma
fault handler. Could we do that with hugetlbfs without damaging performance
for memory-intensive workloads? Hugetlbfs is a pretty solid stand-in for
famfs: it never does data-movement faults, it's never sparse, and it needs
to resolve TLB/PTE/PMD/PUD faults FAST.
>
> That last sentence was an indirect way of saying: No, we're not going
> to export struct iomap to userspace. The fuse-iomap patchset provides
> all the UABI pieces we need for regular filesystems (ext4) and hardware
> adjacent filesystems (famfs) to exchange file mapping data with the
> kernel. This has been out for review since last October, but the lack
> of engagement with that patchset (or its February resubmission) doesn't
> leave me with confidence that any of it is going anywhere.
>
> Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
> upload bpf programs to generate interleaved mappings. It's not so hard
> to convert famfs' iomapping paths to use fuse-iomap, but I haven't
> helped him do that because:
>
> a) I have no idea what Miklos' thoughts are about merging any of the
> famfs stuff.
>
> b) I also have no idea what his thoughts are about fuse-iomap. The
> sparse replies are not encouraging.
>
> c) It didn't seem fair to John to make him take on a whole new patchset
> dependency given (a) and (b).
>
> d) Nobody ever replied to my reply to the LSFMM thread about "can we do
> some code review of fuse iomap without waiting three months for LSFMM?"
> I've literally done nothing with fuse-iomap for two of the three months
> requested.
>
> > > So let us please at least have a try at this. I'm not into bpf yet,
> > > but willing to learn.
>
> I sent out the patches to enable exactly this sort of experimentation
> two months ago, and have not received any responses:
>
> https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/
>
> I would like to say this as gently as possible: I don't know what the
> problem here is, Miklos -- are you uninterested in the work? Do you
> have too many other things to do inside RH that you can't talk about?
> Is it too difficult to figure out how the iomap stuff fits into the rest
> of the fuse codebase? Do you need help from the rest of us to get
> reviews done? Is there something else with which I could help?
>
> Because ... over the past few years, many of my team's filesystem
> projects have endured monthslong review cycles and often fail to get
> merged. This has led to burnout and frustration among my teammates such
> that many of them chose to move on to other things. For the remaining
> people, it was very difficult to justify continuing headcount when
> progress on projects is so slow that individuals cannot achieve even one
> milestone per quarter on any project.
>
> There's now nobody left here but me.
>
> I'm not blaming you (Miklos) for any of this, but that is the current
> deplorable state of things.
>
> > > Thanks,
> > > Miklos
> >
> > Thanks for responding...
> >
> > My short response: Noooooooooo!!!!!!
> >
> > I very strongly object to making this a prerequisite to merging. This
> > is an untested idea that will certainly delay us by at least a couple
> > of merge windows when products are shipping now, and the existing approach
> > has been in circulation for a long time. It is TOO LATE!!!!!!
>
> /me notes that "we're shipping so you have to merge it over people's
> concerns" rarely carries the day in LKML land, and has never ended well
> in the few cases that it happens. As Ted is fond of saying, this is a
> team sport, not an individual effort. Unfortunately, to abuse your
> sports metaphor, we all play for the ******* A's.
That's totally fair. This process has been very long and grueling, and I'm
not always thinking clearly.
>
> That said, you're clearly pissed at the goalposts changing yet again,
> and that's really not fair that we collectively keep moving them.
>
> It's a rotten situation that I could have even helped you to solve both
> our problems via fuse-iomap, but I just couldn't motivate myself to
> entwine our two projects until the technical direction questions got
> answered.
>
> > Famfs is not a science project, it's enablement for actual products and
> > early versions are available now!!!
> >
> > That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.
>
> Heck, the fuse command field is a u32. There are plenty of numberspace
> left, and the kernel can just *stop issuing them*.
>
> > What are the risks of converting to BPF?
> >
> > - I don't know how to do it - so it'll be slow (kinda like my fuse learning
> > curve cost about a year because this is not that similar to anything
> > else that was already in fuse.
>
> ...and per above, BPF isn't some magic savior that avoids the expansion
> of the UABI.
>
> > - Those of us who are involved don't fully understand either the security
> > or performance implications of this. It
>
> Correct. I sure think it's swell that people can inject IR programs
> that jit/link into the kernel. Don't ask which secondary connotation of
> "swell" I'm talking about.
>
> > - Famfs is enabling access to memory and mapping fault handling must be
> > at "memory speed". We know that BPF walks some data structures when a
> > program executes. That exposes us to additional serialized L3 cache
> > misses each time we service a mapping fault (any TLB & page table miss).
> > This should be studied side-by-side with the existing approach under
> > multiple loads before being adopted for production.
>
> Yes, it should. AFAICT if one switched to a per-inode bpf program, then
> you could do per-inode bpf programs. Then you don't even need the bpf
> map, and the ->iomap_begin becomes an indirect call into JITted x86_64
> math code.
>
> (The downside is that dyn code can't be meaningfully signed, requires
> clang on the system, and you have to deal with inode eviction issues.)
>
> > - This has never been done in production, and we're throwing it in the way
> > of a project that has been soaking for years and needs to support early
> > shipments of products.
>
> Correct. I haven't even implemented BPF-iomap for fuse4fs. This BPF
> integration stuff is *highly* experimental code.
>
> > If this is the only path, I'd like to revive famfs as a standalone file
> > system. I'm still maintaining that and it's still in use.
>
> Honestly, you should probably just ship that to your users. As long as
> the ondisk format doesn't change much, switching the implementation at a
> later date is at least still possible.
>
> --D
And apologies to the polite universe for being a bit raw earlier. Getting
this far has been quite a grind...
Thanks,
John
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-14 23:37 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm
In-Reply-To: <CAEvNRgFkusZeKxGctUpTTbYjdi7nZL1ZZar-gT7XRUOCZ2xtpw@mail.gmail.com>
On Wed, Apr 01, 2026 at 03:38:12PM -0700, Ackerley Tng wrote:
> Michael Roth <michael.roth@amd.com> writes:
>
> >
> > [...snip...]
> >
> >> static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> >> {
> >> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >> return -EINVAL;
> >> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> >> return -EINVAL;
> >> + if (attrs->error_offset)
> >> + return -EINVAL;
> >> for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> >> if (attrs->reserved[i])
> >> return -EINVAL;
> >> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >> return 1;
> >> case KVM_CAP_GUEST_MEMFD_FLAGS:
> >> return kvm_gmem_get_supported_flags(kvm);
> >> + case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> >> + if (vm_memory_attributes)
> >> + return 0;
> >> +
> >> + return kvm_supported_mem_attributes(kvm);
> >
> > Based on the discussion from the PUCK call this morning,
>
> Thanks for copying the discussion here, I'll start attending PUCK to
> catch those discussions too :)
>
> > it sounds like it
> > would be a good idea to limit kvm_supported_mem_attributes() to only
> > reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
> > implementation has all the necessary enablement to support in-place
> > conversion via guest_memfd. In the case of SNP, there is a
> > documentation/parameter check in snp_launch_update() that needs to be
> > relaxed in order for userspace to be able to pass in a NULL 'src'
> > parameter (since, for in-place conversion, it would be initialized in place
> > as shared memory prior to the call, since by the time kvm_gmem_populate()
> > is called it will have been set to private and therefore cannot be faulted
> > in via GUP (and if it could, we'd be unnecessarily copying the src back on
> > top of itself since src/dst are the same)).
>
> Could this be a separate thing? If I'm understanding you correctly, it's
> not strictly a requirement for snp_launch_update() to first support a
> NULL 'src' parameter before this series lands.
I think we already sync'd up on this during PUCK, but for the benefit
of others: Sean pointed out that if we don't then we'll need to add yet
another capability so userspace can determine when it can actually do
in-place conversion for SNP.
Right now, this series effectively advertises in-place conversion at the
point where KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reports
'KVM_MEMORY_ATTRIBUTE_PRIVATE', so I slightly reworked the series to
include the snp_launch_update() change prior to that point in time in
the series. Thanks to prereqs and changes/requirements you've already
pulled in, it's just one additional patch now:
KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
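
As a rough sketch of what that means for userspace: the value returned by KVM_CHECK_EXTENSION for KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES can be tested for the private attribute bit to decide whether in-place conversion is usable. The helper name below is hypothetical, and the bit is redefined locally (mirroring KVM_MEMORY_ATTRIBUTE_PRIVATE in the uapi header) only so the snippet builds without kernel headers; real code should include <linux/kvm.h> and issue the ioctl itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local stand-in for KVM_MEMORY_ATTRIBUTE_PRIVATE from <linux/kvm.h>,
 * redefined here so the example is self-contained.
 */
#define EXAMPLE_KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)

/*
 * 'cap_ret' is the return value of KVM_CHECK_EXTENSION for
 * KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES. Zero (unsupported, or per-VM
 * memory attributes in use) and negative errors both mean "no".
 */
bool gmem_supports_inplace_conversion(long cap_ret)
{
	if (cap_ret <= 0)
		return false;
	return ((uint64_t)cap_ret & EXAMPLE_KVM_MEMORY_ATTRIBUTE_PRIVATE) != 0;
}
```

The point of ordering the patches this way is that a true result from such a check should imply snp_launch_update() already accepts a NULL 'uaddr', so no second capability is needed.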
I also did some minor updates (prefixed with a "[squash]" tag) to advertise
the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by
userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
during PUCK.
The branch is here, with the patches moved to where I think they
should remain (or be squashed in for the [squash] ones):
https://github.com/AMDESE/linux/commits/guest_memfd-inplace-conversion-v4-snp2/
I've also updated the QEMU patches to use the agreed-upon API flow and
pushed them here:
https://github.com/AMDESE/qemu/commits/snp-inplace-for-v4-wip2/
To start an SNP guest with in-place conversion:
qemu-system-x86 \
-machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
-object sev-snp-guest,id=sev0,...,convert-in-place=true \
-object memory-backend-memfd,id=ram1,size=16G,share=true,reserve=false
To start a normal non-CoCo guest backed by guest_memfd with shared memory:
qemu-system-x86 \
-machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
-object memory-backend-memfd,id=ram1,size=16G,share=true,reserve=false
Thanks,
Mike
^ permalink raw reply
* Re: [PATCH V10 00/10] famfs: port into fuse
From: Darrick J. Wong @ 2026-04-14 23:36 UTC (permalink / raw)
To: Joanne Koong
Cc: John Groves, Miklos Szeredi, Bernd Schubert, John Groves,
Dan Williams, Bernd Schubert, Alison Schofield, John Groves,
Jonathan Corbet, Shuah Khan, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Randy Dunlap, Jeff Layton, Amir Goldstein,
Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <CAJnrk1ZgcMuwfMpT1fXvUwBBiq9eWFHWVeOFQFFKiamGGe1RJg@mail.gmail.com>
On Tue, Apr 14, 2026 at 03:13:57PM -0700, Joanne Koong wrote:
> On Tue, Apr 14, 2026 at 11:57 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> > > On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > > > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > > Overall, my intention with bringing this up is just to make sure we're
> > > > > at least aware of this alternative before anything is merged and
> > > > > permanent. If Miklos and you think we should land this series, then
> > > > > I'm on board with that.
> > > >
> > > > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > > > absolutely necessary. This was the main sticking point originally,
> > > > but there seemed to be no better alternative.
> > > >
> > > > However with the bpf approach this would be gone, which is great.
> >
> > Well... you can't get away with having *no* mapping interface at all.
>
> Yes but the mapping interface should be *generic*, not one that is so
> specifically tailored to one server. fuse will have to support this
> forever.
<nod> On second thought, there's a way to read Miklos' sentence that I
hadn't thought of before:
"However, with the [fuse-iomap] bpf approach, this [famfs specific
mapping interface] would be gone, which is great."
vs. the way I had thought:
"However, with the bpf approach, this [famfs specific mapping interface]
would be gone [in favor of filling out a struct iomap directly], which
is great."
So maybe Miklos actually /has/ at least read all the way through the
February posting, though I have no data to make such a conclusion. :/
> > You still have to define a UABI that BPF programs can use to convey
> > mapping data into fsdax/iomap. BTF is a nice piece of work that smooths
> > over minor fluctuations in struct layout between a running kernel and
> > a precompiled BPF program, but fundamentally we still need a fuse-native
> > representation.
> >
> > That last sentence was an indirect way of saying: No, we're not going
> > to export struct iomap to userspace. The fuse-iomap patchset provides
> > all the UABI pieces we need for regular filesystems (ext4) and hardware
> > adjacent filesystems (famfs) to exchange file mapping data with the
> > kernel. This has been out for review since last October, but the lack
> > of engagement with that patchset (or its February resubmission) doesn't
> > leave me with confidence that any of it is going anywhere.
> >
> > Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
> > upload bpf programs to generate interleaved mappings. It's not so hard
> > to convert famfs' iomapping paths to use fuse-iomap, but I haven't
> > helped him do that because:
> >
> > a) I have no idea what Miklos' thoughts are about merging any of the
> > famfs stuff.
> >
> > b) I also have no idea what his thoughts are about fuse-iomap. The
> > sparse replies are not encouraging.
> >
> > c) It didn't seem fair to John to make him take on a whole new patchset
> > dependency given (a) and (b).
> >
> > d) Nobody ever replied to my reply to the LSFMM thread about "can we do
> > some code review of fuse iomap without waiting three months for LSFMM?"
> > I've literally done nothing with fuse-iomap for two of the three months
> > requested.
> >
> > > > So let us please at least have a try at this. I'm not into bpf yet,
> > > > but willing to learn.
> >
> > I sent out the patches to enable exactly this sort of experimentation
> > two months ago, and have not received any responses:
> >
> > https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/
> >
> > I would like to say this as gently as possible: I don't know what the
> > problem here is, Miklos -- are you uninterested in the work? Do you
> > have too many other things to do inside RH that you can't talk about?
> > Is it too difficult to figure out how the iomap stuff fits into the rest
> > of the fuse codebase? Do you need help from the rest of us to get
> > reviews done? Is there something else with which I could help?
> >
> > Because ... over the past few years, many of my team's filesystem
> > projects have endured monthslong review cycles and often fail to get
> > merged. This has led to burnout and frustration among my teammates such
> > that many of them chose to move on to other things. For the remaining
> > people, it was very difficult to justify continuing headcount when
> > progress on projects is so slow that individuals cannot achieve even one
> > milestone per quarter on any project.
> >
> > There's now nobody left here but me.
> >
> > I'm not blaming you (Miklos) for any of this, but that is the current
> > deplorable state of things.
> >
> > > > Thanks,
> > > > Miklos
> > >
> > > Thanks for responding...
> > >
> > > My short response: Noooooooooo!!!!!!
> > >
> > > I very strongly object to making this a prerequisite to merging. This
> > > is an untested idea that will certainly delay us by at least a couple
> > > of merge windows when products are shipping now, and the existing approach
> > > has been in circulation for a long time. It is TOO LATE!!!!!!
> >
> > /me notes that "we're shipping so you have to merge it over people's
> > concerns" rarely carries the day in LKML land, and has never ended well
> > in the few cases that it happens. As Ted is fond of saying, this is a
> > team sport, not an individual effort. Unfortunately, to abuse your
> > sports metaphor, we all play for the ******* A's.
> >
> > That said, you're clearly pissed at the goalposts changing yet again,
> > and that's really not fair that we collectively keep moving them.
> >
> > It's a rotten situation that I could have even helped you to solve both
> > our problems via fuse-iomap, but I just couldn't motivate myself to
> > entwine our two projects until the technical direction questions got
> > answered.
> >
> > > Famfs is not a science project, it's enablement for actual products and
> > > early versions are available now!!!
> > >
> > > That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.
> >
> > Heck, the fuse command field is a u32. There are plenty of numberspace
> > left, and the kernel can just *stop issuing them*.
>
> I don't think the problem is the command field. As I understand it, if
> this lands and is converted over later, none of the famfs code in this
> series can be removed from fuse. If fuse has native non-bpf support
> for famfs, then it will always need to have that. That's the part that
> worries me.
>
> >
> > > What are the risks of converting to BPF?
>
> I think maybe there is a misinterpretation of what the alternative
> approach entails. From my point of view, the alternative approach is
> not that different from what is already in this series. The only piece
> of the famfs logic that would need to use bpf is the logic for
> finding/computing the extent mappings (which is the famfs-specific
> logic that would not be applicable to any other server). That famfs
> bpf code is minimal and already written [1], as it is just the logic
Remember where struct fuse_iomap_io came from -- the fuse-iomap
patchset. It would be rather odd to start accepting fuse_iomap_io
objects from a user's bpf program without examining the rest of the fuse
iomap stuff.
> that is in patch 6 [2] in this series copied over. No other part of
> famfs touches bpf. The rest is renaming the functions in
> fs/fuse/famfs.c to generic fuse_iomap_dax_XXX names (the logic is the
> same logic in this series, eg invoking the lower-level calls to
> dax_iomap_rw/fault/etc) and moving the daxdev setup/initialization to
> connection initialization time where the server passes that daxdev
> setup info/configs upfront. I don't think this would delay things by
> several merge windows, as the code is already mostly written. If it
> would be helpful, I can clean up what's in the prototype and send that
> out.
I agree that you and I and John could probably get the code and review
part wrapped up in perhaps two merge windows -- one for fuse-iomap,
and the second for famfs. The userspace parts of both are more or less
done, which would minimize the amount of rework when we get to the
libfuse part.
(Let's be honest, with LSFMM happening during the week between -rc2 and
-rc3 and everyone's travel thereto, that's going to blow a big hole in
the 7.2 schedule)
The question is, would Miklos acquiesce to merging a large ball of code
that the three of us have been collaborating on? Even if he wasn't
deeply involved in that collaboration?
> I think the part that is not clear yet and needs to be verified is
> whether this approach runs into any technical limitations on famfs's
> production workloads. For example, does the overhead of using bpf maps
> lead to a noticeable performance drop on real workloads? In the
I see a custom hashtable map implementation in kernel/bpf/hashtab.c,
and no particular evidence that it can rehash itself to cut down on
bucket list chasing. That's too bad, because rhashtable rehashing is
generally effective at keeping the xfs icache pointer chasing down.
If we have a per-inode famfs_file_meta object, I wonder if we could just
attach it to the fuse_inode as a void *private pointer? That wouldn't
be any worse than current famfs.
> future, will there be too many extent mappings on high-scale systems
> to make this feasible? etc. If there are technical reasons why the
I've asked that question (are we going to have millions of mappings?)
before. From what John has told me and what I've seen with cxl and pmem
devices before that, the memory manager is heavily incentivized to give
out large static(ish) allocations to constrain the metadata overhead,
enable the use of PMD/PGD TLB entries, and minimize pointer chasing
through mapping structures.
The only reason we let that happen in the disk filesystems is that the
IO service times are so high nobody cares about L3 misses.
> famfs logic has to be in fuse, then imo we should figure that out and
> ideally that's the discussion we should be having. I am not a cxl
> expert so perhaps there is something missing in the approach that
> makes it not sufficient on production systems. If we don't end up
> going with the alternative approach, I still think this series should
> try to make the famfs uapi additions to fuse as generic as possible
> since that will be irreversible.
<nod>
> If we expedited the alternative approach in terms of reviewing and
> merging, would that suffice? Is the main pushback the timing of it, eg
> that it would take too long to get reviewed, merged, and shipped?
I think John's been pretty clear that he doesn't want to drag this out
even a day longer. Given current trends this month, I might run out of
time soon too.
> > > - I don't know how to do it - so it'll be slow (kinda like my fuse learning
> > > curve cost about a year because this is not that similar to anything
> > > else that was already in fuse.
> >
> > ...and per above, BPF isn't some magic savior that avoids the expansion
> > of the UABI.
>
> It doesn't avoid the expansion of the UABI but it makes the UABI
> generic (eg plenty of future servers can/will use the generic iomap
> layer).
(Oh good, nobody's talking about going the evil route and just filling out
struct iomap directly!)
> >
> > > - Those of us who are involved don't fully understand either the security
> > > or performance implications of this. It
> >
> > Correct. I sure think it's swell that people can inject IR programs
> > that jit/link into the kernel. Don't ask which secondary connotation of
> > "swell" I'm talking about.
>
> bpf is used elsewhere in the kernel (eg networking, scheduling). If it
> is the case that it is unsafe (which maybe it is, I don't know), then
> wouldn't those other areas have the same issues?
Well ok, here we go -- I don't think there are any serious technical
problems with BPF. The ability to read (and in some cases write) to
kernel memory looks like it's flexible enough to do the classification
and data collection stuff that most current bpf users want to do.
The issues I was alluding to are BPF being used as a means to get around
slow/unresponsive maintainers; and the kernel community's collective
refusal to explore any other path to building new user APIs besides
designing everything generically perfectly up front in the kernel UABI
along with all the stress that involves.
Once upon a time I tried to push on these UAPI stressfulness issues and
Linus told me I had a loose grip on reality. He's probably right.
> > > - Famfs is enabling access to memory and mapping fault handling must be
> > > at "memory speed". We know that BPF walks some data structures when a
> > > program executes. That exposes us to additional serialized L3 cache
> > > misses each time we service a mapping fault (any TLB & page table miss).
> > > This should be studied side-by-side with the existing approach under
> > > multiple loads before being adopted for production.
> >
> > Yes, it should. AFAICT if one switched to a per-inode bpf program, then
> > you could do per-inode bpf programs. Then you don't even need the bpf
> > map, and the ->iomap_begin becomes an indirect call into JITted x86_64
> > math code.
> >
> > (The downside is that dyn code can't be meaningfully signed, requires
> > clang on the system, and you have to deal with inode eviction issues.)
> >
> > > - This has never been done in production, and we're throwing it in the way
> > > of a project that has been soaking for years and needs to support early
> > > shipments of products.
> >
> > Correct. I haven't even implemented BPF-iomap for fuse4fs. This BPF
> > integration stuff is *highly* experimental code.
>
> I think what fuse4fs needs for bpf is significantly more complicated
> and intensive than what famfs needs. For famfs, the extent mapping
> logic is straightforward computation.
Agreed. For fuse4fs I'm content to let it manage the iomap cache.
> > > If this is the only path, I'd like to revive famfs as a standalone file
> > > system. I'm still maintaining that and it's still in use.
> >
> > Honestly, you should probably just ship that to your users. As long as
> > the ondisk format doesn't change much, switching the implementation at a
> > later date is at least still possible.
>
> I recognize this is an unfair situation John as you've already spent
> years working on this and did what the community asked with rewriting
> it. What I'm hoping to convey is that the approach where the extent
> computing/finding logic gets moved to bpf is not radically different
> from the famfs logic already in this patchset. In my view, moving this
> logic to bpf is more advantageous for both fuse *and* famfs
> (decoupling famfs releases from kernel releases) - it would be great
> to consider this on technical merits if expediting the timeline of the
> alternative approach would suffice.
>
> Thanks,
> Joanne
>
> [1] https://github.com/joannekoong/libfuse/blob/444fa27fa9fd2118a0dc332933197faf9bbf25aa/example/famfs.bpf.c
> [2] https://lore.kernel.org/linux-fsdevel/0100019d43e79794-0eadcf5e-b659-43f7-8fdc-dec9f4ccce14-000000@email.amazonses.com/
>
> >
> > --D
>
* Re: [PATCH V10 00/10] famfs: port into fuse
From: Gregory Price @ 2026-04-14 22:20 UTC (permalink / raw)
To: Darrick J. Wong
Cc: John Groves, Miklos Szeredi, Joanne Koong, Bernd Schubert,
John Groves, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Shuah Khan, Vishal Verma,
Dave Jiang, Matthew Wilcox, Jan Kara, Alexander Viro,
David Hildenbrand, Christian Brauner, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Aravind Ramesh,
Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <20260414185740.GA604658@frogsfrogsfrogs>
On Tue, Apr 14, 2026 at 11:57:40AM -0700, Darrick J. Wong wrote:
> >
> > I very strongly object to making this a prerequisite to merging. This
> > is an untested idea that will certainly delay us by at least a couple
> > of merge windows when products are shipping now, and the existing approach
> > has been in circulation for a long time. It is TOO LATE!!!!!!
>
...
>
> That said, you're clearly pissed at the goalposts changing yet again,
> and that's really not fair that we collectively keep moving them.
>
This seems a bit more than moving a goalpost.
We're now gating working software, for real working hardware, on a novel,
unproven BPF ops structure that controls page table mappings on page table
faults which would be used by exactly 1 user: FAMFS.
And that singular user is harmed because it turns an O(1) offset
calculation into a pointer chase - on the hottest path (every fault).
John is right to push back here.
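For anyone following along, the O(1) offset calculation can be sketched in
plain C. This is only an illustration and assumes a simple round-robin
striped layout; struct stripe_layout, struct stripe_target and stripe_map()
are hypothetical names made up for this example, not taken from
fs/fuse/famfs.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical striped layout; NOT the actual famfs metadata format. */
struct stripe_layout {
	uint64_t chunk_size;	/* bytes per chunk on a single strip */
	uint32_t nstrips;	/* number of strips (backing devices) */
};

/* Result of resolving a file offset. */
struct stripe_target {
	uint32_t strip;		/* index of the strip holding the byte */
	uint64_t strip_offset;	/* byte offset within that strip */
};

/*
 * Resolve a file offset to (strip, offset) with pure arithmetic:
 * no map lookup and no pointer chasing.
 */
static struct stripe_target stripe_map(const struct stripe_layout *l,
					uint64_t file_off)
{
	assert(l->chunk_size > 0 && l->nstrips > 0);

	uint64_t chunk = file_off / l->chunk_size;	/* global chunk index */
	struct stripe_target t = {
		.strip = (uint32_t)(chunk % l->nstrips),
		.strip_offset = (chunk / l->nstrips) * l->chunk_size +
				file_off % l->chunk_size,
	};

	return t;
}
```

The whole lookup is a handful of divisions and multiplications on values
that are already in hand, which is the property the fault path cares about.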
---
That said - I'm looking at fs/fuse/famfs.c and I'm asking myself what in
here is actually famfs-specific. If you just s/FAMFS/DAX/g - the file
just reads like a simple DAX-iomap backend with optional striping.
Would it be reasonable to refactor the dax layer (and users) to
create an ops structure that becomes the basis for the BPF solution?
We don't even know what the whole BPF scope is, and it seems wholly
unfair to John and his users to make that solely their problem (for
negative value!).
~Gregory
* Re: [PATCH V10 00/10] famfs: port into fuse
From: Joanne Koong @ 2026-04-14 22:13 UTC (permalink / raw)
To: Darrick J. Wong
Cc: John Groves, Miklos Szeredi, Bernd Schubert, John Groves,
Dan Williams, Bernd Schubert, Alison Schofield, John Groves,
Jonathan Corbet, Shuah Khan, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Randy Dunlap, Jeff Layton, Amir Goldstein,
Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <20260414185740.GA604658@frogsfrogsfrogs>
On Tue, Apr 14, 2026 at 11:57 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> > On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > Overall, my intention with bringing this up is just to make sure we're
> > > > at least aware of this alternative before anything is merged and
> > > > permanent. If Miklos and you think we should land this series, then
> > > > I'm on board with that.
> > >
> > > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > > absolutely necessary. This was the main sticking point originally,
> > > but there seemed to be no better alternative.
> > >
> > > However with the bpf approach this would be gone, which is great.
>
> Well... you can't get away with having *no* mapping interface at all.
Yes but the mapping interface should be *generic*, not one that is so
specifically tailored to one server. fuse will have to support this
forever.
> You still have to define a UABI that BPF programs can use to convey
> mapping data into fsdax/iomap. BTF is a nice piece of work that smooths
> over minor fluctuations in struct layout between a running kernel and
> a precompiled BPF program, but fundamentally we still need a fuse-native
> representation.
>
> That last sentence was an indirect way of saying: No, we're not going
> to export struct iomap to userspace. The fuse-iomap patchset provides
> all the UABI pieces we need for regular filesystems (ext4) and hardware
> adjacent filesystems (famfs) to exchange file mapping data with the
> kernel. This has been out for review since last October, but the lack
> of engagement with that patchset (or its February resubmission) doesn't
> leave me with confidence that any of it is going anywhere.
>
> Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
> upload bpf programs to generate interleaved mappings. It's not so hard
> to convert famfs' iomapping paths to use fuse-iomap, but I haven't
> helped him do that because:
>
> a) I have no idea what Miklos' thoughts are about merging any of the
> famfs stuff.
>
> b) I also have no idea what his thoughts are about fuse-iomap. The
> sparse replies are not encouraging.
>
> c) It didn't seem fair to John to make him take on a whole new patchset
> dependency given (a) and (b).
>
> d) Nobody ever replied to my reply to the LSFMM thread about "can we do
> some code review of fuse iomap without waiting three months for LSFMM?"
> I've literally done nothing with fuse-iomap for two of the three months
> requested.
>
> > > So let us please at least have a try at this. I'm not into bpf yet,
> > > but willing to learn.
>
> I sent out the patches to enable exactly this sort of experimentation
> two months ago, and have not received any responses:
>
> https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/
>
> I would like to say this as gently as possible: I don't know what the
> problem here is, Miklos -- are you uninterested in the work? Do you
> have too many other things to do inside RH that you can't talk about?
> Is it too difficult to figure out how the iomap stuff fits into the rest
> of the fuse codebase? Do you need help from the rest of us to get
> reviews done? Is there something else with which I could help?
>
> Because ... over the past few years, many of my team's filesystem
> projects have endured monthslong review cycles and often fail to get
> merged. This has led to burnout and frustration among my teammates such
> that many of them chose to move on to other things. For the remaining
> people, it was very difficult to justify continuing headcount when
> progress on projects is so slow that individuals cannot achieve even one
> milestone per quarter on any project.
>
> There's now nobody left here but me.
>
> I'm not blaming you (Miklos) for any of this, but that is the current
> deplorable state of things.
>
> > > Thanks,
> > > Miklos
> >
> > Thanks for responding...
> >
> > My short response: Noooooooooo!!!!!!
> >
> > I very strongly object to making this a prerequisite to merging. This
> > is an untested idea that will certainly delay us by at least a couple
> > of merge windows when products are shipping now, and the existing approach
> > has been in circulation for a long time. It is TOO LATE!!!!!!
>
> /me notes that "we're shipping so you have to merge it over people's
> concerns" rarely carries the day in LKML land, and has never ended well
> in the few cases that it happens. As Ted is fond of saying, this is a
> team sport, not an individual effort. Unfortunately, to abuse your
> sports metaphor, we all play for the ******* A's.
>
> That said, you're clearly pissed at the goalposts changing yet again,
> and that's really not fair that we collectively keep moving them.
>
> It's a rotten situation that I could have even helped you to solve both
> our problems via fuse-iomap, but I just couldn't motivate myself to
> entwine our two projects until the technical direction questions got
> answered.
>
> > Famfs is not a science project, it's enablement for actual products and
> > early versions are available now!!!
> >
> > That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.
>
> Heck, the fuse command field is a u32. There is plenty of numberspace
> left, and the kernel can just *stop issuing them*.
I don't think the problem is the command field. As I understand it, if
this lands and is converted over later, none of the famfs code in this
series can be removed from fuse. If fuse has native non-bpf support
for famfs, then it will always need to have that. That's the part that
worries me.
>
> > What are the risks of converting to BPF?
I think maybe there is a misinterpretation of what the alternative
approach entails. From my point of view, the alternative approach is
not that different from what is already in this series. The only piece
of the famfs logic that would need to use bpf is the logic for
finding/computing the extent mappings (which is the famfs-specific
logic that would not be applicable to any other server). That famfs
bpf code is minimal and already written [1], as it is just the logic
that is in patch 6 [2] in this series copied over. No other part of
famfs touches bpf. The rest is renaming the functions in
fs/fuse/famfs.c to generic fuse_iomap_dax_XXX names (the logic is the
same logic in this series, eg invoking the lower-level calls to
dax_iomap_rw/fault/etc) and moving the daxdev setup/initialization to
connection initialization time where the server passes that daxdev
setup info/configs upfront. I don't think this would delay things by
several merge windows, as the code is already mostly written. If it
would be helpful, I can clean up what's in the prototype and send that
out.
I think the part that is not clear yet and needs to be verified is
whether this approach runs into any technical limitations on famfs's
production workloads. For example, does the overhead of using bpf maps
lead to a noticeable performance drop on real workloads? In the
future, will there be too many extent mappings on high-scale systems
to make this feasible? etc. If there are technical reasons why the
famfs logic has to be in fuse, then imo we should figure that out and
ideally that's the discussion we should be having. I am not a cxl
expert so perhaps there is something missing in the approach that
makes it not sufficient on production systems. If we don't end up
going with the alternative approach, I still think this series should
try to make the famfs uapi additions to fuse as generic as possible
since that will be irreversible.
If we expedited the alternative approach in terms of reviewing and
merging, would that suffice? Is the main pushback the timing of it, eg
that it would take too long to get reviewed, merged, and shipped?
> >
> > - I don't know how to do it - so it'll be slow (kinda like my fuse learning
> > curve cost about a year because this is not that similar to anything
> > else that was already in fuse.
>
> ...and per above, BPF isn't some magic savior that avoids the expansion
> of the UABI.
It doesn't avoid the expansion of the UABI but it makes the UABI
generic (eg plenty of future servers can/will use the generic iomap
layer).
>
> > - Those of us who are involved don't fully understand either the security
> > or performance implications of this. It
>
> Correct. I sure think it's swell that people can inject IR programs
> that jit/link into the kernel. Don't ask which secondary connotation of
> "swell" I'm talking about.
bpf is used elsewhere in the kernel (eg networking, scheduling). If it
is the case that it is unsafe (which maybe it is, I don't know), then
wouldn't those other areas have the same issues?
>
> > - Famfs is enabling access to memory and mapping fault handling must be
> > at "memory speed". We know that BPF walks some data structures when a
> > program executes. That exposes us to additional serialized L3 cache
> > misses each time we service a mapping fault (any TLB & page table miss).
> > This should be studied side-by-side with the existing approach under
> > multiple loads before being adopted for production.
>
> Yes, it should. AFAICT if one switched to a per-inode bpf program, then
> you could do per-inode bpf programs. Then you don't even need the bpf
> map, and the ->iomap_begin becomes an indirect call into JITted x86_64
> math code.
>
> (The downside is that dyn code can't be meaningfully signed, requires
> clang on the system, and you have to deal with inode eviction issues.)
>
> > - This has never been done in production, and we're throwing it in the way
> > of a project that has been soaking for years and needs to support early
> > shipments of products.
>
> Correct. I haven't even implemented BPF-iomap for fuse4fs. This BPF
> integration stuff is *highly* experimental code.
I think what fuse4fs needs for bpf is significantly more complicated
and intensive than what famfs needs. For famfs, the extent mapping
logic is straightforward computation.
>
> > If this is the only path, I'd like to revive famfs as a standalone file
> > system. I'm still maintaining that and it's still in use.
>
> Honestly, you should probably just ship that to your users. As long as
> the ondisk format doesn't change much, switching the implementation at a
> later date is at least still possible.
I recognize this is an unfair situation John as you've already spent
years working on this and did what the community asked with rewriting
it. What I'm hoping to convey is that the approach where the extent
computing/finding logic gets moved to bpf is not radically different
from the famfs logic already in this patchset. In my view, moving this
logic to bpf is more advantageous for both fuse *and* famfs
(decoupling famfs releases from kernel releases) - it would be great
to consider this on technical merits if expediting the timeline of the
alternative approach would suffice.
Thanks,
Joanne
[1] https://github.com/joannekoong/libfuse/blob/444fa27fa9fd2118a0dc332933197faf9bbf25aa/example/famfs.bpf.c
[2] https://lore.kernel.org/linux-fsdevel/0100019d43e79794-0eadcf5e-b659-43f7-8fdc-dec9f4ccce14-000000@email.amazonses.com/
>
> --D
* Re: [PATCH v2 04/12] tick/nohz: Transition to dynamic full dynticks state management
From: Thomas Gleixner @ 2026-04-14 21:57 UTC (permalink / raw)
To: Qiliang Yuan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar, Tejun Heo,
Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Waiman Long,
Chen Ridong, Michal Koutný, Jonathan Corbet, Shuah Khan,
Shuah Khan
Cc: linux-kernel, rcu, linux-mm, cgroups, linux-doc, linux-kselftest,
Qiliang Yuan
In-Reply-To: <20260413-wujing-dhm-v2-4-06df21caba5d@gmail.com>
On Mon, Apr 13 2026 at 15:43, Qiliang Yuan wrote:
> Context:
> Full dynticks (NOHZ_FULL) is typically a static configuration determined
> at boot time. DHEI extends this to support runtime activation.
I have no idea what DHEI is. Provide proper information and not magic
acronyms.
> Problem:
> Switching to NOHZ_FULL at runtime requires careful synchronization
> of context tracking and housekeeping states. Re-invoking setup logic
> multiple times could lead to inconsistencies or warnings, and RCU
> dependency checks often prevented tick suppression in Zero-Conf setups.
And that careful synchronization is best achieved with an opaque
notifier callchain which relies on build time ordering. Impressive.
> Solution:
> - Replace the static tick_nohz_full_enabled() checks with a dynamic
> tick_nohz_full_running state variable.
That variable existed before and you are telling the what and not why
this is required and how that is correct vs. the other checks in
tick_nohz_full_enabled(). Also what's static about that function aside
of being marked static inline?
> - Refactor tick_nohz_full_setup to be safe for runtime invocation,
> adding guards against re-initialization and ensuring IRQ work
> interrupt support.
Refactoring has to be done in a preparatory patch and not
> - Implement boot-time pre-activation of context tracking (shadow
> init) for all possible CPUs to avoid instruction flow issues during
> dynamic transitions.
Again lots of hand waving without a proper explanation.
> - Hook into housekeeping_notifier_list to update NO_HZ states dynamically.
See above.
> This provides the core state machine for reliable, on-demand tick
> suppression and high-performance isolation.
I can find a lot of hacks, but definitely not the slightest notion of a
state machine. Don't throw random buzzwords into a changelog if there is
no evidence for their existence.
> +static int tick_nohz_housekeeping_reconfigure(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + struct housekeeping_update *upd = data;
> + int cpu;
> +
> + if (action == HK_UPDATE_MASK && upd->type == HK_TYPE_TICK) {
> + cpumask_var_t non_housekeeping_mask;
> +
> + if (!alloc_cpumask_var(&non_housekeeping_mask, GFP_KERNEL))
> + return NOTIFY_BAD;
> +
> + cpumask_andnot(non_housekeeping_mask, cpu_possible_mask, upd->new_mask);
> +
> + if (!tick_nohz_full_mask) {
> + if (!zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
> + free_cpumask_var(non_housekeeping_mask);
> + return NOTIFY_BAD;
> + }
> + }
> +
> + /* Kick all CPUs to re-evaluate tick dependency before change */
> + for_each_online_cpu(cpu)
> + tick_nohz_full_kick_cpu(cpu);
That solves what?
> + cpumask_copy(tick_nohz_full_mask, non_housekeeping_mask);
What's the exact point of this non_housekeeping_mask?
Why can't you simply do:
cpumask_andnot(tick_nohz_full_mask, cpu_possible_mask, upd->new_mask);
That'd be too simple and comprehensible, right?
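The simplification can be modelled in user space. This is a sketch only:
cpumasks are modelled as a single 64-bit word, and mask_andnot() and
update_nohz_full_mask() are hypothetical stand-ins, not kernel functions.
The argument order mirrors cpumask_andnot(dst, src1, src2), which computes
src1 & ~src2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, max 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* dst = src1 & ~src2, same argument order as cpumask_andnot(). */
static void mask_andnot(cpumask_model_t *dst, cpumask_model_t src1,
			cpumask_model_t src2)
{
	*dst = src1 & ~src2;
}

/*
 * Write the result straight into the destination mask. No temporary
 * non-housekeeping mask is needed, so there is no extra allocation
 * and no extra failure path to unwind.
 */
static void update_nohz_full_mask(cpumask_model_t *nohz_full,
				  cpumask_model_t possible,
				  cpumask_model_t housekeeping)
{
	/* Model assumption: housekeeping CPUs are a subset of possible CPUs. */
	assert((housekeeping & ~possible) == 0);

	mask_andnot(nohz_full, possible, housekeeping);
}
```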
> + tick_nohz_full_running = !cpumask_empty(tick_nohz_full_mask);
> +
> + /*
> + * If nohz_full is running, the timer duty must be on a housekeeper.
> + * If the current timer CPU is not a housekeeper, or no duty is assigned,
> + * pick the first housekeeper and assign it.
> + */
> + if (tick_nohz_full_running) {
> + int timer_cpu = READ_ONCE(tick_do_timer_cpu);
New line between declaration and code.
> + if (timer_cpu == TICK_DO_TIMER_NONE ||
> + !cpumask_test_cpu(timer_cpu, upd->new_mask)) {
No line break required. You have 100 characters.
> + int next_timer = cpumask_first(upd->new_mask);
next_timer? Please pick variable names which are comprehensible and self
explaining. Also why can't you re-use timer_cpu, which would be actually useful?
> + if (next_timer < nr_cpu_ids)
How can upd->new_mask be empty? That'd be a bug, no?
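Re-using timer_cpu could look roughly like the following user-space model.
Assumptions: the housekeeping mask is a 64-bit word,
TICK_DO_TIMER_NONE_MODEL is a local stand-in for the kernel constant, and
first_cpu() and pick_timer_cpu() are hypothetical helpers. An empty
housekeeping mask is treated as a bug and asserted on instead of being
silently skipped.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's "no CPU owns the timer duty" value. */
#define TICK_DO_TIMER_NONE_MODEL	(-1)

/* Return the lowest set bit index, or 64 if the mask is empty. */
static int first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & (1ULL << cpu))
			return cpu;
	}
	return 64;
}

/*
 * Keep the current timer CPU if it is a housekeeper, otherwise move the
 * duty to the first housekeeper. The same variable carries the old and
 * the new value, so no second "next_timer" variable is needed.
 */
static int pick_timer_cpu(int timer_cpu, uint64_t housekeeping)
{
	/* An empty housekeeping mask would be a caller bug. */
	assert(housekeeping != 0);
	assert(timer_cpu == TICK_DO_TIMER_NONE_MODEL ||
	       (timer_cpu >= 0 && timer_cpu < 64));

	if (timer_cpu == TICK_DO_TIMER_NONE_MODEL || !(housekeeping & (1ULL << timer_cpu)))
		timer_cpu = first_cpu(housekeeping);

	return timer_cpu;
}
```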
> + WRITE_ONCE(tick_do_timer_cpu, next_timer);
> + }
> + }
> +
> + /* Kick all CPUs again to apply new nohz full state */
> + for_each_online_cpu(cpu)
> + tick_nohz_full_kick_cpu(cpu);
This whole thing lacks an explanation why it is even remotely correct.
> void __init tick_nohz_init(void)
...
> + if (!tick_nohz_full_mask) {
> + if (!slab_is_available())
> + alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> + else
> + zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL);
> }
I've seen the same code sequence before. Copy & paste is simpler than
providing helper functions.....
> - if (IS_ENABLED(CONFIG_PM_SLEEP_SMP) &&
> - !IS_ENABLED(CONFIG_PM_SLEEP_SMP_NONZERO_CPU)) {
> - cpu = smp_processor_id();
> + housekeeping_register_notifier(&tick_nohz_housekeeping_nb);
>
> - if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
> - pr_warn("NO_HZ: Clearing %d from nohz_full range "
> - "for timekeeping\n", cpu);
> - cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> + if (tick_nohz_full_running) {
This indentation and the resulting goto mess can be completely avoided
if you actually refactor the code and not just claim to do so.
Again, this does too many things at once and then explains them badly,
which makes it unreviewable.
Thanks,
tglx
* Re: [PATCH v2 02/12] sched/isolation: Introduce housekeeping notifier infrastructure
From: Thomas Gleixner @ 2026-04-14 21:25 UTC (permalink / raw)
To: Qiliang Yuan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar, Tejun Heo,
Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Waiman Long,
Chen Ridong, Michal Koutný, Jonathan Corbet, Shuah Khan,
Shuah Khan
Cc: linux-kernel, rcu, linux-mm, cgroups, linux-doc, linux-kselftest,
Qiliang Yuan
In-Reply-To: <20260413-wujing-dhm-v2-2-06df21caba5d@gmail.com>
On Mon, Apr 13 2026 at 15:43, Qiliang Yuan wrote:
>
> +int housekeeping_register_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&housekeeping_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_register_notifier);
> +
> +int housekeeping_unregister_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_unregister(&housekeeping_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_unregister_notifier);
As I said before, notifiers are a horrible interface especially for
things where most callers are built-in. Especially providing proper
ordering of the callbacks is a badly defined mechanism as demonstrated
by the now eliminated CPU hotplug notifiers.
> +int housekeeping_update_notify(enum hk_type type, const struct cpumask *new_mask)
> +{
> + struct housekeeping_update update = {
> + .type = type,
> + .new_mask = new_mask,
> + };
> +
> + return blocking_notifier_call_chain(&housekeeping_notifier_list, HK_UPDATE_MASK, &update);
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_update_notify);
Why is this exported? Are random modules allowed to invoke this?
* Re: [PATCH v2 05/12] genirq: Support dynamic migration for managed interrupts
From: Thomas Gleixner @ 2026-04-14 21:21 UTC (permalink / raw)
To: Qiliang Yuan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar, Tejun Heo,
Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Waiman Long,
Chen Ridong, Michal Koutný, Jonathan Corbet, Shuah Khan,
Shuah Khan
Cc: linux-kernel, rcu, linux-mm, cgroups, linux-doc, linux-kselftest,
Qiliang Yuan
In-Reply-To: <20260413-wujing-dhm-v2-5-06df21caba5d@gmail.com>
On Mon, Apr 13 2026 at 15:43, Qiliang Yuan wrote:
> + irq_lock_sparse();
> + for_each_active_irq(irq) {
> + struct irq_data *irqd;
Please move the declaration into the scope where it is used.
> + struct irq_desc *desc;
> +
> + desc = irq_to_desc(irq);
> + if (!desc)
> + continue;
> +
> + scoped_guard(raw_spinlock_irqsave, &desc->lock) {
> + irqd = irq_desc_get_irq_data(desc);
> + if (!irqd_affinity_is_managed(irqd) || !desc->action ||
> + !irq_data_get_irq_chip(irqd))
> + continue;
That's a pretty random choice of conditions.
> + /*
> + * Re-apply existing affinity to honor the new
> + * housekeeping mask via __irq_set_affinity() logic.
> + */
> + irq_set_affinity_locked(irqd, irq_data_get_affinity_mask(irqd), false);
That's not sufficient. Assume an interrupt was shut down before the
change because there was no online CPU in the affinity mask, but now the
affinity mask changes so there is an online CPU. What starts it up?
Same the other way around.
> +static struct notifier_block irq_housekeeping_nb = {
> + .notifier_call = irq_housekeeping_reconfigure,
> +};
> +
> +static int __init irq_init_housekeeping_notifier(void)
> +{
> + housekeeping_register_notifier(&irq_housekeeping_nb);
> + return 0;
> +}
> +core_initcall(irq_init_housekeeping_notifier);
I fundamentally despise notifiers especially when they are just invoking
something which is built in.
* Re: [RFC PATCH] Documentation: Add managed interrupts
From: Aaron Tomlin @ 2026-04-14 20:10 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Valentin Schneider, linux-doc, linux-kernel, Christoph Hellwig,
Frederic Weisbecker, Jens Axboe, Jonathan Corbet, Ming Lei,
Thomas Gleixner, Waiman Long, Peter Zijlstra, John Ogness
In-Reply-To: <20260413155726.BpD5Eh0T@linutronix.de>
On Mon, Apr 13, 2026 at 05:57:26PM +0200, Sebastian Andrzej Siewior wrote:
> For the managed_irq you could argue that this could also use some
> runtime configuration at which point isolcpus= would have a runtime
> counterpart and could be removed.
> After going through all this I concluded that it makes hardly sense
> since you would require callbacks in every driver using it or other
> magic "to reconfigure" but it already makes little sense using it.
>
> Either way, I don't see anything wrong with using isolcpus=domain if you
> have a static setup and need/ want reconfigure at runtime.
>
Hi Sebastian,
I completely agree.
--
Aaron Tomlin
* [PATCH] docs: kernel-parameters: document scope of irqaffinity= parameter
From: Aaron Tomlin @ 2026-04-14 20:02 UTC (permalink / raw)
To: corbet, skhan
Cc: tglx, akpm, bp, rdunlap, dave.hansen, feng.tang,
pawan.kumar.gupta, dapeng1.mi, kees, elver, paulmck, lirongqing,
bhelgaas, linux-doc, linux-kernel
System administrators frequently use the "irqaffinity=" boot parameter
in conjunction with CPU isolation to build deterministic, latency-free
environments. However, there is a widespread misconception that
"irqaffinity=" acts as a global, absolute override for all hardware
interrupts.
In reality, "irqaffinity=" strictly populates the irq_default_affinity
mask. When the kernel allocates multiqueue vectors
(e.g., irq_create_affinity_masks()), it explicitly bypasses this default
mask for managed interrupts. Instead, it relies on dynamic spreading
algorithms to map queues to the available topology, effectively
overriding any default the administrator set via the command line.
This patch explicitly documents this limitation in kernel-parameters.txt
to set correct expectations and directs users to the appropriate
"isolcpus=" sub-parameters for managed interrupt isolation.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9ed7c3ecd158..40ca92d8cf04 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2732,6 +2732,14 @@ Kernel parameters
irqaffinity= [SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
+ Note: This parameter only sets the default affinity
+ for unmanaged interrupts (e.g., legacy single-queue
+ devices or unmanaged pre/post vectors). It is
+ explicitly ignored by managed interrupts, such as
+ those utilised by modern multiqueue storage
+ controllers. To isolate CPUs from managed
+ interrupts, see the "managed_irq" flag of "isolcpus=".
+
irqchip.gicv2_force_probe=
[ARM,ARM64,EARLY]
Format: <bool>
--
2.51.0
* Re: [PATCH v5 00/24] ARM64 PMU Partitioning
From: Colton Lewis @ 2026-04-14 19:55 UTC (permalink / raw)
To: Colton Lewis
Cc: will, oupton, kvm, pbonzini, corbet, linux, catalin.marinas, maz,
oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest
In-Reply-To: <gsntpl7a694p.fsf@coltonlewis-kvm.c.googlers.com>
Colton Lewis <coltonlewis@google.com> writes:
> Will Deacon <will@kernel.org> writes:
>> On Tue, Dec 09, 2025 at 03:00:59PM -0800, Oliver Upton wrote:
>>> On Tue, Dec 09, 2025 at 08:50:57PM +0000, Colton Lewis wrote:
>>> > This series creates a new PMU scheme on ARM, a partitioned PMU that
>>> > allows reserving a subset of counters for more direct guest access,
>>> > significantly reducing overhead. More details, including performance
>>> > benchmarks, can be read in the v1 cover letter linked below.
>>> >
>>> > An overview of what this series accomplishes was presented at KVM
>>> > Forum 2025. Slides [1] and video [2] are linked below.
>>> >
>>> > The long duration between v4 and v5 is due to time spent on this
>>> > project being monopolized preparing this feature for internal
>>> > production. As a result, there are too many improvements to fully list
>>> > here, but I will cover the notable ones.
>>> Thanks for reposting. I think there's still quite a bit of ground to
>>> cover on the KVM side of this, but I would definitely appreciate it if
>>> someone with more context on the perf side of things could chime in.
>>> Will, IIRC you had some thoughts around counter allocation, right?
>> Right, I was hoping that the host counter reservation could be more
>> dynamic than a cmdline option. Perf already has support for pinning
>> events to a CPU, so the concept of some counters being unavailable
>> shouldn't be too much for the driver to handle. You might just need to
>> create some fake pinned events so that perf code understands what is
>> happening.
> Thanks Will. I have a few followup questions:
> 1. Are you suggesting this be done whenever we enter a guest so the host
> always has access to the full range in host context? That would be the
> most dynamic.
> 2. How should we handle the possibility a real event already occupies a
> counter wanted by the guest? Is there a good way to create our fake
> pinned events then force a reschedule so perf moves the real events out
> of the way?
> 3. Is there an existing fake event type that tells perf not to touch
> hardware?
> 4. Can you point to any example code that already does something like
> this?
Thank you Will and Mark for meeting with me to discuss things in person.
Here are my main takeaways so the list can comment:
Will's initial idea doesn't work because there is no way for KVM to pin
counters in a way that takes priority over counters pinned by the host
and therefore guarantee reservation.
An alternate idea I am proposing is to call the perf core
sched_in/sched_out functionality during vcpu_load/vcpu_put when guest
counters need to be reserved/unreserved.
That means having perf vacate all the host counters temporarily,
modifying the arm_pmu.cntr_mask to add/remove the appropriate counters,
then having perf schedule all host events back on the new set. Perf is
capable of doing that without any significant changes.
This is simple and should work because arm_pmu.cntr_mask is already
accessible from the vcpu struct and modifying it is already how the
existing boot-time counter reservation works.
There are some tradeoffs to this approach that will need further
consideration. The first is how to handle event groups. Perf allows
events to be grouped such that they must all be scheduled in at once. If
the host has a larger group than the number of counters available while
the vcpu is loaded, then it simply won't be able to schedule that group
in for that time period. Another is whether it will be acceptable
performance-wise to put perf sched_in/sched_out in
vcpu_load/vcpu_put. I'm unsure how much delay that would add to those
paths.
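To make the event-group tradeoff concrete, here is a toy userspace model of
the vacate / modify mask / reschedule sequence. It is only a sketch under
stated assumptions: the function names are illustrative and are not kernel
APIs, the mask is a plain integer standing in for arm_pmu.cntr_mask, and
groups are placed greedily, which is simpler than what perf really does.

```python
# Toy model only: none of these names are kernel or perf APIs.

def schedule_groups(groups, cntr_mask):
    """Greedily place event groups on the counters set in cntr_mask.

    groups: list of group sizes; each group must be scheduled as a whole.
    cntr_mask: integer bitmask of counters the host is allowed to use.
    Returns (scheduled, skipped) as two lists of group sizes.
    """
    free = bin(cntr_mask).count("1")
    scheduled, skipped = [], []
    for size in groups:
        if size <= free:
            scheduled.append(size)
            free -= size
        else:
            # The whole group stays off the hardware for this period.
            skipped.append(size)
    return scheduled, skipped


def vcpu_load(host_mask, guest_mask, groups):
    """Vacate host counters, remove the guest's counters, reschedule."""
    new_mask = host_mask & ~guest_mask
    return new_mask, schedule_groups(groups, new_mask)


def vcpu_put(host_mask, guest_mask, groups):
    """Give the guest's counters back to the host, then reschedule."""
    new_mask = host_mask | guest_mask
    return new_mask, schedule_groups(groups, new_mask)
```

With six counters, a guest reservation of four, and host groups of sizes
3 and 2, the size-3 group cannot be scheduled while the vcpu is loaded and
comes back once vcpu_put restores the mask.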
Absent strong objections, I will be posting a series using this method.
Another idea, which came to me later and was not discussed, is a middle
approach that is less dynamic but gives the user control over when the
perf sched_in/sched_out happens: expose the existing boot-time parameter
as writable in sysfs and do the sched_out/modify mask/sched_in when that
is written rather than in vcpu_load.
^ permalink raw reply
* Re: [PATCH V10 00/10] famfs: port into fuse
From: Darrick J. Wong @ 2026-04-14 18:57 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, John Groves,
Dan Williams, Bernd Schubert, Alison Schofield, John Groves,
Jonathan Corbet, Shuah Khan, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Randy Dunlap, Jeff Layton, Amir Goldstein,
Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <ad4_jFsR951c2Mtn@groves.net>
On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > Overall, my intention with bringing this up is just to make sure we're
> > > at least aware of this alternative before anything is merged and
> > > permanent. If Miklos and you think we should land this series, then
> > > I'm on board with that.
> >
> > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > absolutely necessary. This was the main sticking point originally,
> > but there seemed to be no better alternative.
> >
> > However with the bpf approach this would be gone, which is great.
Well... you can't get away with having *no* mapping interface at all.
You still have to define a UABI that BPF programs can use to convey
mapping data into fsdax/iomap. BTF is a nice piece of work that smooths
over minor fluctuations in struct layout between a running kernel and
a precompiled BPF program, but fundamentally we still need a fuse-native
representation.
That last sentence was an indirect way of saying: No, we're not going
to export struct iomap to userspace. The fuse-iomap patchset provides
all the UABI pieces we need for regular filesystems (ext4) and hardware
adjacent filesystems (famfs) to exchange file mapping data with the
kernel. This has been out for review since last October, but the lack
of engagement with that patchset (or its February resubmission) doesn't
leave me with confidence that any of it is going anywhere.
Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
upload bpf programs to generate interleaved mappings. It's not so hard
to convert famfs' iomapping paths to use fuse-iomap, but I haven't
helped him do that because:
a) I have no idea what Miklos' thoughts are about merging any of the
famfs stuff.
b) I also have no idea what his thoughts are about fuse-iomap. The
sparse replies are not encouraging.
c) It didn't seem fair to John to make him take on a whole new patchset
dependency given (a) and (b).
d) Nobody ever replied to my reply to the LSFMM thread about "can we do
some code review of fuse iomap without waiting three months for LSFMM?"
I've literally done nothing with fuse-iomap for two of the three months
requested.
> > So let us please at least have a try at this. I'm not into bpf yet,
> > but willing to learn.
I sent out the patches to enable exactly this sort of experimentation
two months ago, and have not received any responses:
https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/
I would like to say this as gently as possible: I don't know what the
problem here is, Miklos -- are you uninterested in the work? Do you
have too many other things to do inside RH that you can't talk about?
Is it too difficult to figure out how the iomap stuff fits into the rest
of the fuse codebase? Do you need help from the rest of us to get
reviews done? Is there something else with which I could help?
Because ... over the past few years, many of my team's filesystem
projects have endured monthslong review cycles and often fail to get
merged. This has led to burnout and frustration among my teammates such
that many of them chose to move on to other things. For the remaining
people, it was very difficult to justify continuing headcount when
progress on projects is so slow that individuals cannot achieve even one
milestone per quarter on any project.
There's now nobody left here but me.
I'm not blaming you (Miklos) for any of this, but that is the current
deplorable state of things.
> > Thanks,
> > Miklos
>
> Thanks for responding...
>
> My short response: Noooooooooo!!!!!!
>
> I very strongly object to making this a prerequisite to merging. This
> is an untested idea that will certainly delay us by at least a couple
> of merge windows when products are shipping now, and the existing approach
> has been in circulation for a long time. It is TOO LATE!!!!!!
/me notes that "we're shipping so you have to merge it over people's
concerns" rarely carries the day in LKML land, and has never ended well
in the few cases where it happened. As Ted is fond of saying, this is a
team sport, not an individual effort. Unfortunately, to abuse your
sports metaphor, we all play for the ******* A's.
That said, you're clearly pissed at the goalposts changing yet again,
and that's really not fair that we collectively keep moving them.
It's a rotten situation: I could even have helped you solve both of our
problems via fuse-iomap, but I just couldn't motivate myself to
entwine our two projects until the technical direction questions got
answered.
> Famfs is not a science project, it's enablement for actual products and
> early versions are available now!!!
>
> That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.
Heck, the fuse command field is a u32. There is plenty of number space
left, and the kernel can just *stop issuing them*.
> What are the risks of converting to BPF?
>
> - I don't know how to do it - so it'll be slow (kinda like my fuse learning
> curve cost about a year because this is not that similar to anything
> else that was already in fuse).
...and per above, BPF isn't some magic savior that avoids the expansion
of the UABI.
> - Those of us who are involved don't fully understand either the security
> or performance implications of this. It
Correct. I sure think it's swell that people can inject IR programs
that jit/link into the kernel. Don't ask which secondary connotation of
"swell" I'm talking about.
> - Famfs is enabling access to memory and mapping fault handling must be
> at "memory speed". We know that BPF walks some data structures when a
> program executes. That exposes us to additional serialized L3 cache
> misses each time we service a mapping fault (any TLB & page table miss).
> This should be studied side-by-side with the existing approach under
> multiple loads before being adopted for production.
Yes, it should. AFAICT if one switched to a per-inode bpf program, then
you could do per-inode bpf programs. Then you don't even need the bpf
map, and the ->iomap_begin becomes an indirect call into JITted x86_64
math code.
(The downside is that dyn code can't be meaningfully signed, requires
clang on the system, and you have to deal with inode eviction issues.)
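For a sense of what that "math code" could look like, here is a sketch of a
generic RAID-0 style interleave calculation. This is an assumption for
illustration only: the thread does not describe famfs' actual interleave
layout, and the function and parameter names (chunk_size, nstrips) are
hypothetical.

```python
def interleaved_map(file_offset, chunk_size, nstrips):
    """Map a file offset to (strip index, byte offset within that strip).

    Assumes fixed-size chunks laid out round-robin across nstrips backing
    ranges. This is generic striping, not famfs' confirmed layout.
    """
    if file_offset < 0 or chunk_size <= 0 or nstrips <= 0:
        raise ValueError("offset must be >= 0; chunk_size and nstrips > 0")
    # Which chunk of the file, and how far into that chunk.
    chunk_no, within = divmod(file_offset, chunk_size)
    # Which full stripe the chunk belongs to, and which strip inside it.
    stripe_no, strip = divmod(chunk_no, nstrips)
    return strip, stripe_no * chunk_size + within
```

The point is that the mapping is pure arithmetic on a few per-file
constants, so it needs no lookup table at fault time.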
> - This has never been done in production, and we're throwing it in the way
> of a project that has been soaking for years and needs to support early
> shipments of products.
Correct. I haven't even implemented BPF-iomap for fuse4fs. This BPF
integration stuff is *highly* experimental code.
> If this is the only path, I'd like to revive famfs as a standalone file
> system. I'm still maintaining that and it's still in use.
Honestly, you should probably just ship that to your users. As long as
the ondisk format doesn't change much, switching the implementation at a
later date is at least still possible.
--D
^ permalink raw reply
* [PATCH v4 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 18:12 UTC (permalink / raw)
To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
Derek Barbosa, linux-doc, linux-kernel
In-Reply-To: <ad5_XCnVDlC9Hvup@lx-t490>
Add a configuration guide for real-time kernels.
List all Kconfig options that are recommended to be either enabled or
disabled. Explicitly add a table of contents at the top of the document,
so that all the options can be seen at a glance.
Whenever appropriate, link to other kernel guides; e.g. cpuidle, cpufreq,
power management, and no_hz.
Add a summary at the end of the document warning users that there is no
"one size fits all" solution for configuring a real-time system.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
---
* Changelog v4
Handle Sashiko's review remarks at
https://sashiko.dev/#/patchset/ad5_XCnVDlC9Hvup%40lx-t490
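As a possible companion to the guide in this patch, here is a minimal
sketch of how an integrator might check a kernel .config against the
"Expectation" fields. The table is transcribed from the options documented
in the patch, with the problematic debug options treated as "disabled".
The helper names are hypothetical, and treating "=m" as enabled is an
assumption.

```python
import re

# True = expected enabled, False = expected disabled (per the guide).
EXPECTED = {
    "CONFIG_CPU_FREQ": True,
    "CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE": True,
    "CONFIG_CPU_IDLE": True,
    "CONFIG_DRM": False,
    "CONFIG_EFI_DISABLE_RUNTIME": True,
    "CONFIG_NO_HZ": False,
    "CONFIG_NO_HZ_FULL": False,
    "CONFIG_PREEMPT_RT": True,
    "CONFIG_TRACING": True,
    "CONFIG_IRQSOFF_TRACER": False,
    "CONFIG_PREEMPT_TRACER": False,
    "CONFIG_LOCKUP_DETECTOR": False,
    "CONFIG_PROVE_LOCKING": False,
}

_ASSIGNMENT = re.compile(r"^(CONFIG_\w+)=(.*)$")


def parse_config(text):
    """Return the set of options set to y or m in .config text."""
    enabled = set()
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line.strip())
        if match and match.group(2) in ("y", "m"):
            enabled.add(match.group(1))
    return enabled


def check_config(text, expected=EXPECTED):
    """Return the sorted option names whose state differs from expected."""
    enabled = parse_config(text)
    return sorted(
        name for name, want in expected.items() if (name in enabled) != want
    )
```

Calling check_config() on the contents of a build's .config returns the
options that deviate from the guide. As the summary of the guide says,
a deviation is a prompt for review rather than an error in itself.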
Documentation/core-api/real-time/index.rst | 1 +
.../real-time/kernel-configuration.rst | 310 ++++++++++++++++++
2 files changed, 311 insertions(+)
create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst
diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index f08d2395a22c..a17a3dec535c 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -15,3 +15,4 @@ the required changes compared to a non-PREEMPT_RT configuration.
differences
hardware
architecture-porting
+ kernel-configuration
diff --git a/Documentation/core-api/real-time/kernel-configuration.rst b/Documentation/core-api/real-time/kernel-configuration.rst
new file mode 100644
index 000000000000..73f7730d468e
--- /dev/null
+++ b/Documentation/core-api/real-time/kernel-configuration.rst
@@ -0,0 +1,310 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Real-Time Kernel configuration
+==============================
+
+.. contents:: Table of Contents
+ :depth: 3
+ :local:
+
+Introduction
+============
+
+This document lists the kernel configuration options that might affect a
+real-time kernel's worst-case latency. It is intended for system integrators.
+
+Configuration options
+=====================
+
+``CONFIG_CPU_FREQ``
+-------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+The CPU frequency scaling subsystem ensures that the processor can operate
+at its maximum supported frequency. While, in general, bootloaders are
+tasked with setting the CPU clock to the highest speed on boot, some do
+not. It is thus desirable to keep this option enabled.
+
+.. caution::
+
+ A real-time kernel is not about being "as fast as possible". However,
+ real-time requirements may demand that the CPU is clocked at a
+ particular speed.
+
+``CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE``
+-------------------------------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+Real-Time workloads expect a fixed CPU frequency during execution. Using
+the performance governor is an easy way to achieve that purely from kernel
+configuration.
+
+This is not a blanket rule. Some setups might prefer to clock the CPU to
+lower speeds due to thermal packaging or other requirements. The key is
+that the CPU frequency remains constant once set.
+
+``CONFIG_CPU_IDLE``
+-------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+CPU idle states (C-states) allow the processor to enter low-power modes
+during periods of inactivity. Deeper CPU idle states may require
+flushing the CPU caches and lowering or disabling the clocking. This can
+lower power consumption, but it also increases the entry and exit latency
+from such states.
+
+While disabling this option eliminates cpuidle-related latencies, doing so
+can significantly impact hardware longevity, warranty, and thermal
+behavior. Users should cap the maximum C-state to C1 instead. For ACPI
+platforms, this can be achieved by using the boot parameter [1]_::
+
+ processor.max_cstate=1
+
+Higher C-states can be acceptable depending on the user workload's latency
+requirements. For ACPI-based platforms, use the ``cpupower idle-info``
+command to inspect the available idle states.
+
+For more information, please see:
+
+- ``linux/tools/power/cpupower``
+- :doc:`/admin-guide/pm/cpuidle`
+- :doc:`/admin-guide/pm/index`
+
+``CONFIG_DRM``
+--------------
+
+:Expectation: disabled
+:Severity: *info*
+
+GPU-accelerated workloads can share system resources with the CPU,
+including last-level cache (LLC) and memory bandwidth. Modern integrated
+GPUs optimize graphics performance at the expense of CPU determinism.
+
+Examples of affected platforms:
+
+- Intel processors with integrated graphics (Gen9 and later)
+- AMD APUs with Radeon Graphics
+- Xilinx Zynq UltraScale+ MPSoC EG/EV series
+
+If graphics workloads must run alongside real-time tasks, users must
+conduct thorough stress testing using tools like ``glmark2`` while
+measuring the overall system latency.
+
+For more information, please check:
+
+- :doc:`Regarding hardware (System memory and cache) </core-api/real-time/hardware>`
+- :doc:`/filesystems/resctrl`
+- `Real-Time and Graphics: A Contradiction?`_
+
+``CONFIG_EFI_DISABLE_RUNTIME``
+------------------------------
+
+:Expectation: enabled
+:Severity: *medium*
+
+EFI is the standard boot and firmware interface for multiple
+architectures. EFI runtime services provide callback functions to be
+called from the kernel; e.g., as utilized by (``CONFIG_EFI_VARS*``) or
+(``CONFIG_RTC_DRV_EFI``). For the former, the kernel calls into EFI to
+update the EFI variables.
+
+Calling into EFI means invoking firmware callbacks. During such
+invocations, the system might not be able to react to interrupts and will
+thus not be able to perform a context switch. This can cause significant
+latency spikes for the real-time system.
+
+``CONFIG_PREEMPT_RT`` enables this option by default. If this option is
+manually disabled at build time, the following boot parameter [1]_ may be
+used to disable EFI runtime at boot up::
+
+ efi=noruntime
+
+There is ongoing `development work`_ to allow access to EFI variables for a
+real-time Linux system.
+
+``CONFIG_NO_HZ`` / ``CONFIG_NO_HZ_FULL``
+----------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+Tickless operation can increase kernel-to-userspace transition latency due
+to the extra accounting and state book-keeping.
+
+*Guidance by real-time workload type:*
+
+- For periodic workloads; e.g., control loops executing every 100 µs, avoid
+ ``NO_HZ`` modes. Consistent kernel ticks are preferable.
+
+- For computation-intensive workloads; e.g. extended userspace execution,
+ ``NO_HZ_FULL`` may be beneficial. In such cases, users should offload
+ the kernel housekeeping to dedicated CPUs and isolate compute cores.
+
+See also :doc:`/timers/no_hz`.
+
+``CONFIG_PREEMPT_RT``
+---------------------
+
+:Expectation: enabled
+:Severity: **fatal**
+
+This option must be enabled, or the resulting kernel will not be fully
+preemptible and real-time capable.
+
+``CONFIG_TRACING`` (and tracing options)
+----------------------------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+Shipping kernels with tracing support enabled (but not actively running) is
+highly recommended. This will allow the users to extract more information
+if latency problems arise. Nonetheless, some tracers do incur latency
+overhead by just being enabled; see :ref:`tracers`.
+
+.. caution::
+
+ Users should *not* make use of tracers or trace events during production
+ real-time kernel operation as they can add considerable overhead and
+ degrade the system's latency.
+
+Non-performance CPU frequency governors
+---------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+To ensure reproducible system latency measurements, disable the
+non-``PERFORMANCE`` CPU frequency governors when possible. This avoids the
+risk of unknown userspace tasks implicitly or explicitly setting a
+different CPU frequency governor, and thus achieving different latency
+results across the system's runtime.
+
+If disabling other frequency governors is not an option, then
+``CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE`` should be enabled. In that case,
+users should set a *stable* CPU frequency setting during the system
+runtime, as changing the CPU frequency will increase the system latency and
+affect the reproducibility of latency measurements. If a lower CPU frequency is
+desired, then ``CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE`` should be set.
+
+The ``ONDEMAND`` CPU frequency governor should *not* be enabled in a
+real-time system since it dramatically affects determinism depending on the
+workload.
+
+For more information, please check :doc:`/admin-guide/pm/cpufreq`.
+
+Kernel Debug Options
+====================
+
+Most kernel debug options add runtime overhead that increases the
+worst-case latency.
+
+.. caution::
+
+ During development and early testing, users are encouraged to run their
+ real-time workloads and peripherals with lockdep (:ref:`lockdep`) and
+ other kernel debug options enabled, for a considerable amount of time.
+ Such workloads might trigger kernel code paths that were not triggered
+ during the internal Linux real-time kernel development, thus helping to
+ uncover locking and other types of kernel bugs.
+
+Problematic debug options
+-------------------------
+
+.. _tracers:
+
+``CONFIG_IRQSOFF_TRACER`` and ``CONFIG_PREEMPT_TRACER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+These tracers do incur measurable latency overhead even when tracing is not
+currently active.
+
+``CONFIG_LOCKUP_DETECTOR``
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+The lockup detector creates kernel timer callbacks that execute every few
+seconds, in hard-IRQ context, even on real-time kernels. These periodic
+interrupts can cause latency spikes.
+
+Users should use hardware watchdogs instead, which will provide a similar
+functionality without the software-induced latency.
+
+.. _lockdep:
+
+``CONFIG_PROVE_LOCKING``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+Proving the correctness of all kernel locking adds substantial overhead
+and significantly increases worst-case latency.
+
+Allowed kernel debug options
+----------------------------
+
+Kernel debug options which are not included in this list should be enabled
+with caution, after extensive auditing of their impact on system latency.
+
+``CONFIG_DEBUG_ATOMIC_SLEEP``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This sanity check catches common kernel programming errors with
+a tolerable latency cost.
+
+``CONFIG_DEBUG_BUGVERBOSE``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This improves the debugging capabilities without affecting normal
+operation latency.
+
+``CONFIG_DEBUG_FS``
+^^^^^^^^^^^^^^^^^^^
+
+This is safe to include in real-time kernels, *provided that debugfs is
+not accessed during production runtime*.
+
+``CONFIG_DEBUG_INFO``
+^^^^^^^^^^^^^^^^^^^^^
+
+This increases the kernel image size but has no latency impact. It is
+also essential for meaningful crash dumps and profiling.
+
+``CONFIG_DEBUG_KERNEL``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Meta-option which allows debug features to be enabled. This configuration
+option has no runtime impact, but be aware of any debug features that it
+may have allowed to be enabled.
+
+Summary
+=======
+
+There is no "one size fits all" solution for configuring a real-time Linux
+system. Beginning with the system real-time requirements, integrators
+must consider the features and functions of the system's hardware, kernel,
+and userspace. All such components must be properly configured in order
+to establish and constrain the system's maximum latency.
+
+With that in mind, any incorrect real-time kernel configuration could cause
+a new maximum latency that shows up at the wrong time and is catastrophic
+for the real-time system's latency.
+
+References
+==========
+
+.. [1] See :doc:`/admin-guide/kernel-parameters`
+
+.. _development work: https://lore.kernel.org/r/20260227170103.4042157-1-bigeasy@linutronix.de
+
+.. _Real-Time and Graphics\: A Contradiction?: https://web.archive.org/web/20221025085614/https://linutronix.de/PDF/Realtime_and_graphics-acontradiction2021.pdf
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v10 01/12] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
From: Pawan Gupta @ 2026-04-14 18:05 UTC (permalink / raw)
To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-1-efa924abae5f@linux.intel.com>
On Tue, Apr 14, 2026 at 12:05:28AM -0700, Pawan Gupta wrote:
> Currently, the BHB clearing sequence is followed by an LFENCE to prevent
> transient execution of subsequent indirect branches prematurely. However,
> the LFENCE barrier could be unnecessary in certain cases. For example, when
> the kernel is using the BHI_DIS_S mitigation, and BHB clearing is only
> needed for userspace. In such cases, the LFENCE is redundant because ring
> transitions would provide the necessary serialization.
>
> Below is a quick recap of BHI mitigation options:
>
> On Alder Lake and newer
>
> BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
> performance overhead.
>
> Long loop: Alternatively, a longer version of the BHB clearing sequence
> can be used to mitigate BHI. It can also be used to mitigate the BHI
> variant of VMSCAPE. This is not yet implemented in Linux.
>
> On older CPUs
>
> Short loop: Clears BHB at kernel entry and VMexit. The "Long loop" is
> effective on older CPUs as well, but should be avoided because of
> unnecessary overhead.
>
> On Alder Lake and newer CPUs, eIBRS isolates the indirect targets between
> guest and host. But when affected by the BHI variant of VMSCAPE, a guest's
> branch history may still influence indirect branches in userspace. This
> also means the big hammer IBPB could be replaced with a cheaper option that
> clears the BHB at exit-to-userspace after a VMexit.
>
> In preparation for adding the support for the BHB sequence (without LFENCE)
> on newer CPUs, move the LFENCE to the caller side after clear_bhb_loop() is
> executed. Allow callers to decide whether they need the LFENCE or not. This
> adds a few extra bytes to the call sites, but it obviates the need for
> multiple variants of clear_bhb_loop().
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Tested-by: Jon Kohler <jon@nutanix.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
Sorry this is missing Boris's Ack, I will fix.
> Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
^ permalink raw reply
* [PATCH v3 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 17:54 UTC (permalink / raw)
To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
Derek Barbosa, linux-doc, linux-kernel
In-Reply-To: <20260414174159.1271171-2-darwi@linutronix.de>
Add a configuration guide for real-time kernels.
List all Kconfig options that are recommended to be either enabled or
disabled. Explicitly add a table of contents at the top of the document,
so that all the options can be seen at a glance.
Whenever appropriate, link to other kernel guides; e.g. cpuidle, cpufreq,
power management, and no_hz.
Add a summary at the end of the document warning users that there is no
"one size fits all" solution for configuring a real-time system.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
---
* Changelog-v3
Order the "Problematic debug options" section alphabetically, thus matching
rest of the document. Link to v2 of bigeasy EFI runtime services work,
instead of v1.
Documentation/core-api/real-time/index.rst | 1 +
.../real-time/kernel-configuration.rst | 313 ++++++++++++++++++
2 files changed, 314 insertions(+)
create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst
diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index f08d2395a22c..a17a3dec535c 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -15,3 +15,4 @@ the required changes compared to a non-PREEMPT_RT configuration.
differences
hardware
architecture-porting
+ kernel-configuration
diff --git a/Documentation/core-api/real-time/kernel-configuration.rst b/Documentation/core-api/real-time/kernel-configuration.rst
new file mode 100644
index 000000000000..ab06ec2c6ef8
--- /dev/null
+++ b/Documentation/core-api/real-time/kernel-configuration.rst
@@ -0,0 +1,313 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Real-Time Kernel configuration
+==============================
+
+.. contents:: Table of Contents
+ :depth: 3
+ :local:
+
+Introduction
+============
+
+This document lists the kernel configuration options that might affect a
+real-time kernel's worst-case latency. It is intended for system integrators.
+
+Configuration options
+=====================
+
+``CONFIG_CPU_FREQ``
+-------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+The CPU frequency scaling subsystem ensures that the processor can operate
+at its maximum supported frequency. While, in general, bootloaders are
+tasked with setting the CPU clock to the highest speed on boot, some do
+not. It is thus desirable to keep this option enabled.
+
+.. caution::
+
+ A real-time kernel is not about being "as fast as possible". However,
+ real-time requirements may demand that the CPU is clocked at a
+ particular speed.
+
+``CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE``
+-------------------------------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+Real-Time workloads expect a fixed CPU frequency during execution. Using
+the performance governor is an easy way to achieve that purely from kernel
+configuration.
+
+This is not a blanket rule. Some setups might prefer to clock the CPU to
+lower speeds due to thermal packaging or other requirements. The key is
+that the CPU frequency remains constant once set.
+
+``CONFIG_CPU_IDLE``
+-------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+CPU idle states (C-states) allow the processor to enter low-power modes
+during periods of inactivity. Deeper CPU idle states may require
+flushing the CPU caches and lowering or disabling the clocking. This can
+lower power consumption, but it also increases the entry and exit latency
+from such states.
+
+While disabling this option eliminates cpuidle-related latencies, doing so
+can significantly impact hardware longevity, warranty, and thermal
+behavior. Users should cap the maximum C-state to C1 instead. For ACPI
+platforms, this can be achieved by using the boot parameter [1]_::
+
+ processor.max_cstate=1
+
+Higher C-states can be acceptable depending on the user workload's latency
+requirements. For ACPI-based platforms, use the ``cpupower idle-info``
+command to inspect the available idle states.
+
+For more information, please see:
+
+- ``linux/tools/power/cpupower``
+- :doc:`/admin-guide/pm/cpuidle`
+- :doc:`/admin-guide/pm/index`
+
+``CONFIG_DRM``
+--------------
+
+:Expectation: disabled
+:Severity: *info*
+
+GPU-accelerated workloads can share system resources with the CPU,
+including last-level cache (LLC) and memory bandwidth. Modern integrated
+GPUs optimize graphics performance at the expense of CPU determinism.
+
+Examples of affected platforms:
+
+- Intel processors with integrated graphics (Gen9 and later)
+- AMD APUs with Radeon Graphics
+- Xilinx Zynq UltraScale+ MPSoC EG/EV series
+
+If graphics workloads must run alongside real-time tasks, users must
+conduct thorough stress testing using tools like ``glmark2`` while
+measuring the overall system latency.
+
+For more information, please check:
+
+- :doc:`Regarding hardware (System memory and cache) </core-api/real-time/hardware>`
+- :doc:`/filesystems/resctrl`
+- `Real-Time and Graphics: A Contradiction?`_
+
+``CONFIG_EFI_DISABLE_RUNTIME``
+------------------------------
+
+:Expectation: enabled
+:Severity: *medium*
+
+EFI is the standard boot and firmware interface for multiple
+architectures. EFI runtime services provide callback functions to be
+called from the kernel; e.g., as utilized by (``CONFIG_EFI_VARS*``) or
+(``CONFIG_RTC_DRV_EFI``). For the former, the kernel calls into EFI to
+update the EFI variables.
+
+Calling into EFI means invoking firmware callbacks. During such
+invocations, the system might not be able to react to interrupts and will
+thus not be able to perform a context switch. This can cause significant
+latency spikes for the real-time system.
+
+``CONFIG_PREEMPT_RT`` enables this option by default. If this option is
+manually disabled at build time, the following boot parameter [1]_ may be
+used to disable EFI runtime at boot up::
+
+ efi=noruntime
+
+There is ongoing `development work`_ to allow access to EFI variables for a
+real-time Linux system.
+
+``CONFIG_NO_HZ`` / ``CONFIG_NO_HZ_FULL``
+----------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+Tickless operation can increase kernel-to-userspace transition latency due
+to the extra accounting and state book-keeping.
+
+*Guidance by real-time workload type:*
+
+- For periodic workloads; e.g., control loops executing every 100 µs, avoid
+ ``NO_HZ`` modes. Consistent kernel ticks are preferable.
+
+- For computation-intensive workloads; e.g. extended userspace execution,
+ ``NO_HZ_FULL`` may be beneficial. In such cases, users should offload
+ the kernel housekeeping to dedicated CPUs and isolate compute cores.
+
+See also :doc:`/timers/no_hz`.
+
+``CONFIG_PREEMPT_RT``
+---------------------
+
+:Expectation: enabled
+:Severity: **fatal**
+
+This option must be enabled, or the resulting kernel will not be fully
+preemptible and real-time capable.
+
+``CONFIG_TRACING`` (and tracing options)
+----------------------------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+Shipping kernels with tracing support enabled (but not actively running) is
+highly recommended. This will allow the users to extract more information
+if latency problems arise. Nonetheless, some tracers do incur latency
+overhead by just being enabled; see :ref:`tracers`.
+
+.. caution::
+
+ Users should *not* make use of tracers or trace events during production
+ real-time kernel operation as they can add considerable overhead and
+ degrade the system's latency.
+
+Non-performance CPU frequency governors
+---------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+To ensure reproducible system latency measurements, disable the
+non-``PERFORMANCE`` CPU frequency governors when possible. This avoids the
+risk of unknown userspace tasks implicitly or explicitly setting a
+different CPU frequency governor, and thus achieving different latency
+results across the system's runtime.
+
+If disabling other frequency governors is not an option, then
+``CPU_FREQ_DEFAULT_GOV_USERSPACE`` should be enabled. In that case, users
+should set a *stable* CPU frequency setting during the system runtime, as
+changing the CPU frequency will increase the system latency and affect
+latency measurements reproducibility. If a lower CPU frequency is desired,
+then ``CPU_FREQ_DEFAULT_GOV_POWERSAVE`` should be set.
+
+The ``ONDEMAND`` CPU frequency governor should *not* be enabled in a
+real-time system since it dramatically affects determinism depending on the
+workload.
+
+For more information, please check :doc:`/admin-guide/pm/cpufreq`.
+
+Kernel Debug Options
+====================
+
+Most kernel debug options add runtime overhead that increases the
+worst-case latency.
+
+.. TODO: Connect lockdep with PROVE_LOCKING. Make it clear that it does
+.. not uncover latency issues.
+
+.. caution::
+
+ During development and early testing, users are encouraged to run their
+ real-time workloads and peripherals with lockdep (:ref:`lockdep`) and
+ other kernel debug options enabled, for a considerable amount of time.
+ Such workloads might trigger kernel code paths that were not triggered
+ during the internal Linux real-time kernel development, thus helping to
+ uncover locking bugs and any real-time latency issues in the kernel.
+
+Problematic debug options
+-------------------------
+
+.. _tracers:
+
+``CONFIG_IRQSOFF_TRACER`` and ``CONFIG_PREEMPT_TRACER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+These tracers do incur measurable latency overhead even when tracing is not
+currently active.
+
+``CONFIG_LOCKUP_DETECTOR``
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+The lockup detector creates kernel timer callbacks that execute every few
+seconds, in hard-IRQ context, even on real-time kernels. These periodic
+interrupts can cause latency spikes.
+
+Users should use hardware watchdogs instead, which will provide a similar
+functionality without the software-induced latency.
+
+.. _lockdep:
+
+``CONFIG_PROVE_LOCKING``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+Proving the correctness of all kernel locking adds substantial overhead
+and significantly increases worst-case latency.
+
+Allowed kernel debug options
+----------------------------
+
+Kernel debug options which are not included in this list should be enabled
+with caution, after extensive auditing of their impact on system latency.
+
+``CONFIG_DEBUG_ATOMIC_SLEEP``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This sanity check catches common kernel programming errors with
+a tolerable latency cost.
+
+``CONFIG_DEBUG_BUGVERBOSE``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This improves the debugging capabilities without affecting normal
+operation latency.
+
+``CONFIG_DEBUG_FS``
+^^^^^^^^^^^^^^^^^^^
+
+This is safe to include in real-time kernels, *provided that debugfs is
+not accessed during production runtime*.
+
+``CONFIG_DEBUG_INFO``
+^^^^^^^^^^^^^^^^^^^^^
+
+This increases the kernel image size but has no latency impact. It is
+also essential for meaningful crash dumps and profiling.
+
+``CONFIG_DEBUG_KERNEL``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Meta-option which allows debug features to be enabled. This configuration
+option has no runtime impact, but be aware of any debug features that it
+may have allowed to be enabled.
+
+Summary
+=======
+
+There is no "one size fits all" solution for configuring a real-time Linux
+system. Beginning with the system real-time requirements, integrators
+must consider the features and functions of the system's hardware, kernel,
+and userspace. All such components must be properly configured in order
+to establish and constrain the system's maximum latency.
+
+With that in mind, any incorrect real-time kernel configuration could cause
+a new maximum latency that shows up at the wrong time and is catastrophic
+for the real-time system's latency.
+
+References
+==========
+
+.. [1] See :doc:`/admin-guide/kernel-parameters`
+
+.. _development work: https://lore.kernel.org/r/20260227170103.4042157-1-bigeasy@linutronix.de
+
+.. _Real-Time and Graphics\: A Contradiction?: https://web.archive.org/web/20221025085614/https://linutronix.de/PDF/Realtime_and_graphics-acontradiction2021.pdf
--
2.53.0
^ permalink raw reply related
* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Peter Xu @ 2026-04-14 17:45 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
James Houghton, Andrea Arcangeli
In-Reply-To: <ad5hAVuRwa_0VNPf@thinkstation>
On Tue, Apr 14, 2026 at 06:08:48PM +0100, Kiryl Shutsemau wrote:
> On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> > Hi, Kiryl,
> >
> > On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > > This series adds userfaultfd support for tracking the working set of
> > > VM guest memory, enabling VMMs to identify cold pages and evict them
> > > to tiered or remote storage.
> >
> > Thanks for sharing this work, it looks very interesting to me.
> >
> > Personally I am also looking at some kind of VMM memtiering issues. I'm
> > not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> > facing, it's slightly different but still a bit relevant:
> >
> > https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/
>
> Thanks, will read up. I didn't follow userfaultfd work until recently.
Thanks. Note that the proposal doesn't have much to do with userfaultfd.
You'll see when you start reading.
>
> > Unfortunately, that proposal was rejected upstream.
>
> Sorry about that. We can chat about it in the hall track, if you are there :)
I won't be there (as it's rejected.. hence not invited). But I'm always
happy to discuss this topic on the list or elsewhere. Along the way I
believe it'll also help us to know what the most acceptable path forward
is, as it's still very relevant.
>
> > > == VMM Workflow ==
> >
> > AFAIU, this workflow provides two functionalities:
> >
> > >
> > > UFFDIO_DEACTIVATE(all) -- async, no vCPU stalls
> > > sleep(interval)
> > > PAGEMAP_SCAN -- find cold pages
> >
> > Until here it's only about page hotness tracking. I am curious whether you
> > evaluated idle page tracking. Is it because of perf overheads on rmap?
>
> I didn't give idle page tracking much thought. I needed uffd faults to
> serialize reclaim against memory accesses. If we use it for one thing, we
> can as well try to use it for tracking too. And it seems to fit
> together nicely with sync/async mode flipping.
Yes, I get your point.
It's just that it'll still partly do what the access bit has already been
doing for mm core in general for tracking hotness. So I wonder if we should
still try to see if we can separate the two problems.
One other quick thought is maybe we could also report hotness from kernel
directly rather than relying on async faults, you can refer to "(2) Hotness
Information API" in my above proposal. Here when it's only about knowing
which page is less frequently used, it's only a READ interface.
>
> > To
> > me, your solution (until here.. on the hotness sampling) reads more like a
> > more efficient way to do idle page tracking but only per-mm, not per-folio.
> >
> > That will also be something I would like to benefit if QEMU will decide to
> > do full userspace swap. I think that's our last resort, I'll likely start
> > with something that makes QEMU work together with Linux on swapping
> > (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> > currently uses, as long as efficient) then QEMU only cares about the rest,
> > which is what the migration problem is about.
> >
> > The other issue about idle page tracking to us is, I believe MGLRU
> > currently doesn't work well with it (due to ignoring IDLE bits) where the
> > old LRU algo works. I'm not sure how much you evaluated above, so it'll be
> > great to share from that perspective too. I also mentioned some of these
> > challenges in the lsfmm proposal link above.
> >
> > > UFFDIO_SET_MODE(sync) -- block faults for eviction
> > > pwrite + MADV_DONTNEED cold pages -- safe, faults block
> > > UFFDIO_SET_MODE(async) -- resume tracking
> >
> > These operations are the 2nd function. It's, IMHO, a full userspace swap
> > system based on userfaultfd.
>
> Right. And we want to decide where to put cold pages from userspace.
>
> > Have you thought about directly relying on userfaultfd-wp to do this work?
> > The relevant question is, why do we need to block guest reads on pages
> > being evicted by the userapp? Can we still allow that to happen, which
> > seems to be more efficient? IIUC, only writes / updates matters in such
> > swap system.
>
> But we do care about read accesses. We don't want to swap out
> pages that got read-touched. And we cannot in practice switch to WP mode
This is a good point.
When it's considered on top of your above "async trapping to collect
hotness with userfaultfd" idea, it flows naturally with this idea indeed.
However, IMHO that should really be an extremely small window, and the
major part the userapp should rely on is the larger window sampling
whether, in your current case, PROT_NONE (or PTE_NONE for shmem) switched
back to an accessible PTE.
It means using RW protection vs. WR-ONLY protection will only differ very
slightly, if by accident some page got only read during eviction. For
example, if the mgmt app monitors PROT_NONE state for 30 seconds, makes a
decision to evict, evicting takes 5ms, and within those 5ms someone reads
the page, then it only misses the 5ms/30sec access pattern of the guest.
So far I don't yet know if this would justify a new kernel API just for
that small false positive of reporting some page as cold when it's actually
hot. To me it's still fine to consider using WP-ONLY and just allow that
trivial window to get refaulted later, because it shouldn't be the majority.
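As a rough sanity check of how small that window is, here is a minimal
sketch. The function name is made up for illustration, and the 30 s / 5 ms
figures are only the example values from the paragraph above, not
measurements:

```python
def missed_window_fraction(monitor_s: float, evict_s: float) -> float:
    """Ratio of the eviction window to the monitoring window."""
    if monitor_s <= 0:
        raise ValueError("monitoring window must be positive")
    return evict_s / monitor_s


# Example values from the paragraph above: 30 s of monitoring, 5 ms to evict.
fraction = missed_window_fraction(monitor_s=30.0, evict_s=0.005)
print(f"unobserved read window: {fraction:.4%}")
```

With those example values the unobserved window is on the order of 0.017% of
the sampling period, which is why the refault rate from it is expected to be
small.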
> after PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls
> with TLB flushing each.
This is indeed a concern, maybe a bigger one. I don't know how much
benefit we can get from avoiding one extra TLB flush when evicting. IMHO
some numbers would be great to justify this part.
While at this, I do have a plain question that is relevant to the full
protection scheme (and it can be naive; please bear with me on not yet
having read the whole series): if you change anon mappings to PROT_NONE in
pgtables, then how does the mgmt app read this page before dumping it
anywhere? It's not like shmem where you can have a separate mapping.
Do you need to fork(), for example?
>
> With my approach switching tracking and reclaiming is single bit flip
> under mmap lock.
>
> > Also, I'm not sure if you're aware of LLNL's umap library:
> >
> > https://github.com/llnl/umap
> >
> > That implemented the swap system using userfaultfd wr-protect mode only, so
> > no new kernel API needed.
>
> Will look into it. Thanks.
Thanks,
--
Peter Xu
^ permalink raw reply
* [PATCH v2 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 17:41 UTC (permalink / raw)
To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
Derek Barbosa, linux-doc, linux-kernel, Ahmed S. Darwish
In-Reply-To: <20260414174159.1271171-1-darwi@linutronix.de>
Add a configuration guide for real-time kernels.
List all Kconfig options that are recommended to be either enabled or
disabled. Explicitly add a table of contents at the top of the document,
so that all the options can be seen at a glance.
Whenever appropriate, link to other kernel guides; e.g. cpuidle, cpufreq,
power management, and no_hz.
Add a summary at the end of the document warning users that there is no
"one size fits all" solution for configuring a real-time system.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
---
Documentation/core-api/real-time/index.rst | 1 +
.../real-time/kernel-configuration.rst | 313 ++++++++++++++++++
2 files changed, 314 insertions(+)
create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst
diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index f08d2395a22c..a17a3dec535c 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -15,3 +15,4 @@ the required changes compared to a non-PREEMPT_RT configuration.
differences
hardware
architecture-porting
+ kernel-configuration
diff --git a/Documentation/core-api/real-time/kernel-configuration.rst b/Documentation/core-api/real-time/kernel-configuration.rst
new file mode 100644
index 000000000000..4310ca85f014
--- /dev/null
+++ b/Documentation/core-api/real-time/kernel-configuration.rst
@@ -0,0 +1,313 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Real-Time Kernel configuration
+==============================
+
+.. contents:: Table of Contents
+ :depth: 3
+ :local:
+
+Introduction
+============
+
+This document lists the kernel configuration options that might affect a
+real-time kernel's worst-case latency. It is intended for system integrators.
+
+Configuration options
+=====================
+
+``CONFIG_CPU_FREQ``
+-------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+The CPU frequency scaling subsystem ensures that the processor can operate
+at its maximum supported frequency. While, in general, bootloaders are
+tasked with setting the CPU clock to the highest speed on boot, some do
+not. It is thus desirable to keep this option enabled.
+
+.. caution::
+
+ A real-time kernel is not about being "as fast as possible", however
+ real-time requirements may demand that the CPU is clocked at a
+ particular speed.
+
+``CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE``
+-------------------------------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+Real-Time workloads expect a fixed CPU frequency during execution. Using
+the performance governor is an easy way to achieve that purely from kernel
+configuration.
+
+This is not a blanket rule. Some setups might prefer to clock the CPU to
+lower speeds due to thermal packaging or other requirements. The key is
+that the CPU frequency remains constant once set.
+
+``CONFIG_CPU_IDLE``
+-------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+CPU idle states (C-states) allow the processor to enter low-power modes
+during periods of inactivity. Very-low CPU idle states may require
+flushing the CPU caches and lowering or disabling the clocking. This can
+lower power consumption, but it also increases the entry and exit latency
+from such states.
+
+While disabling this option eliminates cpuidle-related latencies, doing so
+can significantly impact hardware longevity, warranty, and thermal
+behavior. Users should cap the maximum C-state to C1 instead. For ACPI
+platforms, this can be achieved by using the boot parameter [1]_::
+
+ processor.max_cstate=1
+
+Higher C-states can be acceptable depending on the user workload's latency
+requirements. For ACPI-based platforms, use the ``cpupower idle-info``
+command to inspect the available idle states.
+
+For more information, please see:
+
+- ``linux/tools/power/cpupower``
+- :doc:`/admin-guide/pm/cpuidle`
+- :doc:`/admin-guide/pm/index`
+
+``CONFIG_DRM``
+--------------
+
+:Expectation: disabled
+:Severity: *info*
+
+GPU-accelerated workloads can share system resources with the CPU,
+including last-level cache (LLC) and memory bandwidth. Modern integrated
+GPUs optimize graphics performance at the expense of CPU determinism.
+
+Examples of affected platforms:
+
+- Intel processors with integrated graphics (Gen9 and later)
+- AMD APUs with Radeon Graphics
+- Xilinx Zynq UltraScale+ MPSoC EG/EV series
+
+If graphics workloads must run alongside real-time tasks, users must
+conduct thorough stress testing using tools like ``glmark2`` while
+measuring the overall system latency.
+
+For more information, please check:
+
+- :doc:`Regarding hardware (System memory and cache) </core-api/real-time/hardware>`
+- :doc:`/filesystems/resctrl`
+- `Real-Time and Graphics: A Contradiction?`_
+
+``CONFIG_EFI_DISABLE_RUNTIME``
+------------------------------
+
+:Expectation: enabled
+:Severity: *medium*
+
+EFI is the standard boot and firmware interface for multiple
+architectures. EFI runtime services provide callback functions to be
+called from the kernel; e.g., as utilized by (``CONFIG_EFI_VARS*``) or
+(``CONFIG_RTC_DRV_EFI``). For the former, the kernel calls into EFI to
+update the EFI variables.
+
+Calling into EFI means invoking firmware callbacks. During such
+invocations, the system might not be able to react to interrupts and will
+thus not be able to perform a context switch. This can cause significant
+latency spikes for the real-time system.
+
+``CONFIG_PREEMPT_RT`` enables this option by default. If this option is
+manually disabled at build time, the following boot parameter [1]_ may be
+used to disable EFI runtime at boot up::
+
+ efi=noruntime
+
+There is ongoing `development work`_ to allow access to EFI variables for a
+real-time Linux system.
+
+``CONFIG_NO_HZ`` / ``CONFIG_NO_HZ_FULL``
+----------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+Tickless operation can increase kernel-to-userspace transition latency due
+to the extra accounting and state book-keeping.
+
+*Guidance by real-time workload type:*
+
+- For periodic workloads; e.g., control loops executing every 100 µs, avoid
+ ``NO_HZ`` modes. Consistent kernel ticks are preferable.
+
+- For computation-intensive workloads; e.g. extended userspace execution,
+ ``NO_HZ_FULL`` may be beneficial. In such cases, users should offload
+ the kernel housekeeping to dedicated CPUs and isolate compute cores.
+
+See also :doc:`/timers/no_hz`.
+
+``CONFIG_PREEMPT_RT``
+---------------------
+
+:Expectation: enabled
+:Severity: **fatal**
+
+This option must be enabled, or the resulting kernel will not be fully
+preemptible and real-time capable.
+
+``CONFIG_TRACING`` (and tracing options)
+----------------------------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+Shipping kernels with tracing support enabled (but not actively running) is
+highly recommended. This will allow the users to extract more information
+if latency problems arise. Nonetheless, some tracers do incur latency
+overhead by just being enabled; see :ref:`tracers`.
+
+.. caution::
+
+ Users should *not* make use of tracers or trace events during production
+ real-time kernel operation as they can add considerable overhead and
+ degrade the system's latency.
+
+Non-performance CPU frequency governors
+---------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+To ensure reproducible system latency measurements, disable the
+non-``PERFORMANCE`` CPU frequency governors when possible. This avoids the
+risk of unknown userspace tasks implicitly or explicitly setting a
+different CPU frequency governor, and thus achieving different latency
+results across the system's runtime.
+
+If disabling other frequency governors is not an option, then
+``CPU_FREQ_DEFAULT_GOV_USERSPACE`` should be enabled. In that case, users
+should set a *stable* CPU frequency setting during the system runtime, as
+changing the CPU frequency will increase the system latency and affect
+latency measurements reproducibility. If a lower CPU frequency is desired,
+then ``CPU_FREQ_DEFAULT_GOV_POWERSAVE`` should be set.
+
+The ``ONDEMAND`` CPU frequency governor should *not* be enabled in a
+real-time system since it dramatically affects determinism depending on the
+workload.
+
+For more information, please check :doc:`/admin-guide/pm/cpufreq`.
+
+Kernel Debug Options
+====================
+
+Most kernel debug options add runtime overhead that increases the
+worst-case latency.
+
+.. TODO: Connect lockdep with PROVE_LOCKING. Make it clear that it does
+.. not uncover latency issues.
+
+.. caution::
+
+ During development and early testing, users are encouraged to run their
+ real-time workloads and peripherals with lockdep (:ref:`lockdep`) and
+ other kernel debug options enabled, for a considerable amount of time.
+ Such workloads might trigger kernel code paths that were not triggered
+ during the internal Linux real-time kernel development, thus helping to
+ uncover locking bugs and any real-time latency issues in the kernel.
+
+Problematic debug options
+-------------------------
+
+``CONFIG_LOCKUP_DETECTOR``
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+The lockup detector creates kernel timer callbacks that execute every few
+seconds, in hard-IRQ context, even on real-time kernels. These periodic
+interrupts can cause latency spikes.
+
+Users should use hardware watchdogs instead, which will provide a similar
+functionality without the software-induced latency.
+
+.. _lockdep:
+
+``CONFIG_PROVE_LOCKING``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+Proving the correctness of all kernel locking adds substantial overhead
+and significantly increases worst-case latency.
+
+.. _tracers:
+
+``CONFIG_IRQSOFF_TRACER`` and ``CONFIG_PREEMPT_TRACER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+These tracers do incur measurable latency overhead even when tracing is not
+currently active.
+
+Allowed kernel debug options
+----------------------------
+
+Kernel debug options which are not included in this list should be enabled
+with caution, after extensive auditing of their impact on system latency.
+
+``CONFIG_DEBUG_ATOMIC_SLEEP``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This sanity check catches common kernel programming errors with
+a tolerable latency cost.
+
+``CONFIG_DEBUG_BUGVERBOSE``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This improves the debugging capabilities without affecting normal
+operation latency.
+
+``CONFIG_DEBUG_FS``
+^^^^^^^^^^^^^^^^^^^
+
+This is safe to include in real-time kernels, *provided that debugfs is
+not accessed during production runtime*.
+
+``CONFIG_DEBUG_INFO``
+^^^^^^^^^^^^^^^^^^^^^
+
+This increases the kernel image size but has no latency impact. It is
+also essential for meaningful crash dumps and profiling.
+
+``CONFIG_DEBUG_KERNEL``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Meta-option which allows debug features to be enabled. This configuration
+option has no runtime impact, but be aware of any debug features that it
+may have allowed to be enabled.
+
+Summary
+=======
+
+There is no "one size fits all" solution for configuring a real-time Linux
+system. Beginning with the system real-time requirements, integrators
+must consider the features and functions of the system's hardware, kernel,
+and userspace. All such components must be properly configured in order
+to establish and constrain the system's maximum latency.
+
+With that in mind, any incorrect real-time kernel configuration could cause
+a new maximum latency that shows up at the wrong time and is catastrophic
+for the real-time system's latency.
+
+References
+==========
+
+.. [1] See :doc:`/admin-guide/kernel-parameters`
+
+.. _development work: https://lore.kernel.org/r/20260205115559.1625236-1-bigeasy@linutronix.de
+
+.. _Real-Time and Graphics\: A Contradiction?: https://web.archive.org/web/20221025085614/https://linutronix.de/PDF/Realtime_and_graphics-acontradiction2021.pdf
--
2.53.0
^ permalink raw reply related
* [PATCH v2 0/1] Documentation: Add real-time kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 17:41 UTC (permalink / raw)
To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
Derek Barbosa, linux-doc, linux-kernel, Ahmed S. Darwish
Hi,
There is no "one size fits all" solution for configuring a PREEMPT_RT
kernel. Introduce a PREEMPT_RT kernel configuration guide to better help
system developers and integrators.
Changelog v2
------------
Handle Rostedt remarks:
- Better reword certain paragraphs and statements
- Warn about enabling CONFIG_IRQSOFF_TRACER and CONFIG_PREEMPT_TRACER
Handle Wilcox remarks:
- Remove ToC comment + minor rewording
Changelog v1
------------
https://lore.kernel.org/lkml/20260305205023.361530-1-darwi@linutronix.de
Thanks,
8<-----
Documentation/core-api/real-time/index.rst | 1 +
.../real-time/kernel-configuration.rst | 313 ++++++++++++++++++
2 files changed, 314 insertions(+)
create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst
base-commit: 028ef9c96e96197026887c0f092424679298aae8
--
2.53.0
^ permalink raw reply
* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:32 UTC (permalink / raw)
To: Kairui Song
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com>
On Tue, Apr 14, 2026 at 10:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> * I still think there's a good chance we can *significantly* close the
> gap overall between a design with virtual swap and a design without.
> It's a bit premature to commit to a vswap-optional route (which to be
> completely honest I'm still not confident is possible to satisfy all
> of our requirements).
And to further note - these benchmarks measure, in effect, purely swap
overhead. In a production environment with a lot of non-swap work, as
long as the gap is close enough I think we would be fine, even for a
hostile case like a fast swapfile backend (I assume SSD swap's
bottleneck will mostly be the IO).
I will stare at your responses to see if there are other benchmarks I
can play with, but it would be very helpful if you could share your full
suite :)
^ permalink raw reply
* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:23 UTC (permalink / raw)
To: Kairui Song
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com>
On Mon, Mar 23, 2026 at 1:05 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 12:41 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > > This patch series is based on 6.19. There are a couple more
> > > > > swap-related changes in mainline that I would need to coordinate
> > > > > with, but I still want to send this out as an update for the
> > > > > regressions reported by Kairui Song in [15]. It's probably easier
> > > > > to just build this thing rather than dig through that series of
> > > > > emails to get the fix patch :)
> > > > >
> > > > > Changelog:
> > > > > * v4 -> v5:
> > > > > * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > > > > * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > > > > and use guard(rcu) in vswap_cpu_dead
> > > > > (reported by Peter Zijlstra [17]).
> > > > > * v3 -> v4:
> > > > > * Fix poor swap free batching behavior to alleviate a regression
> > > > > (reported by Kairui Song).
> > > >
> > >
> > > Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> > > the regression in this patch series - we can talk more about
> > > directions in another thread :)
Hi Kairui,
My apologies if I missed your response, but could you share with me
your full benchmark suite? It would be hugely useful, not just for
this series, but for all swap contributions in the future :) We should
do as much homework ourselves as possible :P
And apologies for the delayed response. I kept having to go back and
forth between investigating the regression, figuring out what was
going on with the build setups (I missed some of the CONFIGs you had
originally), reducing variance on hosts, etc.
I don't have PMEM, so I have only worked with zram backend so far. I
did manage to reproduce the regressions you showed me (albeit at a
much smaller gap on certain metrics than your cited numbers, which I
suspect is due to zram/pmem difference).
There are two benchmarks that I focused on:
1. Usemem - the exact command I ran is:
   time ./usemem --init-time -O -y -x -n 1 56G
My host is 32GB, 52 processor(s) / x86_64.
Build      real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline   175.6 +/- 3.6   -         121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
vss_v5     184.0 +/- 3.9   +4.8%     130.5 +/- 3.8   376,192 +/- 8,581    8,297 +/- 247
(I hope the formatting works, but let me know if it looks weird).
2. Memhog: time memhog 48G
My host for this one is 16 GB, 52 processors, x86_64 too.
Build      real (s)       vs base   sys (s)
baseline   80.5 +/- 1.9   -         62.7 +/- 2.0
vss_v5     83.0 +/- 1.8   +3.1%     65.7 +/- 1.8
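For anyone re-deriving the "vs base" column, here is a minimal sketch,
assuming the column is the relative change in mean real time against
the baseline build. The helper name pct_vs_base() is made up for
illustration and is not part of any benchmark tooling.

```c
#include <assert.h>

/* Relative change of val against base, in percent. */
static double pct_vs_base(double base, double val)
{
	assert(base != 0.0);	/* a zero baseline has no meaningful ratio */
	return (val - base) / base * 100.0;
}
```

With the rounded means above, pct_vs_base(175.6, 184.0) is roughly 4.8
and pct_vs_base(80.5, 83.0) is roughly 3.1, which agrees with the
vss_v5 rows.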
On both benchmarks, I enabled MGLRU to more closely match the setup you had.
Staring at the run logs (and double-checking against the logs you sent
me to make sure it's not just my system), there are some common
patterns I noticed across these runs:
1. Kswapd is slower on the vswap side, which shifts work towards
direct reclaim, and makes compaction have to run harder (which has a
weird contention through zsmalloc - I can expand further, but this is
not vswap-specific, just exacerbated by slower kswapd).
2. Higher swap readahead (albeit with higher hit rate) - this is more
of an artifact of the fact that zero swap pages are no longer backed
by zram swapfile, which skipped readahead in certain paths. We can
ignore this for now, but worth assessing this for fast swap backends
in general (zero swap pages, zswap, so on and so forth).
I spent some time perf-ing kswapd, and hacked the usemem binary a bit so
that I can perf the free stage of usemem separately. Most of the
vswap-specific overhead lies in the xarray lookups. Some big offenders
off the top of my head:
1. Right now, in the physical swap allocator, whenever we have an
allocated slot in the range we're checking, we check if that slot is
swap-cache-only (i.e. no swap count), and if so we try to free it (if
swapfile is almost full etc.). This check is cheap if all swap entry
metadata lives in the physical swap layer only, but more expensive when
you have to go through another layer of indirection :)
I fixed that by just taking one bit in the reverse map to track
swap-cache-only state, which eliminates this without extra space
overhead (on top of the existing design).
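To illustrate the reverse-map bit, here is a minimal userspace sketch,
not the actual patch. It assumes a reverse-map entry is an unsigned
long that stores the virtual slot index and that its top bit is free to
use as a flag. All names (RMAP_CACHE_ONLY, rmap_pack(), and so on) are
hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical layout: top bit = swap-cache-only flag, rest = virtual slot. */
#define RMAP_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define RMAP_CACHE_ONLY	(1UL << (RMAP_BITS - 1))
#define RMAP_SLOT_MASK	(~RMAP_CACHE_ONLY)

/* Build a reverse-map entry from a virtual slot and the flag. */
static inline unsigned long rmap_pack(unsigned long vslot, bool cache_only)
{
	assert((vslot & RMAP_CACHE_ONLY) == 0);	/* slot must fit below the flag */
	return vslot | (cache_only ? RMAP_CACHE_ONLY : 0UL);
}

/* Extract the virtual slot, ignoring the flag bit. */
static inline unsigned long rmap_vslot(unsigned long entry)
{
	return entry & RMAP_SLOT_MASK;
}

/* Answer "is this slot swap-cache-only?" without touching the vswap layer. */
static inline bool rmap_cache_only(unsigned long entry)
{
	return (entry & RMAP_CACHE_ONLY) != 0;
}
```

The point of the layout is that the allocator can answer the
swap-cache-only question from the entry it already holds, without
growing the entry.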
2. On the free path, in swap_pte_batch(), we check cgroup to make sure
that the range we pass to free_swap_and_cache_nr() belongs to the same
cgroup, which has a per-PTE overhead for going to the vswap layer. We
can make this check once per range instead, to reduce overhead. Even
better - we can skip this check in swap_pte_batch() for the free case,
and defer it to later on, where we already enter vswap cluster lock
context :)
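The per-PTE versus per-range difference can be sketched as below. This
is a toy userspace model, not the kernel code: it assumes the cgroup
ids of a cluster sit in one array reachable through a single cluster
lookup, and it counts lookups with a plain counter. The struct layout,
CLUSTER_SLOTS and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SLOTS 512	/* hypothetical cluster size */

struct vcluster {
	unsigned short memcg_id[CLUSTER_SLOTS];
};

static unsigned long nr_lookups;	/* simulated indirection-layer lookups */

static struct vcluster *lookup_cluster(struct vcluster *table, unsigned long slot)
{
	nr_lookups++;
	return &table[slot / CLUSTER_SLOTS];
}

/* Per-entry variant: one lookup for every slot examined. */
static size_t batch_per_entry(struct vcluster *table, unsigned long start, size_t nr)
{
	unsigned short id;
	size_t i;

	if (nr == 0)
		return 0;
	id = lookup_cluster(table, start)->memcg_id[start % CLUSTER_SLOTS];
	for (i = 1; i < nr; i++) {
		unsigned long s = start + i;

		if (lookup_cluster(table, s)->memcg_id[s % CLUSTER_SLOTS] != id)
			break;
	}
	return i;	/* number of leading slots sharing one cgroup */
}

/* Per-range variant: one lookup, then scan inside that cluster only. */
static size_t batch_per_range(struct vcluster *table, unsigned long start, size_t nr)
{
	struct vcluster *c;
	unsigned long off = start % CLUSTER_SLOTS;
	size_t i;

	if (nr == 0)
		return 0;
	c = lookup_cluster(table, start);
	if (nr > CLUSTER_SLOTS - off)
		nr = CLUSTER_SLOTS - off;	/* do not run past the cluster */
	for (i = 1; i < nr; i++)
		if (c->memcg_id[off + i] != c->memcg_id[off])
			break;
	return i;
}
```

Both variants return the same batch length for a range inside one
cluster; only the number of trips through the indirection layer
differs.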
With a bunch of changes like that, I closed most of the gap:
usemem:
Build        real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline     175.6 +/- 3.6   -         121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
new_opt_v2   179.8 +/- 3.0   +2.4%     126.1 +/- 2.9   382,536 +/- 6,662    7,105 +/- 183
memhog:
Build        real (s)       vs base   sys (s)
baseline     80.5 +/- 1.9   -         62.7 +/- 2.0
new_opt_v2   79.9 +/- 1.7   -0.8%     62.4 +/- 1.7
I would like to also point out that, some of this overhead is specific
to the swapfile backend case, which is why we don't see this in zswap
in the stats I included in V5. Zswap does not require this
swap-cache-only dance, because in virtual swap, zswap only needs the
virtual swap slot as the index (on top of much more negligible space
overhead thanks to zswap tree merging into vswap cluster, no swap
charging, no double allocation, etc.).
Anyway, still a small gap. The next idea that I have is inspired by
the TLB, which caches virtual->physical memory address translations. I
added a per-CPU MRU virtual cluster. The idea is that a lot of
consecutive swap operations operate on the same range of swap entries -
merging these operations of course makes the most sense, but sometimes
it's not convenient to do it. The non-vswap, old design sometimes locks
the physical swap cluster and exposes the swap cluster struct to callers
to pass around, but I would like to avoid that if possible :)
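A rough userspace sketch of the MRU idea follows. It is not the code
from the series: it models a single one-entry cache as a plain struct
(in the kernel this would presumably be a per-CPU variable), uses an
array index as a stand-in for the xarray lookup, and all names and the
CLUSTER_SHIFT value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SHIFT 9	/* hypothetical: 512 slots per virtual cluster */

struct vcluster {
	unsigned long id;
};

/* One-entry most-recently-used cache of the last cluster touched. */
struct mru_cache {
	unsigned long id;
	struct vcluster *cluster;	/* NULL means the cache is empty */
	unsigned long hits, misses;
};

/* Stand-in for the slow path (for example an xarray lookup). */
static struct vcluster *slow_lookup(struct vcluster *table, unsigned long id)
{
	return &table[id];
}

/* Return the cluster for a slot, consulting the MRU entry first. */
static struct vcluster *cached_lookup(struct mru_cache *c,
				      struct vcluster *table,
				      unsigned long slot)
{
	unsigned long id = slot >> CLUSTER_SHIFT;

	if (c->cluster && c->id == id) {
		c->hits++;
		return c->cluster;
	}
	c->misses++;
	c->cluster = slow_lookup(table, id);
	c->id = id;
	return c->cluster;
}
```

Consecutive operations on slots of the same cluster then pay for one
slow lookup only. A real version would also have to invalidate or pin
the cached cluster when it is freed, which this sketch leaves out.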
With this change, we close the gap even further - exceeding the
baseline on average in certain cases, but as you can see it's within
noise so I wouldn't conclude too much from it:
usemem:
Build      real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline   175.6 +/- 3.6   -         121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
cc_v2      176.4 +/- 5.3   +0.4%     123.6 +/- 5.4   390,405 +/- 12,792   6,987 +/- 296
memhog:
Build      real (s)       vs base   sys (s)
baseline   80.5 +/- 1.9   -         62.7 +/- 2.0
cc_v2      79.9 +/- 0.9   -0.8%     62.1 +/- 1.5
The reclaim and compaction stats tell a similar story:
Reclaim / Compaction (usemem)
Metric          baseline                 vss_v5                   new_opt_v2               cc_v2
allocstall      167,787 +/- 10,292       170,532 +/- 15,185       169,782 +/- 9,903        168,635 +/- 13,526
pgsteal_kswapd  6,932,143 +/- 186,411    6,965,962 +/- 288,323    6,968,188 +/- 286,383    7,038,513 +/- 202,696
pgsteal_direct  9,759,350 +/- 480,674    9,978,721 +/- 765,543    9,899,698 +/- 480,781    9,845,668 +/- 544,319
swap_ra         82.9 +/- 22.6            5994.8 +/- 2817.5        4976.8 +/- 1484.2        4718.2 +/- 1510.5
pgmigrate       1,029,901 +/- 428,416    1,687,072 +/- 399,505    1,260,451 +/- 202,603    1,144,560 +/- 490,177
Reclaim / Compaction (memhog)
Metric          baseline                 vss_v5                   new_opt_v2               cc_v2
allocstall      101,245 +/- 6,271        109,320 +/- 12,180       100,207 +/- 11,053       99,223 +/- 9,905
pgsteal_kswapd  8,817,264 +/- 432,519    8,436,548 +/- 265,763    8,728,944 +/- 305,101    8,962,443 +/- 589,012
pgsteal_direct  5,408,046 +/- 394,775    5,932,611 +/- 584,873    5,419,891 +/- 551,226    5,349,352 +/- 601,655
swap_ra         66.5 +/- 22.8            8589.5 +/- 3325.1        8954.5 +/- 2661.9        8703.1 +/- 1746.6
pgmigrate       239,410 +/- 46,014       277,193 +/- 71,487       320,672 +/- 59,488       243,989 +/- 136,129
You can see that the later versions gradually restore the baseline's
behavior in terms of reclaim dynamics :)
Some final remarks:
* I still think there's a good chance we can *significantly* close the
gap overall between a design with virtual swap and a design without.
It's a bit premature to commit to a vswap-optional route (which, to be
completely honest, I'm still not confident can satisfy all of our
requirements).
* Regardless of the direction we take, these are all pitfalls that
will be problematic for virtual swap design, and more generally some
of them will affect any dynamic swap design (which has to go through
some sort of indirection or a dynamic data structure like xarray that
will induce some amount of lookup overhead). I hope my work here can
be useful in this sense too, outside of this specific vswap direction
:)
I will clean things up a bit and send you a v6 for further inspection.
Once again, I'd like to express my gratitude for your engagement and
feedback.
^ permalink raw reply
* Re: [PATCH v5] Documentation: Refactored watchdog old doc
From: Guenter Roeck @ 2026-04-14 17:18 UTC (permalink / raw)
To: Sunny Patel, linux-doc; +Cc: linux-watchdog, linux-kernel, corbet, wim, rdunlap
In-Reply-To: <20260413041215.10362-1-nueralspacetech@gmail.com>
On 4/12/26 21:11, Sunny Patel wrote:
> Mark WDIOC_GETTEMP and WDIOS_TEMPPANIC as deprecated since
> neither is implemented by the watchdog core and both are only
> present in a small number of legacy drivers.
>
> Add documentation for previously undocumented status bits
> WDIOF_MAGICCLOSE and WDIOF_ALARMONLY in the options field.
>
> Add documentation for WDIOF_PRETIMEOUT and WDIOF_SETTIMEOUT
> status bits describing their respective ioctls.
>
> Fix the following issues in existing documentation:
> - Remove version-specific reference to Linux 2.4.18 from
> the GETTIMEOUT ioctl description
> - Fix duplicate "was is" in printf format strings
> - Replace [FIXME] placeholder with proper descriptions for
> WDIOS_DISABLECARD, WDIOS_ENABLECARD and WDIOS_TEMPPANIC
>
> Signed-off-by: Sunny Patel <nueralspacetech@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
> ---
>
> Changes in v5:
> - Fixed WDIOC_GETTIMELEFT printf statement to correctly reference
> "timeleft" instead of "timeout".
>
> Changes in v4:
> - Fixed WDIOS_DISABLECARD description: corrected inverted logic —
> the ioctl disables the hardware timer entirely rather than
> stopping pings. Clarified that userspace, not the kernel driver,
> is primarily responsible for pinging under normal operation.
>
> Documentation/watchdog/watchdog-api.rst | 65 +++++++++++++++++++++----
> 1 file changed, 55 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/watchdog/watchdog-api.rst b/Documentation/watchdog/watchdog-api.rst
> index 78e228c272cf..736436a68f65 100644
> --- a/Documentation/watchdog/watchdog-api.rst
> +++ b/Documentation/watchdog/watchdog-api.rst
> @@ -2,7 +2,7 @@
> The Linux Watchdog driver API
> =============================
>
> -Last reviewed: 10/05/2007
> +Last reviewed: 04/08/2026
>
>
>
> @@ -42,7 +42,7 @@ activates as soon as /dev/watchdog is opened and will reboot unless
> the watchdog is pinged within a certain time, this time is called the
> timeout or margin. The simplest way to ping the watchdog is to write
> some data to the device. So a very simple watchdog daemon would look
> -like this source file: see samples/watchdog/watchdog-simple.c
> +like this source file: see samples/watchdog/watchdog-simple.c
>
> A more advanced driver could for example check that a HTTP server is
> still responding before doing the write call to ping the watchdog.
> @@ -106,11 +106,10 @@ the requested one due to limitation of the hardware::
> This example might actually print "The timeout was set to 60 seconds"
> if the device has a granularity of minutes for its timeout.
>
> -Starting with the Linux 2.4.18 kernel, it is possible to query the
> -current timeout using the GETTIMEOUT ioctl::
> +It is also possible to get the current timeout with the GETTIMEOUT ioctl::
>
> ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
> - printf("The timeout was is %d seconds\n", timeout);
> + printf("The timeout is %d seconds\n", timeout);
>
> Pretimeouts
> ===========
> @@ -133,7 +132,7 @@ seconds. Setting a pretimeout to zero disables it.
> There is also a get function for getting the pretimeout::
>
> ioctl(fd, WDIOC_GETPRETIMEOUT, &timeout);
> - printf("The pretimeout was is %d seconds\n", timeout);
> + printf("The pretimeout is %d seconds\n", timeout);
>
> Not all watchdog drivers will support a pretimeout.
>
> @@ -145,12 +144,12 @@ before the system will reboot. The WDIOC_GETTIMELEFT is the ioctl
> that returns the number of seconds before reboot::
>
> ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);
> - printf("The timeout was is %d seconds\n", timeleft);
> + printf("The timeleft is %d seconds\n", timeleft);
>
> Environmental monitoring
> ========================
>
> -All watchdog drivers are required return more information about the system,
> +All watchdog drivers are required to return more information about the system,
> some do temperature, fan and power level monitoring, some can tell you
> the reason for the last reboot of the system. The GETSUPPORT ioctl is
> available to ask what the device can do::
> @@ -227,12 +226,33 @@ The watchdog saw a keepalive ping since it was last queried.
> WDIOF_SETTIMEOUT Can set/get the timeout
> ================ =======================
>
> -The watchdog can do pretimeouts.
> +The watchdog supports timeout set/get via the WDIOC_SETTIMEOUT and
> +WDIOC_GETTIMEOUT ioctls.
>
> ================ ================================
> WDIOF_PRETIMEOUT Pretimeout (in seconds), get/set
> ================ ================================
>
> +The watchdog supports a pretimeout, a warning interrupt that fires before
> +the actual reboot timeout. Use WDIOC_SETPRETIMEOUT and WDIOC_GETPRETIMEOUT
> +to set/get the pretimeout.
> +
> + ================ ================================
> + WDIOF_MAGICCLOSE Supports magic close char
> + ================ ================================
> +
> +The driver supports the Magic Close feature. The watchdog is only disabled
> +if the character 'V' is written to /dev/watchdog before the file descriptor
> +is closed. Without writing 'V' before closing, the watchdog remains active
> +and will trigger a reboot after the timeout expires.
> +
> + ================ ================================
> + WDIOF_ALARMONLY Not a reboot watchdog
> + ================ ================================
> +
> +The watchdog will not reboot the system when it expires. Instead it
> +triggers a management or other external alarm. Userspace should not
> +rely on a system reboot occurring.
>
> For those drivers that return any bits set in the option field, the
> GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current
> @@ -254,6 +274,11 @@ returned value is the temperature in degrees Fahrenheit::
> int temperature;
> ioctl(fd, WDIOC_GETTEMP, &temperature);
>
> +.. note::
> + ``WDIOC_GETTEMP`` is not implemented by the watchdog core and is
> + considered deprecated. It is only supported by a small number of
> + legacy drivers. New drivers should not implement it.
> +
> Finally the SETOPTIONS ioctl can be used to control some aspects of
> the cards operation::
>
> @@ -268,4 +293,24 @@ The following options are available:
> WDIOS_TEMPPANIC Kernel panic on temperature trip
> ================= ================================
>
> -[FIXME -- better explanations]
> +``WDIOS_DISABLECARD`` disables the hardware watchdog timer entirely,
> +allowing a controlled system shutdown without triggering a reboot.
> +Userspace is responsible for pinging the watchdog under normal
> +operation; this ioctl stops the underlying hardware timer so that
> +the absence of pings no longer causes a system reset.
> +
> +``WDIOS_ENABLECARD`` starts the watchdog timer. If the watchdog was
> +previously stopped via ``WDIOS_DISABLECARD``, this will re-enable it. The
> +hardware watchdog will begin counting down from the configured timeout.
> +
> +``WDIOS_TEMPPANIC`` enables temperature-based kernel panic. When set,
> +the driver will call ``panic()`` (or ``kernel_power_off()`` on some
> +drivers) if the hardware temperature sensor exceeds its threshold,
> +rather than only setting the ``WDIOF_OVERHEAT`` status bit. Support
> +for this option is driver-specific; not all watchdog drivers implement
> +temperature monitoring.
> +
> +.. note::
> + ``WDIOS_TEMPPANIC`` is not implemented by the watchdog core and is
> + considered deprecated. It is only present in a small number of
> + legacy drivers. New drivers should not implement it.
^ permalink raw reply
* Re: [PATCH 4/6] hugetlb: drop vma_hugecache_offset() in favor of linear_page_index()
From: jane.chu @ 2026-04-14 17:14 UTC (permalink / raw)
To: Oscar Salvador
Cc: akpm, david, muchun.song, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, corbet, skhan, hughd, baolin.wang, peterx,
linux-mm, linux-doc, linux-kernel
In-Reply-To: <ad4Og_719Yq4yshf@localhost.localdomain>
On 4/14/2026 2:53 AM, Oscar Salvador wrote:
> On Thu, Apr 09, 2026 at 05:41:55PM -0600, Jane Chu wrote:
>> vma_hugecache_offset() converts a hugetlb VMA address into a mapping
>> offset in hugepage units. While the helper is small, its name is not very
>> clear, and the resulting code is harder to follow than using the common MM
>> helper directly.
>>
>> Use linear_page_index() instead, with an explicit conversion from
>> PAGE_SIZE units to hugepage units at each call site, and remove
>> vma_hugecache_offset().
>>
>> This makes the code a bit more direct and avoids a hugetlb-specific helper
>> whose behavior is already expressible with existing MM primitives.
>>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
>
>
> Looks good to me, the only thing is the conversion to hugepage units
> which may not be very clear to the casual reader, but you already
> mentioned that you will add a helper, so all good.
>
>
Yes, will do.
thanks!
-jane
>
^ permalink raw reply
* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Kiryl Shutsemau @ 2026-04-14 17:10 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Peter Xu, Lorenzo Stoakes, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm
In-Reply-To: <55019037-4f1c-4d9c-83ee-3a844d8f3d5e@kernel.org>
On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
> On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> >
> > == Problem ==
> >
> > VMMs managing guest memory need to:
> > 1. Track which pages are actively used (working set detection)
> > 2. Safely evict cold pages to slower storage
> > 3. Fetch pages back on demand when accessed again
> >
> > For shmem-backed guest memory, working set tracking partially works
> > today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> > re-access auto-resolves from cache. But safe eviction still requires
> > synchronous fault interception to prevent data loss races.
> >
> > For anonymous guest memory (needed for KSM cross-VM deduplication),
> > there is no mechanism at all — clearing a PTE loses the page.
> >
> > == Solution ==
> >
> > The series introduces a unified userfaultfd interface that works
> > across both anonymous and shmem-backed memory:
> >
> > UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> > private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> > balancing) to make pages inaccessible without freeing them.
>
> I would rather tackle this from the other direction: it's another form
> of protection (like WP), not really a "minor" mode.
>
> Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
> and support it for anon+shmem, avoiding the zapping for shmem completely?
I like this idea.
It should be functionally equivalent, but your interface idea fits
better with the rest.
Thanks! Will give it a try.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply