Linux Documentation


* Re: [PATCH V10 00/10] famfs: port into fuse
From: Joanne Koong @ 2026-04-14 22:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Miklos Szeredi, Bernd Schubert, John Groves,
	Dan Williams, Bernd Schubert, Alison Schofield, John Groves,
	Jonathan Corbet, Shuah Khan, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
	Christian Brauner, Randy Dunlap, Jeff Layton, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
	Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
	Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
	Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <20260414185740.GA604658@frogsfrogsfrogs>

On Tue, Apr 14, 2026 at 11:57 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> > On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > Overall, my intention with bringing this up is just to make sure we're
> > > > at least aware of this alternative before anything is merged and
> > > > permanent. If Miklos and you think we should land this series, then
> > > > I'm on board with that.
> > >
> > > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > > absolutely necessary.  This was the main sticking point originally,
> > > but there seemed to be no better alternative.
> > >
> > > However with the bpf approach this would be gone, which is great.
>
> Well... you can't get away with having *no* mapping interface at all.

Yes but the mapping interface should be *generic*, not one that is so
specifically tailored to one server. fuse will have to support this
forever.

> You still have to define a UABI that BPF programs can use to convey
> mapping data into fsdax/iomap.  BTF is a nice piece of work that smooths
> over minor fluctuations in struct layout between a running kernel and
> a precompiled BPF program, but fundamentally we still need a fuse-native
> representation.
>
> That last sentence was an indirect way of saying: No, we're not going
> to export struct iomap to userspace.  The fuse-iomap patchset provides
> all the UABI pieces we need for regular filesystems (ext4) and hardware
> adjacent filesystems (famfs) to exchange file mapping data with the
> kernel.  This has been out for review since last October, but the lack
> of engagement with that patchset (or its February resubmission) doesn't
> leave me with confidence that any of it is going anywhere.
>
> Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
> upload bpf programs to generate interleaved mappings.  It's not so hard
> to convert famfs' iomapping paths to use fuse-iomap, but I haven't
> helped him do that because:
>
> a) I have no idea what Miklos' thoughts are about merging any of the
> famfs stuff.
>
> b) I also have no idea what his thoughts are about fuse-iomap.  The
> sparse replies are not encouraging.
>
> c) It didn't seem fair to John to make him take on a whole new patchset
> dependency given (a) and (b).
>
> d) Nobody ever replied to my reply to the LSFMM thread about "can we do
> some code review of fuse iomap without waiting three months for LSFMM?"
> I've literally done nothing with fuse-iomap for two of the three months
> requested.
>
> > > So let us please at least have a try at this. I'm not into bpf yet,
> > > but willing to learn.
>
> I sent out the patches to enable exactly this sort of experimentation
> two months ago, and have not received any responses:
>
> https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/
>
> I would like to say this as gently as possible: I don't know what the
> problem here is, Miklos -- are you uninterested in the work?  Do you
> have too many other things to do inside RH that you can't talk about?
> Is it too difficult to figure out how the iomap stuff fits into the rest
> of the fuse codebase?  Do you need help from the rest of us to get
> reviews done?  Is there something else with which I could help?
>
> Because ... over the past few years, many of my team's filesystem
> projects have endured monthslong review cycles and often fail to get
> merged.  This has led to burnout and frustration among my teammates such
> that many of them chose to move on to other things.  For the remaining
> people, it was very difficult to justify continuing headcount when
> progress on projects is so slow that individuals cannot achieve even one
> milestone per quarter on any project.
>
> There's now nobody left here but me.
>
> I'm not blaming you (Miklos) for any of this, but that is the current
> deplorable state of things.
>
> > > Thanks,
> > > Miklos
> >
> > Thanks for responding...
> >
> > My short response: Noooooooooo!!!!!!
> >
> > I very strongly object to making this a prerequisite to merging. This
> > is an untested idea that will certainly delay us by at least a couple
> > of merge windows when products are shipping now, and the existing approach
> > has been in circulation for a long time. It is TOO LATE!!!!!!
>
> /me notes that "we're shipping so you have to merge it over people's
> concerns" rarely carries the day in LKML land, and has never ended well
> in the few cases that it happens.  As Ted is fond of saying, this is a
> team sport, not an individual effort.  Unfortunately, to abuse your
> sports metaphor, we all play for the ******* A's.
>
> That said, you're clearly pissed at the goalposts changing yet again,
> and that's really not fair that we collectively keep moving them.
>
> It's a rotten situation that I could have even helped you to solve both
> our problems via fuse-iomap, but I just couldn't motivate myself to
> entwine our two projects until the technical direction questions got
> answered.
>
> > Famfs is not a science project, it's enablement for actual products and
> > early versions are available now!!!
> >
> > That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.
>
> Heck, the fuse command field is a u32.  There is plenty of number space
> left, and the kernel can just *stop issuing them*.

I don't think the problem is the command field. As I understand it, if
this lands and is converted over later, none of the famfs code in this
series can be removed from fuse. If fuse has native non-bpf support
for famfs, then it will always need to have that. That's the part that
worries me.

>
> > What are the risks of converting to BPF?

I think maybe there is a misinterpretation of what the alternative
approach entails. From my point of view, the alternative approach is
not that different from what is already in this series. The only piece
of the famfs logic that would need to use bpf is the logic for
finding/computing the extent mappings (which is the famfs-specific
logic that would not be applicable to any other server). That famfs
bpf code is minimal and already written [1], as it is just the logic
that is in patch 6 [2] in this series copied over. No other part of
famfs touches bpf. The rest is renaming the functions in
fs/fuse/famfs.c to generic fuse_iomap_dax_XXX names (the logic is the
same logic in this series, eg invoking the lower-level calls to
dax_iomap_rw/fault/etc) and moving the daxdev setup/initialization to
connection initialization time where the server passes that daxdev
setup info/configs upfront. I don't think this would delay things by
several merge windows, as the code is already mostly written. If it
would be helpful, I can clean up what's in the prototype and send that
out.

I think the part that is not clear yet and needs to be verified is
whether this approach runs into any technical limitations on famfs's
production workloads. For example, does the overhead of using bpf maps
lead to a noticeable performance drop on real workloads? In the
future, will there be too many extent mappings on high-scale systems
to make this feasible? etc. If there are technical reasons why the
famfs logic has to be in fuse, then imo we should figure that out and
ideally that's the discussion we should be having. I am not a cxl
expert so perhaps there is something missing in the approach that
makes it not sufficient on production systems. If we don't end up
going with the alternative approach, I still think this series should
try to make the famfs uapi additions to fuse as generic as possible
since that will be irreversible.

If we expedited the alternative approach in terms of reviewing and
merging, would that suffice? Is the main pushback the timing of it, eg
that it would take too long to get reviewed, merged, and shipped?

> >
> > - I don't know how to do it - so it'll be slow (kinda like my fuse learning
> >   curve cost about a year because this is not that similar to anything
> >   else that was already in fuse).
>
> ...and per above, BPF isn't some magic savior that avoids the expansion
> of the UABI.

It doesn't avoid the expansion of the UABI but it makes the UABI
generic (eg plenty of future servers can/will use the generic iomap
layer).

>
> > - Those of us who are involved don't fully understand either the security
> >   or performance implications of this. It
>
> Correct.  I sure think it's swell that people can inject IR programs
> that jit/link into the kernel.  Don't ask which secondary connotation of
> "swell" I'm talking about.

bpf is used elsewhere in the kernel (eg networking, scheduling). If it
is the case that it is unsafe (which maybe it is, I don't know), then
wouldn't those other areas have the same issues?

>
> > - Famfs is enabling access to memory and mapping fault handling must be
> >   at "memory speed". We know that BPF walks some data structures when a
> >   program executes. That exposes us to additional serialized L3 cache
> >   misses each time we service a mapping fault (any TLB & page table miss).
> >   This should be studied side-by-side with the existing approach under
> >   multiple loads before being adopted for production.
>
> Yes, it should.  AFAICT if one switched to a per-inode bpf program, then
> you could do per-inode bpf programs.  Then you don't even need the bpf
> map, and the ->iomap_begin becomes an indirect call into JITted x86_64
> math code.
>
> (The downside is that dyn code can't be meaningfully signed, requires
> clang on the system, and you have to deal with inode eviction issues.)
>
> > - This has never been done in production, and we're throwing it in the way
> >   of a project that has been soaking for years and needs to support early
> >   shipments of products.
>
> Correct.  I haven't even implemented BPF-iomap for fuse4fs.  This BPF
> integration stuff is *highly* experimental code.

I think what fuse4fs needs for bpf is significantly more complicated
and intensive than what famfs needs. For famfs, the extent mapping
logic is straightforward computation.
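To make "straightforward computation" concrete, here is a minimal sketch of the kind of arithmetic an interleaved extent lookup involves. It is generic striping math, not the famfs layout and not the code in [1]; `interleave_layout`, `extent_hit`, and `map_file_offset` are hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a file striped round-robin across nstrips strips. */
struct interleave_layout {
	uint64_t chunk_size;	/* bytes placed on one strip before moving on */
	uint64_t nstrips;	/* number of strips in the interleave set */
};

/* Hypothetical lookup result for a single file offset. */
struct extent_hit {
	uint64_t strip;		/* which strip holds the byte */
	uint64_t strip_offset;	/* byte offset within that strip */
	uint64_t len;		/* contiguous bytes left in this chunk */
};

struct extent_hit map_file_offset(const struct interleave_layout *l,
				  uint64_t file_offset)
{
	struct extent_hit hit;
	uint64_t chunk_no, in_chunk;

	assert(l->chunk_size > 0 && l->nstrips > 0);

	chunk_no = file_offset / l->chunk_size;
	in_chunk = file_offset % l->chunk_size;

	/* Chunks are dealt out to the strips in round-robin order. */
	hit.strip = chunk_no % l->nstrips;
	/* Each full round of nstrips chunks advances every strip by one chunk. */
	hit.strip_offset = (chunk_no / l->nstrips) * l->chunk_size + in_chunk;
	hit.len = l->chunk_size - in_chunk;
	return hit;
}
```

With a 2 MiB chunk and four strips, file offset 5 * 2 MiB + 100 lands on strip 1 at strip offset 2 MiB + 100. Whether a BPF program doing this kind of math is fast enough on real workloads is exactly the open question raised in this thread.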

>
> > If this is the only path, I'd like to revive famfs as a standalone file
> > system. I'm still maintaining that and it's still in use.
>
> Honestly, you should probably just ship that to your users.  As long as
> the ondisk format doesn't change much, switching the implementation at a
> later date is at least still possible.

I recognize this is an unfair situation, John, as you've already spent
years working on this and did what the community asked with rewriting
it. What I'm hoping to convey is that the approach where the extent
computing/finding logic gets moved to bpf is not radically different
from the famfs logic already in this patchset. In my view, moving this
logic to bpf is more advantageous for both fuse *and* famfs
(decoupling famfs releases from kernel releases) - it would be great
to consider this on technical merits if expediting the timeline of the
alternative approach would suffice.

Thanks,
Joanne

[1] https://github.com/joannekoong/libfuse/blob/444fa27fa9fd2118a0dc332933197faf9bbf25aa/example/famfs.bpf.c
[2] https://lore.kernel.org/linux-fsdevel/0100019d43e79794-0eadcf5e-b659-43f7-8fdc-dec9f4ccce14-000000@email.amazonses.com/

>
> --D


* Re: [PATCH v2 04/12] tick/nohz: Transition to dynamic full dynticks state management
From: Thomas Gleixner @ 2026-04-14 21:57 UTC (permalink / raw)
  To: Qiliang Yuan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Paul E. McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar, Tejun Heo,
	Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Waiman Long,
	Chen Ridong, Michal Koutný, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, linux-mm, cgroups, linux-doc, linux-kselftest,
	Qiliang Yuan
In-Reply-To: <20260413-wujing-dhm-v2-4-06df21caba5d@gmail.com>

On Mon, Apr 13 2026 at 15:43, Qiliang Yuan wrote:

> Context:
> Full dynticks (NOHZ_FULL) is typically a static configuration determined
> at boot time. DHEI extends this to support runtime activation.

I have no idea what DHEI is. Provide proper information and not magic
acronyms.

> Problem:
> Switching to NOHZ_FULL at runtime requires careful synchronization
> of context tracking and housekeeping states. Re-invoking setup logic
> multiple times could lead to inconsistencies or warnings, and RCU
> dependency checks often prevented tick suppression in Zero-Conf setups.

And that careful synchronization is best achieved with an opaque
notifier callchain which relies on build time ordering. Impressive.

> Solution:
> - Replace the static tick_nohz_full_enabled() checks with a dynamic
>   tick_nohz_full_running state variable.

That variable existed before and you are telling the what and not the
why this is required and how that is correct vs. the other checks in
tick_nohz_full_enabled(). Also what's static about that function aside
from being marked static inline?

> - Refactor tick_nohz_full_setup to be safe for runtime invocation,
>   adding guards against re-initialization and ensuring IRQ work
>   interrupt support.

Refactoring has to be done in a preparatory patch and not 

> - Implement boot-time pre-activation of context tracking (shadow
>   init) for all possible CPUs to avoid instruction flow issues during
>   dynamic transitions.

Again lots of hand waving without a proper explanation.

> - Hook into housekeeping_notifier_list to update NO_HZ states dynamically.

See above.

> This provides the core state machine for reliable, on-demand tick
> suppression and high-performance isolation.

I can find a lot of hacks, but definitely not the slightest notion of a
state machine. Don't throw random buzzwords into a changelog if there is
no evidence for their existence.

> +static int tick_nohz_housekeeping_reconfigure(struct notifier_block *nb,
> +					     unsigned long action, void *data)
> +{
> +	struct housekeeping_update *upd = data;
> +	int cpu;
> +
> +	if (action == HK_UPDATE_MASK && upd->type == HK_TYPE_TICK) {
> +		cpumask_var_t non_housekeeping_mask;
> +
> +		if (!alloc_cpumask_var(&non_housekeeping_mask, GFP_KERNEL))
> +			return NOTIFY_BAD;
> +
> +		cpumask_andnot(non_housekeeping_mask, cpu_possible_mask, upd->new_mask);
> +
> +		if (!tick_nohz_full_mask) {
> +			if (!zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
> +				free_cpumask_var(non_housekeeping_mask);
> +				return NOTIFY_BAD;
> +			}
> +		}
> +
> +		/* Kick all CPUs to re-evaluate tick dependency before change */
> +		for_each_online_cpu(cpu)
> +			tick_nohz_full_kick_cpu(cpu);

That solves what?

> +		cpumask_copy(tick_nohz_full_mask, non_housekeeping_mask);

What's the exact point of this non_housekeeping_mask?

Why can't you simply do:

		cpumask_andnot(tick_nohz_full_mask, cpu_possible_mask, upd->new_mask);

That'd be too simple and comprehensible, right?
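The point about the temporary mask can be checked with a small userspace model. This is a sketch under stated assumptions, not kernel code: `mask_t` is a hypothetical 64-bit stand-in for a cpumask, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/* Shape of the patch: build a temporary mask, then copy it to the target. */
mask_t nohz_mask_via_temp(mask_t possible, mask_t housekeeping)
{
	mask_t non_housekeeping = possible & ~housekeeping;
	mask_t nohz_full = non_housekeeping;	/* models cpumask_copy() */

	return nohz_full;
}

/* Shape of the review suggestion: one andnot straight into the target. */
mask_t nohz_mask_direct(mask_t possible, mask_t housekeeping)
{
	return possible & ~housekeeping;
}

/* Both shapes must agree for any input; abort if they ever differ. */
void check_equivalent(mask_t possible, mask_t housekeeping)
{
	assert(nohz_mask_via_temp(possible, housekeeping) ==
	       nohz_mask_direct(possible, housekeeping));
}
```

In this model the temporary buys nothing: both functions return the same mask, and the direct form also drops one allocation and its failure path.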

> +		tick_nohz_full_running = !cpumask_empty(tick_nohz_full_mask);
> +
> +		/*
> +		 * If nohz_full is running, the timer duty must be on a housekeeper.
> +		 * If the current timer CPU is not a housekeeper, or no duty is assigned,
> +		 * pick the first housekeeper and assign it.
> +		 */
> +		if (tick_nohz_full_running) {
> +			int timer_cpu = READ_ONCE(tick_do_timer_cpu);

New line between declaration and code.

> +			if (timer_cpu == TICK_DO_TIMER_NONE ||
> +			    !cpumask_test_cpu(timer_cpu, upd->new_mask)) {

No line break required. You have 100 characters.

> +				int next_timer = cpumask_first(upd->new_mask);

next_timer? Please pick variable names which are comprehensible and self
explaining. Also why can't you re-use timer_cpu, which would be actually useful?


> +				if (next_timer < nr_cpu_ids)

How can upd->new_mask be empty? That'd be a bug, no?

> +					WRITE_ONCE(tick_do_timer_cpu, next_timer);
> +			}
> +		}
> +
> +		/* Kick all CPUs again to apply new nohz full state */
> +		for_each_online_cpu(cpu)
> +			tick_nohz_full_kick_cpu(cpu);

This whole thing lacks an explanation why it is even remotely correct.
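For reference, the timer-duty selection commented on above can be modeled in userspace as follows. This is a minimal sketch, assuming a 64-bit mask in place of a cpumask; `pick_timer_cpu` and `TIMER_CPU_NONE` are hypothetical names, not kernel symbols. It reuses `timer_cpu` as the review suggests and treats an empty housekeeping mask as a caller bug.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_CPU_NONE (-1)	/* hypothetical stand-in for TICK_DO_TIMER_NONE */
#define MAX_CPUS 64		/* limit of the 64-bit mask used in this model */

static int cpu_in_mask(int cpu, uint64_t mask)
{
	return cpu >= 0 && cpu < MAX_CPUS && ((mask >> cpu) & 1);
}

/*
 * Return the CPU that should carry the timer duty: keep the current one if
 * it is a housekeeper, otherwise fall back to the first housekeeper.
 */
int pick_timer_cpu(int timer_cpu, uint64_t housekeeping)
{
	/* An empty housekeeping mask is treated as a bug, not a valid state. */
	assert(housekeeping != 0);

	if (cpu_in_mask(timer_cpu, housekeeping))
		return timer_cpu;

	/* Reuse timer_cpu as the loop variable instead of a second local. */
	for (timer_cpu = 0; timer_cpu < MAX_CPUS; timer_cpu++)
		if (cpu_in_mask(timer_cpu, housekeeping))
			break;
	return timer_cpu;
}
```

The model says nothing about the concurrency questions raised in the review, such as what the two rounds of kicks are supposed to guarantee; it only pins down the selection rule itself.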

> void __init tick_nohz_init(void)
...
> +	if (!tick_nohz_full_mask) {
> +		if (!slab_is_available())
> +			alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> +		else
> +			zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL);
>  	}

I've seen the same code sequence before. Copy & paste is simpler than
providing helper functions.....
  
> -	if (IS_ENABLED(CONFIG_PM_SLEEP_SMP) &&
> -			!IS_ENABLED(CONFIG_PM_SLEEP_SMP_NONZERO_CPU)) {
> -		cpu = smp_processor_id();
> +	housekeeping_register_notifier(&tick_nohz_housekeeping_nb);
>  
> -		if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
> -			pr_warn("NO_HZ: Clearing %d from nohz_full range "
> -				"for timekeeping\n", cpu);
> -			cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> +	if (tick_nohz_full_running) {

This indentation and the resulting goto mess can be completely avoided
if you actually refactor the code and not just claim to do so.

Again, this does too many things at once and then explains them badly,
which makes it unreviewable.

Thanks,

        tglx


* Re: [PATCH v2 02/12] sched/isolation: Introduce housekeeping notifier infrastructure
From: Thomas Gleixner @ 2026-04-14 21:25 UTC (permalink / raw)
  To: Qiliang Yuan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Paul E. McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar, Tejun Heo,
	Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Waiman Long,
	Chen Ridong, Michal Koutný, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, linux-mm, cgroups, linux-doc, linux-kselftest,
	Qiliang Yuan
In-Reply-To: <20260413-wujing-dhm-v2-2-06df21caba5d@gmail.com>

On Mon, Apr 13 2026 at 15:43, Qiliang Yuan wrote:
>  
> +int housekeeping_register_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&housekeeping_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_register_notifier);
> +
> +int housekeeping_unregister_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&housekeeping_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_unregister_notifier);

As I said before, notifiers are a horrible interface especially for
things where most callers are built-in. Especially providing proper
ordering of the callbacks is a badly defined mechanism as demonstrated
by the now eliminated CPU hotplug notifiers.
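The ordering concern can be illustrated with a small userspace model of a priority-ordered callback chain. This is a sketch under assumptions: `chain_register`, `chain_call`, `tick_update`, and `irq_update` are hypothetical names, and the insertion rule (higher priority first, equal priorities in registration order) is a simplified model rather than the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CB 8
#define TRACE_SZ 16

typedef void (*update_cb)(char *trace);

struct chain {
	update_cb cb[MAX_CB];
	int prio[MAX_CB];
	size_t n;
};

/* Record which callback ran by appending one letter to the trace string. */
static void trace_append(char *trace, char c)
{
	size_t len = strlen(trace);

	assert(len + 1 < TRACE_SZ);
	trace[len] = c;
	trace[len + 1] = '\0';
}

void tick_update(char *trace) { trace_append(trace, 'T'); }
void irq_update(char *trace)  { trace_append(trace, 'I'); }

/* Higher priority runs first; equal priorities run in registration order. */
int chain_register(struct chain *c, update_cb cb, int prio)
{
	size_t pos = 0;

	if (c->n == MAX_CB)
		return -1;
	while (pos < c->n && c->prio[pos] >= prio)
		pos++;
	memmove(&c->cb[pos + 1], &c->cb[pos], (c->n - pos) * sizeof(c->cb[0]));
	memmove(&c->prio[pos + 1], &c->prio[pos], (c->n - pos) * sizeof(c->prio[0]));
	c->cb[pos] = cb;
	c->prio[pos] = prio;
	c->n++;
	return 0;
}

void chain_call(const struct chain *c, char *trace)
{
	for (size_t i = 0; i < c->n; i++)
		c->cb[i](trace);
}

/* Direct calls: the order is fixed and visible at the call site. */
void housekeeping_update_direct(char *trace)
{
	tick_update(trace);
	irq_update(trace);
}
```

With equal priorities, registering `irq_update` before `tick_update` yields the call order "IT", while the reverse registration yields "TI". The direct variant always yields "TI", which is the property the review is asking for when every callee is built in.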

> +int housekeeping_update_notify(enum hk_type type, const struct cpumask *new_mask)
> +{
> +	struct housekeeping_update update = {
> +		.type = type,
> +		.new_mask = new_mask,
> +	};
> +
> +	return blocking_notifier_call_chain(&housekeeping_notifier_list, HK_UPDATE_MASK, &update);
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_update_notify);

Why is this exported? Are random modules allowed to invoke this?


* Re: [PATCH v2 05/12] genirq: Support dynamic migration for managed interrupts
From: Thomas Gleixner @ 2026-04-14 21:21 UTC (permalink / raw)
  To: Qiliang Yuan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Paul E. McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar, Tejun Heo,
	Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Waiman Long,
	Chen Ridong, Michal Koutný, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, linux-mm, cgroups, linux-doc, linux-kselftest,
	Qiliang Yuan
In-Reply-To: <20260413-wujing-dhm-v2-5-06df21caba5d@gmail.com>

On Mon, Apr 13 2026 at 15:43, Qiliang Yuan wrote:
> +	irq_lock_sparse();
> +	for_each_active_irq(irq) {
> +		struct irq_data *irqd;

Please move the declaration into the scope where it is used.

> +		struct irq_desc *desc;
> +
> +		desc = irq_to_desc(irq);
> +		if (!desc)
> +			continue;
> +
> +		scoped_guard(raw_spinlock_irqsave, &desc->lock) {
> +			irqd = irq_desc_get_irq_data(desc);
> +			if (!irqd_affinity_is_managed(irqd) || !desc->action ||
> +			    !irq_data_get_irq_chip(irqd))
> +				continue;

That's a pretty random choice of conditions.

> +			/*
> +			 * Re-apply existing affinity to honor the new
> +			 * housekeeping mask via __irq_set_affinity() logic.
> +			 */
> +			irq_set_affinity_locked(irqd, irq_data_get_affinity_mask(irqd), false);

That's not sufficient. Assume an interrupt was shut down before the
change because there was no online CPU in the affinity mask, but now the
affinity mask changes so there is an online CPU. What starts it up?
Same the other way around.

> +static struct notifier_block irq_housekeeping_nb = {
> +	.notifier_call = irq_housekeeping_reconfigure,
> +};
> +
> +static int __init irq_init_housekeeping_notifier(void)
> +{
> +	housekeeping_register_notifier(&irq_housekeeping_nb);
> +	return 0;
> +}
> +core_initcall(irq_init_housekeeping_notifier);

I fundamentally despise notifiers especially when they are just invoking
something which is built in.


* Re: [RFC PATCH] Documentation: Add managed interrupts
From: Aaron Tomlin @ 2026-04-14 20:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Valentin Schneider, linux-doc, linux-kernel, Christoph Hellwig,
	Frederic Weisbecker, Jens Axboe, Jonathan Corbet, Ming Lei,
	Thomas Gleixner, Waiman Long, Peter Zijlstra, John Ogness
In-Reply-To: <20260413155726.BpD5Eh0T@linutronix.de>

On Mon, Apr 13, 2026 at 05:57:26PM +0200, Sebastian Andrzej Siewior wrote:
> For the managed_irq you could argue that this could also use some
> runtime configuration at which point isolcpus= would have a runtime
> counterpart and could be removed.
> After going through all this I concluded that it hardly makes sense
> since you would require callbacks in every driver using it or other
> magic "to reconfigure" but it already makes little sense using it.
> 
> Either way, I don't see anything wrong with using isolcpus=domain if you
> have a static setup and need/ want reconfigure at runtime.
> 
Hi Sebastian,

I completely agree.

-- 
Aaron Tomlin


* [PATCH] docs: kernel-parameters: document scope of irqaffinity= parameter
From: Aaron Tomlin @ 2026-04-14 20:02 UTC (permalink / raw)
  To: corbet, skhan
  Cc: tglx, akpm, bp, rdunlap, dave.hansen, feng.tang,
	pawan.kumar.gupta, dapeng1.mi, kees, elver, paulmck, lirongqing,
	bhelgaas, linux-doc, linux-kernel

System administrators frequently use the "irqaffinity=" boot parameter
in conjunction with CPU isolation to build deterministic, low-latency
environments. However, there is a widespread misconception that
"irqaffinity=" acts as a global, absolute override for all hardware
interrupts.

In reality, "irqaffinity=" strictly populates the irq_default_affinity
mask. When the kernel allocates multiqueue vectors
(e.g., irq_create_affinity_masks()), it explicitly bypasses this default
mask for managed interrupts. Instead, it relies on dynamic spreading
algorithms to map queues to the available topology, effectively
overriding any default the administrator set via the command line.

This patch explicitly documents this limitation in kernel-parameters.txt
to set correct expectations and directs users to the appropriate
"isolcpus=" sub-parameters for managed interrupt isolation.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9ed7c3ecd158..40ca92d8cf04 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2732,6 +2732,14 @@ Kernel parameters
 	irqaffinity=	[SMP] Set the default irq affinity mask
 			The argument is a cpu list, as described above.

+			Note: This parameter only sets the default affinity
+			for unmanaged interrupts (e.g., legacy single-queue
+			devices or unmanaged pre/post vectors). It is
+			explicitly ignored by managed interrupts, such as
+			those utilised by modern multiqueue storage
+			controllers. To isolate CPUs from managed
+			interrupts, see the "managed_irq" flag of "isolcpus=".
+
 	irqchip.gicv2_force_probe=
 			[ARM,ARM64,EARLY]
 			Format: <bool>
-- 
2.51.0


* Re: [PATCH v5 00/24] ARM64 PMU Partitioning
From: Colton Lewis @ 2026-04-14 19:55 UTC (permalink / raw)
  To: Colton Lewis
  Cc: will, oupton, kvm, pbonzini, corbet, linux, catalin.marinas, maz,
	oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
	mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest
In-Reply-To: <gsntpl7a694p.fsf@coltonlewis-kvm.c.googlers.com>

Colton Lewis <coltonlewis@google.com> writes:

> Will Deacon <will@kernel.org> writes:

>> On Tue, Dec 09, 2025 at 03:00:59PM -0800, Oliver Upton wrote:
>>> On Tue, Dec 09, 2025 at 08:50:57PM +0000, Colton Lewis wrote:
>>> > This series creates a new PMU scheme on ARM, a partitioned PMU that
>>> > allows reserving a subset of counters for more direct guest access,
>>> > significantly reducing overhead. More details, including performance
>>> > benchmarks, can be read in the v1 cover letter linked below.
>>> >
>>> > An overview of what this series accomplishes was presented at KVM
>>> > Forum 2025. Slides [1] and video [2] are linked below.
>>> >
>>> > The long duration between v4 and v5 is due to time spent on this
>>> > project being monopolized preparing this feature for internal
>>> > production. As a result, there are too many improvements to fully list
>>> > here, but I will cover the notable ones.

>>> Thanks for reposting. I think there's still quite a bit of ground to
>>> cover on the KVM side of this, but I would definitely appreciate it if
>>> someone with more context on the perf side of things could chime in.

>>> Will, IIRC you had some thoughts around counter allocation, right?

>> Right, I was hoping that the host counter reservation could be more
>> dynamic than a cmdline option. Perf already has support for pinning
>> events to a CPU, so the concept of some counters being unavailable
>> shouldn't be too much for the driver to handle. You might just need to
>> create some fake pinned events so that perf code understands what is
>> happening.

> Thanks Will. I have a few followup questions:

> 1. Are you suggesting this be done whenever we enter a guest so the host
> always has access to the full range in host context? That would be the
> most dynamic.

> 2. How should we handle the possibility a real event already occupies a
> counter wanted by the guest? Is there a good way to create our fake
> pinned events then force a reschedule so perf moves the real events out
> of the way?

> 3. Is there an existing fake event type that tells perf not to touch
> hardware?

> 4. Can you point to any example code that already does something like
> this?

Thank you Will and Mark for meeting with me to discuss things in person.

Here are my main takeaways so the list can comment:

Will's initial idea doesn't work because there is no way for KVM to pin
counters in a way that takes priority over counters pinned by the host
and therefore guarantee reservation.

An alternate idea I am proposing is to call the perf core
sched_in/sched_out functionality during vcpu_load/vcpu_put when guest
counters need to be reserved/unreserved.

That means having perf vacate all the host counters temporarily,
modifying the arm_pmu.cntr_mask to add/remove the appropriate counters,
then having perf schedule all host events back on the new set. Perf is
capable of doing that without any significant changes.

This is simple and should work because arm_pmu.cntr_mask is already
accessible from the vcpu struct and modifying it is already how the
existing boot-time counter reservation works.

There are some tradeoffs to this approach that will need further
consideration. The first is how to handle event groups. Perf allows
events to be grouped such that they must all be scheduled in at once. If
the host has a larger group than the number of counters available while
the vcpu is loaded, then it simply won't be able to schedule that group
in for that time period. Another is whether it will be acceptable
performance-wise to put perf sched_in/sched_out in
vcpu_load/vcpu_put. I'm unsure how much delay that would add to those
paths.
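The sched_out / modify mask / sched_in sequence and the event-group tradeoff can be modeled in a few lines of userspace C. This is a deliberately simplified sketch: `pmu_model`, `sched_in_groups`, `reserve_for_guest`, and `release_from_guest` are hypothetical names, and the greedy all-or-nothing placement is an assumption that ignores most of what the real perf core does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pmu_model {
	uint64_t hw_mask;	/* counters implemented in hardware */
	uint64_t cntr_mask;	/* counters the host may currently use */
};

static unsigned int count_bits(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Place host event groups on the available counters. A group goes in whole
 * or not at all. Returns a bitmap of the groups that fit.
 */
uint64_t sched_in_groups(const struct pmu_model *pmu,
			 const unsigned int *group_size, size_t ngroups)
{
	unsigned int free_cntrs = count_bits(pmu->cntr_mask);
	uint64_t scheduled = 0;

	for (size_t i = 0; i < ngroups && i < 64; i++) {
		if (group_size[i] <= free_cntrs) {
			free_cntrs -= group_size[i];
			scheduled |= UINT64_C(1) << i;
		}
	}
	return scheduled;
}

/* Models vcpu_load: vacate host counters, shrink the mask, schedule back in. */
uint64_t reserve_for_guest(struct pmu_model *pmu, uint64_t guest_mask,
			   const unsigned int *group_size, size_t ngroups)
{
	assert((guest_mask & ~pmu->hw_mask) == 0);
	pmu->cntr_mask = pmu->hw_mask & ~guest_mask;
	return sched_in_groups(pmu, group_size, ngroups);
}

/* Models vcpu_put: hand every counter back to the host and reschedule. */
uint64_t release_from_guest(struct pmu_model *pmu,
			    const unsigned int *group_size, size_t ngroups)
{
	pmu->cntr_mask = pmu->hw_mask;
	return sched_in_groups(pmu, group_size, ngroups);
}
```

With six counters and host groups of size 4 and 2, both groups fit while no guest is loaded. Reserving three counters for the guest leaves three for the host, so the group of four stays off the hardware until the counters are released again.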

Absent strong objections, I will be posting a series using this method.

Another idea, which I had later and which was not discussed, is a middle
approach that is less dynamic but gives the user control over when the
perf sched_in/sched_out happens. Expose the existing boot-time parameter
as writable in sysfs and do the sched_out/modify mask/sched_in when that
is written rather than in vcpu_load.


* Re: [PATCH V10 00/10] famfs: port into fuse
From: Darrick J. Wong @ 2026-04-14 18:57 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, John Groves,
	Dan Williams, Bernd Schubert, Alison Schofield, John Groves,
	Jonathan Corbet, Shuah Khan, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
	Christian Brauner, Randy Dunlap, Jeff Layton, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
	Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
	Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
	Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <ad4_jFsR951c2Mtn@groves.net>

On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > 
> > > Overall, my intention with bringing this up is just to make sure we're
> > > at least aware of this alternative before anything is merged and
> > > permanent. If Miklos and you think we should land this series, then
> > > I'm on board with that.
> > 
> > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > absolutely necessary.  This was the main sticking point originally,
> > but there seemed to be no better alternative.
> > 
> > However with the bpf approach this would be gone, which is great.

Well... you can't get away with having *no* mapping interface at all.
You still have to define a UABI that BPF programs can use to convey
mapping data into fsdax/iomap.  BTF is a nice piece of work that smooths
over minor fluctuations in struct layout between a running kernel and
a precompiled BPF program, but fundamentally we still need a fuse-native
representation.

That last sentence was an indirect way of saying: No, we're not going
to export struct iomap to userspace.  The fuse-iomap patchset provides
all the UABI pieces we need for regular filesystems (ext4) and hardware
adjacent filesystems (famfs) to exchange file mapping data with the
kernel.  This has been out for review since last October, but the lack
of engagement with that patchset (or its February resubmission) doesn't
leave me with confidence that any of it is going anywhere.

Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
upload bpf programs to generate interleaved mappings.  It's not so hard
to convert famfs' iomapping paths to use fuse-iomap, but I haven't
helped John do that because:

a) I have no idea what Miklos' thoughts are about merging any of the
famfs stuff.

b) I also have no idea what his thoughts are about fuse-iomap.  The
sparse replies are not encouraging.

c) It didn't seem fair to John to make him take on a whole new patchset
dependency given (a) and (b).

d) Nobody ever replied to my reply to the LSFMM thread about "can we do
some code review of fuse iomap without waiting three months for LSFMM?"
I've literally done nothing with fuse-iomap for two of the three months
requested.

> > So let us please at least have a try at this. I'm not into bpf yet,
> > but willing to learn.

I sent out the patches to enable exactly this sort of experimentation
two months ago, and have not received any responses:

https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/

I would like to say this as gently as possible: I don't know what the
problem here is, Miklos -- are you uninterested in the work?  Do you
have too many other things to do inside RH that you can't talk about?
Is it too difficult to figure out how the iomap stuff fits into the rest
of the fuse codebase?  Do you need help from the rest of us to get
reviews done?  Is there something else with which I could help?

Because ... over the past few years, many of my team's filesystem
projects have endured monthslong review cycles and often fail to get
merged.  This has led to burnout and frustration among my teammates such
that many of them chose to move on to other things.  For the remaining
people, it was very difficult to justify continuing headcount when
progress on projects is so slow that individuals cannot achieve even one
milestone per quarter on any project.

There's now nobody left here but me.

I'm not blaming you (Miklos) for any of this, but that is the current
deplorable state of things.

> > Thanks,
> > Miklos
> 
> Thanks for responding...
> 
> My short response: Noooooooooo!!!!!!
> 
> I very strongly object to making this a prerequisite to merging. This
> is an untested idea that will certainly delay us by at least a couple
> of merge windows when products are shipping now, and the existing approach
> has been in circulation for a long time. It is TOO LATE!!!!!!

/me notes that "we're shipping so you have to merge it over people's
concerns" rarely carries the day in LKML land, and has never ended well
in the few cases where it did.  As Ted is fond of saying, this is a
team sport, not an individual effort.  Unfortunately, to abuse your
sports metaphor, we all play for the ******* A's.

That said, you're clearly pissed at the goalposts changing yet again,
and it's really not fair that we collectively keep moving them.

It's a rotten situation: I could even have helped you solve both our
problems via fuse-iomap, but I just couldn't motivate myself to
entwine our two projects until the technical direction questions got
answered.

> Famfs is not a science project, it's enablement for actual products and
> early versions are available now!!!
> 
> That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.

Heck, the fuse command field is a u32.  There is plenty of numberspace
left, and the kernel can just *stop issuing them*.

> What are the risks of converting to BPF?
> 
> - I don't know how to do it - so it'll be slow (kinda like my fuse learning
>   curve cost about a year because this is not that similar to anything
>   else that was already in fuse).

...and per above, BPF isn't some magic savior that avoids the expansion
of the UABI.

> - Those of us who are involved don't fully understand either the security
>   or performance implications of this. It 

Correct.  I sure think it's swell that people can inject IR programs
that jit/link into the kernel.  Don't ask which secondary connotation of
"swell" I'm talking about.

> - Famfs is enabling access to memory and mapping fault handling must be
>   at "memory speed". We know that BPF walks some data structures when a 
>   program executes. That exposes us to additional serialized L3 cache 
>   misses each time we service a mapping fault (any TLB & page table miss).
>   This should be studied side-by-side with the existing approach under
>   multiple loads before being adopted for production.

Yes, it should.  AFAICT if one switched to a per-inode bpf program, then
you don't even need the bpf map, and the ->iomap_begin becomes an
indirect call into JITted x86_64 math code.

(The downside is that dyn code can't be meaningfully signed, requires
clang on the system, and you have to deal with inode eviction issues.)
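
To make "math code" concrete, here is a minimal sketch of the arithmetic
an interleaved-mapping lookup would have to perform, written in Python
purely for readability.  Everything in it is an assumption made for
illustration: the fixed chunk size, the plain round-robin striping, and
the names interleave_map, chunk_size and nstrips.  It is not famfs'
actual layout, and it is not the fuse-iomap UABI.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    strip: int       # index of the backing strip/device (hypothetical)
    dev_offset: int  # byte offset within that strip
    length: int      # contiguous bytes valid starting at dev_offset


def interleave_map(file_offset: int, chunk_size: int, nstrips: int) -> Mapping:
    """Map a file offset to a strip and offset under round-robin striping.

    Assumed layout: chunk 0 lives on strip 0, chunk 1 on strip 1, ...,
    wrapping back to strip 0 after nstrips chunks.
    """
    if file_offset < 0 or chunk_size <= 0 or nstrips <= 0:
        raise ValueError("file_offset must be >= 0; chunk_size, nstrips > 0")

    # Which chunk of the file, and how far into that chunk.
    chunk_no, within = divmod(file_offset, chunk_size)
    # Which full stripe (row) the chunk is in, and which strip (column).
    stripe_no, strip = divmod(chunk_no, nstrips)

    dev_offset = stripe_no * chunk_size + within
    # The mapping is only contiguous up to the end of the current chunk.
    return Mapping(strip, dev_offset, chunk_size - within)
```

The point of the sketch is only that the lookup is a couple of divisions
and a multiply with no table walk, which is why a per-inode program
would not need a map lookup for this kind of layout.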

> - This has never been done in production, and we're throwing it in the way
>   of a project that has been soaking for years and needs to support early
>   shipments of products.

Correct.  I haven't even implemented BPF-iomap for fuse4fs.  This BPF
integration stuff is *highly* experimental code.

> If this is the only path, I'd like to revive famfs as a standalone file
> system. I'm still maintaining that and it's still in use.

Honestly, you should probably just ship that to your users.  As long as
the ondisk format doesn't change much, switching the implementation at a
later date is at least still possible.

--D

* [PATCH v4 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 18:12 UTC (permalink / raw)
  To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
  Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
	Derek Barbosa, linux-doc, linux-kernel
In-Reply-To: <ad5_XCnVDlC9Hvup@lx-t490>

Add a configuration guide for real-time kernels.

List all Kconfig options that are recommended to be either enabled or
disabled.  Explicitly add a table of contents at the top of the document,
so that all the options can be seen at a glance.

Whenever appropriate, link to other kernel guides; e.g. cpuidle, cpufreq,
power management, and no_hz.

Add a summary at the end of the document warning users that there is no
"one size fits all" solution for configuring a real-time system.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
---

* Changelog v4

 Handle Sashiko's review remarks at
 https://sashiko.dev/#/patchset/ad5_XCnVDlC9Hvup%40lx-t490
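
 For anyone who wants to try the guide against a real build, below is a
 minimal sketch that compares a kernel .config with the expectations the
 new document lists.  The helper names (parse_config, check) and the
 EXPECTATIONS table are illustrative only and are not part of this patch.
 The table transcribes the "Expectation" fields plus two of the
 "Problematic debug options"; as the guide itself says, several of these
 (NO_HZ_FULL, DRM) are workload dependent rather than blanket rules.

```python
import re

# Transcribed from the guide: option -> True if it is expected enabled.
# This is a starting point, not a verdict; adjust per workload.
EXPECTATIONS = {
    "CONFIG_PREEMPT_RT": True,
    "CONFIG_CPU_FREQ": True,
    "CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE": True,
    "CONFIG_CPU_IDLE": True,
    "CONFIG_EFI_DISABLE_RUNTIME": True,
    "CONFIG_TRACING": True,
    "CONFIG_DRM": False,
    "CONFIG_NO_HZ": False,
    "CONFIG_NO_HZ_FULL": False,
    "CONFIG_LOCKUP_DETECTOR": False,
    "CONFIG_PROVE_LOCKING": False,
}

# Matches "CONFIG_FOO=y"; "# CONFIG_FOO is not set" lines do not match.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")


def parse_config(text):
    """Return {option: value} for every option that is set in .config text."""
    opts = {}
    for line in text.splitlines():
        m = _SET.match(line.strip())
        if m:
            opts[m.group(1)] = m.group(2)
    return opts


def check(text, expectations=EXPECTATIONS):
    """Return a sorted list of (option, problem) deviations from the table."""
    opts = parse_config(text)
    problems = []
    for opt, want_enabled in expectations.items():
        enabled = opts.get(opt) in ("y", "m")
        if enabled != want_enabled:
            wanted = "should be enabled" if want_enabled else "should be disabled"
            problems.append((opt, wanted))
    return sorted(problems)
```

 From a build tree this could be run as
 print(check(open(".config").read())); an empty list means the
 configuration matches the table above.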

 Documentation/core-api/real-time/index.rst    |   1 +
 .../real-time/kernel-configuration.rst        | 310 ++++++++++++++++++
 2 files changed, 311 insertions(+)
 create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst

diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index f08d2395a22c..a17a3dec535c 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -15,3 +15,4 @@ the required changes compared to a non-PREEMPT_RT configuration.
    differences
    hardware
    architecture-porting
+   kernel-configuration
diff --git a/Documentation/core-api/real-time/kernel-configuration.rst b/Documentation/core-api/real-time/kernel-configuration.rst
new file mode 100644
index 000000000000..73f7730d468e
--- /dev/null
+++ b/Documentation/core-api/real-time/kernel-configuration.rst
@@ -0,0 +1,310 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Real-Time Kernel configuration
+==============================
+
+.. contents:: Table of Contents
+   :depth: 3
+   :local:
+
+Introduction
+============
+
+This document lists the kernel configuration options that might affect a
+real-time kernel's worst-case latency.  It is intended for system integrators.
+
+Configuration options
+=====================
+
+``CONFIG_CPU_FREQ``
+-------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+The CPU frequency scaling subsystem ensures that the processor can operate
+at its maximum supported frequency.  While, in general, bootloaders are
+tasked with setting the CPU clock to the highest speed on boot, some do
+not.  It is thus desirable to keep this option enabled.
+
+.. caution::
+
+  A real-time kernel is not about being "as fast as possible", however
+  real-time requirements may demand that the CPU is clocked at a
+  particular speed.
+
+``CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE``
+-------------------------------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+Real-Time workloads expect a fixed CPU frequency during execution.  Using
+the performance governor is an easy way to achieve that purely from kernel
+configuration.
+
+This is not a blanket rule.  Some setups might prefer to clock the CPU to
+lower speeds due to thermal packaging or other requirements.  The key is
+that the CPU frequency remains constant once set.
+
+``CONFIG_CPU_IDLE``
+-------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+CPU idle states (C-states) allow the processor to enter low-power modes
+during periods of inactivity.  Very deep CPU idle states may require
+flushing the CPU caches and lowering or disabling the clocking.  This can
+lower power consumption, but it also increases the entry and exit latency
+from such states.
+
+While disabling this option eliminates cpuidle-related latencies, doing so
+can significantly impact hardware longevity, warranty, and thermal
+behavior.  Users should cap the maximum C-state to C1 instead.  For ACPI
+platforms, this can be achieved by using the boot parameter [1]_::
+
+  processor.max_cstate=1
+
+Higher C-states can be acceptable depending on the user workload's latency
+requirements.  For ACPI-based platforms, use the ``cpupower idle-info``
+command to inspect the available idle states.
+
+For more information, please see:
+
+- ``linux/tools/power/cpupower``
+- :doc:`/admin-guide/pm/cpuidle`
+- :doc:`/admin-guide/pm/index`
+
+``CONFIG_DRM``
+--------------
+
+:Expectation: disabled
+:Severity: *info*
+
+GPU-accelerated workloads can share system resources with the CPU,
+including last-level cache (LLC) and memory bandwidth.  Modern integrated
+GPUs optimize graphics performance at the expense of CPU determinism.
+
+Examples of affected platforms:
+
+- Intel processors with integrated graphics (Gen9 and later)
+- AMD APUs with Radeon Graphics
+- Xilinx Zynq UltraScale+ MPSoC EG/EV series
+
+If graphics workloads must run alongside real-time tasks, users must
+conduct thorough stress testing using tools like ``glmark2`` while
+measuring the overall system latency.
+
+For more information, please check:
+
+- :doc:`Regarding hardware (System memory and cache) </core-api/real-time/hardware>`
+- :doc:`/filesystems/resctrl`
+- `Real-Time and Graphics: A Contradiction?`_
+
+``CONFIG_EFI_DISABLE_RUNTIME``
+------------------------------
+
+:Expectation: enabled
+:Severity: *medium*
+
+EFI is the standard boot and firmware interface for multiple
+architectures.  EFI runtime services provide callback functions to be
+called from the kernel; e.g., as utilized by (``CONFIG_EFI_VARS*``) or
+(``CONFIG_RTC_DRV_EFI``).  For the former, the kernel calls into EFI to
+update the EFI variables.
+
+Calling into EFI means invoking firmware callbacks.  During such
+invocations, the system might not be able to react to interrupts and will
+thus not be able to perform a context switch.  This can cause significant
+latency spikes for the real-time system.
+
+``CONFIG_PREEMPT_RT`` enables this option by default.  If this option is
+manually disabled at build time, the following boot parameter [1]_ may be
+used to disable EFI runtime at boot up::
+
+  efi=noruntime
+
+There is ongoing `development work`_ to allow access to EFI variables for a
+real-time Linux system.
+
+``CONFIG_NO_HZ`` / ``CONFIG_NO_HZ_FULL``
+----------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+Tickless operation can increase kernel-to-userspace transition latency due
+to the extra accounting and state book-keeping.
+
+*Guidance by real-time workload type:*
+
+- For periodic workloads; e.g., control loops executing every 100 µs, avoid
+  ``NO_HZ`` modes.  Consistent kernel ticks are preferable.
+
+- For computation-intensive workloads; e.g. extended userspace execution,
+  ``NO_HZ_FULL`` may be beneficial.  In such cases, users should offload
+  the kernel housekeeping to dedicated CPUs and isolate compute cores.
+
+See also :doc:`/timers/no_hz`.
+
+``CONFIG_PREEMPT_RT``
+---------------------
+
+:Expectation: enabled
+:Severity: **fatal**
+
+This option must be enabled, or the resulting kernel will not be fully
+preemptible and real-time capable.
+
+``CONFIG_TRACING`` (and tracing options)
+----------------------------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+Shipping kernels with tracing support enabled (but not actively running) is
+highly recommended.  This will allow the users to extract more information
+if latency problems arise.  Nonetheless, some tracers do incur latency
+overhead by just being enabled; see :ref:`tracers`.
+
+.. caution::
+
+  Users should *not* make use of tracers or trace events during production
+  real-time kernel operation as they can add considerable overhead and
+  degrade the system's latency.
+
+Non-performance CPU frequency governors
+---------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+To ensure reproducible system latency measurements, disable the
+non-``PERFORMANCE`` CPU frequency governors when possible.  This avoids the
+risk of unknown userspace tasks implicitly or explicitly setting a
+different CPU frequency governor, and thus achieving different latency
+results across the system's runtime.
+
+If disabling other frequency governors is not an option, then
+``CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE`` should be enabled.  In that case,
+users should set a *stable* CPU frequency setting during the system
+runtime, as changing the CPU frequency will increase the system latency and
+affect latency-measurement reproducibility.  If a lower CPU frequency is
+desired, then ``CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE`` should be set.
+
+The ``ONDEMAND`` CPU frequency governor should *not* be enabled in a
+real-time system since it dramatically affects determinism depending on the
+workload.
+
+For more information, please check :doc:`/admin-guide/pm/cpufreq`.
+
+Kernel Debug Options
+====================
+
+Most kernel debug options add runtime overhead that increases the
+worst-case latency.
+
+.. caution::
+
+  During development and early testing, users are encouraged to run their
+  real-time workloads and peripherals with lockdep (:ref:`lockdep`) and
+  other kernel debug options enabled, for a considerable amount of time.
+  Such workloads might trigger kernel code paths that were not triggered
+  during the internal Linux real-time kernel development, thus helping to
+  uncover locking and other types of kernel bugs.
+
+Problematic debug options
+-------------------------
+
+.. _tracers:
+
+``CONFIG_IRQSOFF_TRACER`` and ``CONFIG_PREEMPT_TRACER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+These tracers do incur measurable latency overhead even when tracing is not
+currently active.
+
+``CONFIG_LOCKUP_DETECTOR``
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+The lockup detector creates kernel timer callbacks that execute every few
+seconds, in hard-IRQ context, even on real-time kernels.  These periodic
+interrupts can cause latency spikes.
+
+Users should use hardware watchdogs instead, which will provide a similar
+functionality without the software-induced latency.
+
+.. _lockdep:
+
+``CONFIG_PROVE_LOCKING``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+Proving the correctness of all kernel locking adds substantial overhead
+and significantly increases worst-case latency.
+
+Allowed kernel debug options
+----------------------------
+
+Kernel debug options which are not included in this list should be enabled
+with caution, after extensive auditing of their impact on system latency.
+
+``CONFIG_DEBUG_ATOMIC_SLEEP``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This sanity check catches common kernel programming errors with
+a tolerable latency cost.
+
+``CONFIG_DEBUG_BUGVERBOSE``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This improves the debugging capabilities without affecting normal
+operation latency.
+
+``CONFIG_DEBUG_FS``
+^^^^^^^^^^^^^^^^^^^
+
+This is safe to include in real-time kernels, *provided that debugfs is
+not accessed during production runtime*.
+
+``CONFIG_DEBUG_INFO``
+^^^^^^^^^^^^^^^^^^^^^
+
+This increases the kernel image size but has no latency impact.  It is
+also essential for meaningful crash dumps and profiling.
+
+``CONFIG_DEBUG_KERNEL``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Meta-option which allows debug features to be enabled.  This configuration
+option has no runtime impact, but be aware of any debug features that it
+may have allowed to be enabled.
+
+Summary
+=======
+
+There is no "one size fits all" solution for configuring a real-time Linux
+system.  Beginning with the system real-time requirements, integrators
+must consider the features and functions of the system's hardware, kernel,
+and userspace.  All such components must be properly configured in order
+to establish and constrain the system's maximum latency.
+
+With that in mind, any incorrect real-time kernel configuration could cause
+a new maximum latency that shows up at the wrong time and is catastrophic
+for the real-time system's latency.
+
+References
+==========
+
+.. [1] See :doc:`/admin-guide/kernel-parameters`
+
+.. _development work: https://lore.kernel.org/r/20260227170103.4042157-1-bigeasy@linutronix.de
+
+.. _Real-Time and Graphics\: A Contradiction?: https://web.archive.org/web/20221025085614/https://linutronix.de/PDF/Realtime_and_graphics-acontradiction2021.pdf
--
2.53.0

* Re: [PATCH v10 01/12] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
From: Pawan Gupta @ 2026-04-14 18:05 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-1-efa924abae5f@linux.intel.com>

On Tue, Apr 14, 2026 at 12:05:28AM -0700, Pawan Gupta wrote:
> Currently, the BHB clearing sequence is followed by an LFENCE to prevent
> transient execution of subsequent indirect branches prematurely. However,
> the LFENCE barrier could be unnecessary in certain cases. For example, when
> the kernel is using the BHI_DIS_S mitigation, and BHB clearing is only
> needed for userspace. In such cases, the LFENCE is redundant because ring
> transitions would provide the necessary serialization.
> 
> Below is a quick recap of BHI mitigation options:
> 
> On Alder Lake and newer
> 
>     BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
>     performance overhead.
> 
>     Long loop: Alternatively, a longer version of the BHB clearing sequence
>     can be used to mitigate BHI. It can also be used to mitigate the BHI
>     variant of VMSCAPE. This is not yet implemented in Linux.
> 
> On older CPUs
> 
>     Short loop: Clears BHB at kernel entry and VMexit. The "Long loop" is
>     effective on older CPUs as well, but should be avoided because of
>     unnecessary overhead.
> 
> On Alder Lake and newer CPUs, eIBRS isolates the indirect targets between
> guest and host. But when affected by the BHI variant of VMSCAPE, a guest's
> branch history may still influence indirect branches in userspace. This
> also means the big hammer IBPB could be replaced with a cheaper option that
> clears the BHB at exit-to-userspace after a VMexit.
> 
> In preparation for adding the support for the BHB sequence (without LFENCE)
> on newer CPUs, move the LFENCE to the caller side after clear_bhb_loop() is
> executed. Allow callers to decide whether they need the LFENCE or not. This
> adds a few extra bytes to the call sites, but it obviates the need for
> multiple variants of clear_bhb_loop().
> 
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Tested-by: Jon Kohler <jon@nutanix.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---

Sorry this is missing Boris's Ack, I will fix.

> Acked-by: Borislav Petkov (AMD) <bp@alien8.de>

* [PATCH v3 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 17:54 UTC (permalink / raw)
  To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
  Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
	Derek Barbosa, linux-doc, linux-kernel
In-Reply-To: <20260414174159.1271171-2-darwi@linutronix.de>

Add a configuration guide for real-time kernels.

List all Kconfig options that are recommended to be either enabled or
disabled.  Explicitly add a table of contents at the top of the document,
so that all the options can be seen at a glance.

Whenever appropriate, link to other kernel guides; e.g. cpuidle, cpufreq,
power management, and no_hz.

Add a summary at the end of the document warning users that there is no
"one size fits all" solution for configuring a real-time system.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
---

* Changelog-v3

Order the "Problematic debug options" section alphabetically, thus matching
the rest of the document.  Link to v2 of bigeasy's EFI runtime services
work, instead of v1.

 Documentation/core-api/real-time/index.rst    |   1 +
 .../real-time/kernel-configuration.rst        | 313 ++++++++++++++++++
 2 files changed, 314 insertions(+)
 create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst

diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index f08d2395a22c..a17a3dec535c 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -15,3 +15,4 @@ the required changes compared to a non-PREEMPT_RT configuration.
    differences
    hardware
    architecture-porting
+   kernel-configuration
diff --git a/Documentation/core-api/real-time/kernel-configuration.rst b/Documentation/core-api/real-time/kernel-configuration.rst
new file mode 100644
index 000000000000..ab06ec2c6ef8
--- /dev/null
+++ b/Documentation/core-api/real-time/kernel-configuration.rst
@@ -0,0 +1,313 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Real-Time Kernel configuration
+==============================
+
+.. contents:: Table of Contents
+   :depth: 3
+   :local:
+
+Introduction
+============
+
+This document lists the kernel configuration options that might affect a
+real-time kernel's worst-case latency.  It is intended for system integrators.
+
+Configuration options
+=====================
+
+``CONFIG_CPU_FREQ``
+-------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+The CPU frequency scaling subsystem ensures that the processor can operate
+at its maximum supported frequency.  While, in general, bootloaders are
+tasked with setting the CPU clock to the highest speed on boot, some do
+not.  It is thus desirable to keep this option enabled.
+
+.. caution::
+
+  A real-time kernel is not about being "as fast as possible", however
+  real-time requirements may demand that the CPU is clocked at a
+  particular speed.
+
+``CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE``
+-------------------------------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+Real-Time workloads expect a fixed CPU frequency during execution.  Using
+the performance governor is an easy way to achieve that purely from kernel
+configuration.
+
+This is not a blanket rule.  Some setups might prefer to clock the CPU to
+lower speeds due to thermal packaging or other requirements.  The key is
+that the CPU frequency remains constant once set.
+
+``CONFIG_CPU_IDLE``
+-------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+CPU idle states (C-states) allow the processor to enter low-power modes
+during periods of inactivity.  Very deep CPU idle states may require
+flushing the CPU caches and lowering or disabling the clocking.  This can
+lower power consumption, but it also increases the entry and exit latency
+from such states.
+
+While disabling this option eliminates cpuidle-related latencies, doing so
+can significantly impact hardware longevity, warranty, and thermal
+behavior.  Users should cap the maximum C-state to C1 instead.  For ACPI
+platforms, this can be achieved by using the boot parameter [1]_::
+
+  processor.max_cstate=1
+
+Higher C-states can be acceptable depending on the user workload's latency
+requirements.  For ACPI-based platforms, use the ``cpupower idle-info``
+command to inspect the available idle states.
+
+For more information, please see:
+
+- ``linux/tools/power/cpupower``
+- :doc:`/admin-guide/pm/cpuidle`
+- :doc:`/admin-guide/pm/index`
+
+``CONFIG_DRM``
+--------------
+
+:Expectation: disabled
+:Severity: *info*
+
+GPU-accelerated workloads can share system resources with the CPU,
+including last-level cache (LLC) and memory bandwidth.  Modern integrated
+GPUs optimize graphics performance at the expense of CPU determinism.
+
+Examples of affected platforms:
+
+- Intel processors with integrated graphics (Gen9 and later)
+- AMD APUs with Radeon Graphics
+- Xilinx Zynq UltraScale+ MPSoC EG/EV series
+
+If graphics workloads must run alongside real-time tasks, users must
+conduct thorough stress testing using tools like ``glmark2`` while
+measuring the overall system latency.
+
+For more information, please check:
+
+- :doc:`Regarding hardware (System memory and cache) </core-api/real-time/hardware>`
+- :doc:`/filesystems/resctrl`
+- `Real-Time and Graphics: A Contradiction?`_
+
+``CONFIG_EFI_DISABLE_RUNTIME``
+------------------------------
+
+:Expectation: enabled
+:Severity: *medium*
+
+EFI is the standard boot and firmware interface for multiple
+architectures.  EFI runtime services provide callback functions to be
+called from the kernel; e.g., as utilized by (``CONFIG_EFI_VARS*``) or
+(``CONFIG_RTC_DRV_EFI``).  For the former, the kernel calls into EFI to
+update the EFI variables.
+
+Calling into EFI means invoking firmware callbacks.  During such
+invocations, the system might not be able to react to interrupts and will
+thus not be able to perform a context switch.  This can cause significant
+latency spikes for the real-time system.
+
+``CONFIG_PREEMPT_RT`` enables this option by default.  If this option is
+manually disabled at build time, the following boot parameter [1]_ may be
+used to disable EFI runtime at boot up::
+
+  efi=noruntime
+
+There is ongoing `development work`_ to allow access to EFI variables for a
+real-time Linux system.
+
+``CONFIG_NO_HZ`` / ``CONFIG_NO_HZ_FULL``
+----------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+Tickless operation can increase kernel-to-userspace transition latency due
+to the extra accounting and state book-keeping.
+
+*Guidance by real-time workload type:*
+
+- For periodic workloads; e.g., control loops executing every 100 µs, avoid
+  ``NO_HZ`` modes.  Consistent kernel ticks are preferable.
+
+- For computation-intensive workloads; e.g. extended userspace execution,
+  ``NO_HZ_FULL`` may be beneficial.  In such cases, users should offload
+  the kernel housekeeping to dedicated CPUs and isolate compute cores.
+
+See also :doc:`/timers/no_hz`.
+
+``CONFIG_PREEMPT_RT``
+---------------------
+
+:Expectation: enabled
+:Severity: **fatal**
+
+This option must be enabled, or the resulting kernel will not be fully
+preemptible and real-time capable.
+
+``CONFIG_TRACING`` (and tracing options)
+----------------------------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+Shipping kernels with tracing support enabled (but not actively running) is
+highly recommended.  This will allow the users to extract more information
+if latency problems arise.  Nonetheless, some tracers do incur latency
+overhead by just being enabled; see :ref:`tracers`.
+
+.. caution::
+
+  Users should *not* make use of tracers or trace events during production
+  real-time kernel operation as they can add considerable overhead and
+  degrade the system's latency.
+
+Non-performance CPU frequency governors
+---------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+To ensure reproducible system latency measurements, disable the
+non-``PERFORMANCE`` CPU frequency governors when possible.  This avoids the
+risk of unknown userspace tasks implicitly or explicitly setting a
+different CPU frequency governor, and thus achieving different latency
+results across the system's runtime.
+
+If disabling other frequency governors is not an option, then
+``CPU_FREQ_DEFAULT_GOV_USERSPACE`` should be enabled.  In that case, users
+should set a *stable* CPU frequency setting during the system runtime, as
+changing the CPU frequency will increase the system latency and affect
+latency-measurement reproducibility.  If a lower CPU frequency is desired,
+then ``CPU_FREQ_DEFAULT_GOV_POWERSAVE`` should be set.
+
+The ``ONDEMAND`` CPU frequency governor should *not* be enabled in a
+real-time system since it dramatically affects determinism depending on the
+workload.
+
+For more information, please check :doc:`/admin-guide/pm/cpufreq`.
+
+Kernel Debug Options
+====================
+
+Most kernel debug options add runtime overhead that increases the
+worst-case latency.
+
+.. TODO: Connect lockdep with PROVE_LOCKING.  Make it clear that it does
+.. not uncover latency issues.
+
+.. caution::
+
+  During development and early testing, users are encouraged to run their
+  real-time workloads and peripherals with lockdep (:ref:`lockdep`) and
+  other kernel debug options enabled, for a considerable amount of time.
+  Such workloads might trigger kernel code paths that were not triggered
+  during the internal Linux real-time kernel development, thus helping to
+  uncover locking bugs and any real-time latency issues in the kernel.
+
+Problematic debug options
+-------------------------
+
+.. _tracers:
+
+``CONFIG_IRQSOFF_TRACER`` and ``CONFIG_PREEMPT_TRACER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+These tracers do incur measurable latency overhead even when tracing is not
+currently active.
+
+``CONFIG_LOCKUP_DETECTOR``
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+The lockup detector creates kernel timer callbacks that execute every few
+seconds, in hard-IRQ context, even on real-time kernels.  These periodic
+interrupts can cause latency spikes.
+
+Users should use hardware watchdogs instead, which will provide a similar
+functionality without the software-induced latency.
+
+.. _lockdep:
+
+``CONFIG_PROVE_LOCKING``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+Proving the correctness of all kernel locking adds substantial overhead
+and significantly increases worst-case latency.
+
+Allowed kernel debug options
+----------------------------
+
+Kernel debug options which are not included in this list should be enabled
+with caution, after extensive auditing of their impact on system latency.
+
+``CONFIG_DEBUG_ATOMIC_SLEEP``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This sanity check catches common kernel programming errors with
+a tolerable latency cost.
+
+``CONFIG_DEBUG_BUGVERBOSE``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This improves the debugging capabilities without affecting normal
+operation latency.
+
+``CONFIG_DEBUG_FS``
+^^^^^^^^^^^^^^^^^^^
+
+This is safe to include in real-time kernels, *provided that debugfs is
+not accessed during production runtime*.
+
+``CONFIG_DEBUG_INFO``
+^^^^^^^^^^^^^^^^^^^^^
+
+This increases the kernel image size but has no latency impact.  It is
+also essential for meaningful crash dumps and profiling.
+
+``CONFIG_DEBUG_KERNEL``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Meta-option which allows debug features to be enabled.  This configuration
+option has no runtime impact, but be aware of any debug features that it
+may have allowed to be enabled.
+
+Summary
+=======
+
+There is no "one size fits all" solution for configuring a real-time Linux
+system.  Beginning with the system real-time requirements, integrators
+must consider the features and functions of the system's hardware, kernel,
+and userspace.  All such components must be properly configured in order
+to establish and constrain the system's maximum latency.
+
+With that in mind, any incorrect real-time kernel configuration could cause
+a new maximum latency that shows up at the wrong time and is catastrophic
+for the real-time system's latency.
+
+References
+==========
+
+.. [1] See :doc:`/admin-guide/kernel-parameters`
+
+.. _development work: https://lore.kernel.org/r/20260227170103.4042157-1-bigeasy@linutronix.de
+
+.. _Real-Time and Graphics\: A Contradiction?: https://web.archive.org/web/20221025085614/https://linutronix.de/PDF/Realtime_and_graphics-acontradiction2021.pdf
--
2.53.0


* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Peter Xu @ 2026-04-14 17:45 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	James Houghton, Andrea Arcangeli
In-Reply-To: <ad5hAVuRwa_0VNPf@thinkstation>

On Tue, Apr 14, 2026 at 06:08:48PM +0100, Kiryl Shutsemau wrote:
> On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> > Hi, Kiryl,
> > 
> > On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > > This series adds userfaultfd support for tracking the working set of
> > > VM guest memory, enabling VMMs to identify cold pages and evict them
> > > to tiered or remote storage.
> > 
> > Thanks for sharing this work, it looks very interesting to me.
> > 
> > Personally I am also looking at some kind of VMM memtiering issues.  I'm
> > not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> > facing, it's slightly different but still a bit relevant:
> > 
> > https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/
> 
> Thanks, will read up. I didn't follow userfaultfd work until recently.

Thanks.  Note that the proposal doesn't have much to do with userfaultfd.
You'll see when you start reading.

> 
> > Unfortunately, that proposal was rejected upstream.
> 
> Sorry about that. We can chat about it in the hall track, if you are there :)

I won't be there (as it's rejected.. hence not invited).  But I'm always
happy to discuss this topic on the list or elsewhere.  Along the way I
believe it'll also help us to know what the most acceptable path forward
is, as it's still very relevant.

> 
> > > == VMM Workflow ==
> > 
> > AFAIU, this workflow provides two functionalities:
> > 
> > > 
> > >     UFFDIO_DEACTIVATE(all)            -- async, no vCPU stalls
> > >     sleep(interval)
> > >     PAGEMAP_SCAN                      -- find cold pages
> > 
> > Until here it's only about page hotness tracking.  I am curious whether you
> > evaluated idle page tracking.  Is it because of perf overheads on rmap?
> 
> I didn't give idle page tracking much thought. I needed uffd faults to
> serialize reclaim against memory accesses. If we use it for one thing, we
> can as well try to use it for tracking too. And it seems to fit
> together nicely with sync/async mode flipping.

Yes, I get your point.

It's just that it'll still partly duplicate what the access bit has already
been doing for mm core in general for tracking hotness.  So I wonder if we
should still try to see if we can separate the two problems.

One other quick thought is that maybe we could also report hotness from the
kernel directly rather than relying on async faults; you can refer to
"(2) Hotness Information API" in my above proposal.  Here, when it's only
about knowing which page is less frequently used, it's only a READ interface.

> 
> > To
> > me, your solution (until here.. on the hotness sampling) reads more like a
> > more efficient way to do idle page tracking but only per-mm, not per-folio.
> > 
> > That will also be something I would like to benefit if QEMU will decide to
> > do full userspace swap.  I think that's our last resort, I'll likely start
> > with something that makes QEMU work together with Linux on swapping
> > (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> > currently uses, as long as efficient) then QEMU only cares about the rest,
> > which is what the migration problem is about.
> > 
> > The other issue about idle page tracking to us is, I believe MGLRU
> > currently doesn't work well with it (due to ignoring IDLE bits) where the
> > old LRU algo works.  I'm not sure how much you evaluated above, so it'll be
> > great to share from that perspective too.  I also mentioned some of these
> > challenges in the lsfmm proposal link above.
> > 
> > >     UFFDIO_SET_MODE(sync)             -- block faults for eviction
> > >     pwrite + MADV_DONTNEED cold pages -- safe, faults block
> > >     UFFDIO_SET_MODE(async)            -- resume tracking
> > 
> > These operations are the 2nd function.  It's, IMHO, a full userspace swap
> > system based on userfaultfd.
> 
> Right. And we want to decide where to put cold pages from userspace. 
> 
> > Have you thought about directly relying on userfaultfd-wp to do this work?
> > The relevant question is, why do we need to block guest reads on pages
> > being evicted by the userapp?  Can we still allow that to happen, which
> > seems to be more efficient?  IIUC, only writes / updates matters in such
> > swap system.
> 
> But we do care about read accesses. We don't want to swap out
> pages that got read-touched. And we cannot in practice switch to WP mode

This is a good point.

When it's considered on top of your above "async trapping to collect
hotness with userfaultfd" idea, it flows naturally with this idea indeed.

However, IMHO that should really be an extremely small window.  What the
userapp should mostly rely on is the larger sampling window, which checks
whether, in your current case, PROT_NONE (or PTE_NONE for shmem) switched
back to an accessible PTE.

It means using RW protection vs. WP-ONLY protection will only differ very
slightly, namely if by accident some page gets read during eviction.  For
example, if the mgmt app monitors PROT_NONE state for 30 seconds, makes a
decision to evict, evicting takes 5ms, and within those 5ms someone reads
the page, then it only misses the 5ms/30sec access pattern of the guest.

So far I don't yet know if this would justify a new kernel API just for
that small false positive, i.e. reporting some page as cold when it's
actually hot.  To me it's still fine to consider using WP-ONLY and just
allow that trivial window to get refaulted later, because it shouldn't be
the majority.
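
As a rough back-of-the-envelope check of the window argument above (a
sketch only: the 30 second and 5 ms figures are just the example values
used in this mail, not measurements, and missed_fraction() is a made-up
helper name):

```python
def missed_fraction(evict_ms: float, monitor_s: float) -> float:
    """Eviction time as a share of the monitoring window.

    Reads that land inside the eviction time are the ones a
    write-protect-only scheme would not notice.
    """
    return evict_ms / (monitor_s * 1000.0)


# Example values from the text: 30 s of monitoring, 5 ms to evict.
frac = missed_fraction(evict_ms=5, monitor_s=30)
print(f"unobserved share of the window: {frac:.4%}")
```

With those example values the unobserved share is 1/6000 of the window,
which is why it is treated as a trivial window here.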

> after PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls
> with TLB flushing each.

This is indeed a concern, maybe a bigger one.  I don't know how much
benefit we can get from avoiding one extra TLB flush when evicting.  IMHO
some numbers would be great to justify this part.

While at this, I do have a plain question about the full protection
scheme (and it can be naive; please bear with me for not yet having read
the whole series): if you change anon mappings to PROT_NONE in pgtables,
then how does the mgmt app read this page before dumping it anywhere?
It's not like shmem where you can have a separate mapping.

Do you need to fork(), for example?

> 
> With my approach switching tracking and reclaiming is single bit flip
> under mmap lock.
> 
> > Also, I'm not sure if you're aware of LLNL's umap library:
> > 
> > https://github.com/llnl/umap
> > 
> > That implemented the swap system using userfaultfd wr-protect mode only, so
> > no new kernel API needed.
> 
> Will look into it. Thanks.

Thanks,

-- 
Peter Xu


* [PATCH v2 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 17:41 UTC (permalink / raw)
  To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
  Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
	Derek Barbosa, linux-doc, linux-kernel, Ahmed S. Darwish
In-Reply-To: <20260414174159.1271171-1-darwi@linutronix.de>

Add a configuration guide for real-time kernels.

List all Kconfig options that are recommended to be either enabled or
disabled.  Explicitly add a table of contents at the top of the document,
so that all the options can be seen at a glance.

Whenever appropriate, link to other kernel guides; e.g. cpuidle, cpufreq,
power management, and no_hz.

Add a summary at the end of the document warning users that there is no
"one size fits all" solution for configuring a real-time system.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
---
 Documentation/core-api/real-time/index.rst    |   1 +
 .../real-time/kernel-configuration.rst        | 313 ++++++++++++++++++
 2 files changed, 314 insertions(+)
 create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst

diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index f08d2395a22c..a17a3dec535c 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -15,3 +15,4 @@ the required changes compared to a non-PREEMPT_RT configuration.
    differences
    hardware
    architecture-porting
+   kernel-configuration
diff --git a/Documentation/core-api/real-time/kernel-configuration.rst b/Documentation/core-api/real-time/kernel-configuration.rst
new file mode 100644
index 000000000000..4310ca85f014
--- /dev/null
+++ b/Documentation/core-api/real-time/kernel-configuration.rst
@@ -0,0 +1,313 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Real-Time Kernel configuration
+==============================
+
+.. contents:: Table of Contents
+   :depth: 3
+   :local:
+
+Introduction
+============
+
+This document lists the kernel configuration options that might affect a
+real-time kernel's worst-case latency.  It is intended for system integrators.
+
+Configuration options
+=====================
+
+``CONFIG_CPU_FREQ``
+-------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+The CPU frequency scaling subsystem ensures that the processor can operate
+at its maximum supported frequency.  While, in general, bootloaders are
+tasked with setting the CPU clock to the highest speed on boot, some do
+not.  It is thus desirable to keep this option enabled.
+
+.. caution::
+
+  A real-time kernel is not about being "as fast as possible", however
+  real-time requirements may demand that the CPU is clocked at a
+  particular speed.
+
+``CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE``
+-------------------------------------------
+
+:Expectation: enabled
+:Severity: *high*
+
+Real-Time workloads expect a fixed CPU frequency during execution.  Using
+the performance governor is an easy way to achieve that purely from kernel
+configuration.
+
+This is not a blanket rule.  Some setups might prefer to clock the CPU to
+lower speeds due to thermal packaging or other requirements.  The key is
+that the CPU frequency remains constant once set.
+
+``CONFIG_CPU_IDLE``
+-------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+CPU idle states (C-states) allow the processor to enter low-power modes
+during periods of inactivity.  Very-low CPU idle states may require
+flushing the CPU caches and lowering or disabling the clocking.  This can
+lower power consumption, but it also increases the entry and exit latency
+from such states.
+
+While disabling this option eliminates cpuidle-related latencies, doing so
+can significantly impact hardware longevity, warranty, and thermal
+behavior.  Users should cap the maximum C-state to C1 instead.  For ACPI
+platforms, this can be achieved by using the boot parameter [1]_::
+
+  processor.max_cstate=1
+
+Higher C-states can be acceptable depending on the user workload's latency
+requirements.  For ACPI-based platforms, use the ``cpupower idle-info``
+command to inspect the available idle states.
+
+For more information, please see:
+
+- ``linux/tools/power/cpupower``
+- :doc:`/admin-guide/pm/cpuidle`
+- :doc:`/admin-guide/pm/index`
+
+``CONFIG_DRM``
+--------------
+
+:Expectation: disabled
+:Severity: *info*
+
+GPU-accelerated workloads can share system resources with the CPU,
+including last-level cache (LLC) and memory bandwidth.  Modern integrated
+GPUs optimize graphics performance at the expense of CPU determinism.
+
+Examples of affected platforms:
+
+- Intel processors with integrated graphics (Gen9 and later)
+- AMD APUs with Radeon Graphics
+- Xilinx Zynq UltraScale+ MPSoC EG/EV series
+
+If graphics workloads must run alongside real-time tasks, users must
+conduct thorough stress testing using tools like ``glmark2`` while
+measuring the overall system latency.
+
+For more information, please check:
+
+- :doc:`Regarding hardware (System memory and cache) </core-api/real-time/hardware>`
+- :doc:`/filesystems/resctrl`
+- `Real-Time and Graphics: A Contradiction?`_
+
+``CONFIG_EFI_DISABLE_RUNTIME``
+------------------------------
+
+:Expectation: enabled
+:Severity: *medium*
+
+EFI is the standard boot and firmware interface for multiple
+architectures.  EFI runtime services provide callback functions to be
+called from the kernel; e.g., as utilized by (``CONFIG_EFI_VARS*``) or
+(``CONFIG_RTC_DRV_EFI``).  For the former, the kernel calls into EFI to
+update the EFI variables.
+
+Calling into EFI means invoking firmware callbacks.  During such
+invocations, the system might not be able to react to interrupts and will
+thus not be able to perform a context switch.  This can cause significant
+latency spikes for the real-time system.
+
+``CONFIG_PREEMPT_RT`` enables this option by default.  If this option is
+manually disabled at build time, the following boot parameter [1]_ may be
+used to disable EFI runtime at boot up::
+
+  efi=noruntime
+
+There is ongoing `development work`_ to allow access to EFI variables for a
+real-time Linux system.
+
+``CONFIG_NO_HZ`` / ``CONFIG_NO_HZ_FULL``
+----------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+Tickless operation can increase kernel-to-userspace transition latency due
+to the extra accounting and state book-keeping.
+
+*Guidance by real-time workload type:*
+
+- For periodic workloads; e.g., control loops executing every 100 µs, avoid
+  ``NO_HZ`` modes.  Consistent kernel ticks are preferable.
+
+- For computation-intensive workloads; e.g. extended userspace execution,
+  ``NO_HZ_FULL`` may be beneficial.  In such cases, users should offload
+  the kernel housekeeping to dedicated CPUs and isolate compute cores.
+
+See also :doc:`/timers/no_hz`.
+
+``CONFIG_PREEMPT_RT``
+---------------------
+
+:Expectation: enabled
+:Severity: **fatal**
+
+This option must be enabled, or the resulting kernel will not be fully
+preemptible and real-time capable.
+
+``CONFIG_TRACING`` (and tracing options)
+----------------------------------------
+
+:Expectation: enabled
+:Severity: *info*
+
+Shipping kernels with tracing support enabled (but not actively running) is
+highly recommended.  This will allow the users to extract more information
+if latency problems arise.  Nonetheless, some tracers do incur latency
+overhead by just being enabled; see :ref:`tracers`.
+
+.. caution::
+
+  Users should *not* make use of tracers or trace events during production
+  real-time kernel operation as they can add considerable overhead and
+  degrade the system's latency.
+
+Non-performance CPU frequency governors
+---------------------------------------
+
+:Expectation: disabled
+:Severity: *medium*
+
+To ensure reproducible system latency measurements, disable the
+non-``PERFORMANCE`` CPU frequency governors when possible.  This avoids the
+risk of unknown userspace tasks implicitly or explicitly setting a
+different CPU frequency governor, and thus achieving different latency
+results across the system's runtime.
+
+If disabling other frequency governors is not an option, then
+``CPU_FREQ_DEFAULT_GOV_USERSPACE`` should be enabled.  In that case, users
+should set a *stable* CPU frequency setting during the system runtime, as
+changing the CPU frequency will increase the system latency and affect
+latency measurements reproducibility.  If a lower CPU frequency is desired,
+then ``CPU_FREQ_DEFAULT_GOV_POWERSAVE`` should be set.
+
+The ``ONDEMAND`` CPU frequency governor should *not* be enabled in a
+real-time system since it dramatically affects determinism depending on the
+workload.
+
+For more information, please check :doc:`/admin-guide/pm/cpufreq`.
+
+Kernel Debug Options
+====================
+
+Most kernel debug options add runtime overhead that increases the
+worst-case latency.
+
+.. TODO: Connect lockdep with PROVE_LOCKING.  Make it clear that it does
+.. not uncover latency issues.
+
+.. caution::
+
+  During development and early testing, users are encouraged to run their
+  real-time workloads and peripherals with lockdep (:ref:`lockdep`) and
+  other kernel debug options enabled, for a considerable amount of time.
+  Such workloads might trigger kernel code paths that were not triggered
+  during the internal Linux real-time kernel development, thus helping to
+  uncover locking bugs and any real-time latency issues in the kernel.
+
+Problematic debug options
+-------------------------
+
+``CONFIG_LOCKUP_DETECTOR``
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+The lockup detector creates kernel timer callbacks that execute every few
+seconds, in hard-IRQ context, even on real-time kernels.  These periodic
+interrupts can cause latency spikes.
+
+Users should use hardware watchdogs instead, which will provide a similar
+functionality without the software-induced latency.
+
+.. _lockdep:
+
+``CONFIG_PROVE_LOCKING``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+Proving the correctness of all kernel locking adds substantial overhead
+and significantly increases worst-case latency.
+
+.. _tracers:
+
+``CONFIG_IRQSOFF_TRACER`` and ``CONFIG_PREEMPT_TRACER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Severity: *high*
+
+These tracers do incur measurable latency overhead even when tracing is not
+currently active.
+
+Allowed kernel debug options
+----------------------------
+
+Kernel debug options which are not included in this list should be enabled
+with caution, after extensive auditing of their impact on system latency.
+
+``CONFIG_DEBUG_ATOMIC_SLEEP``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This sanity check catches common kernel programming errors with
+a tolerable latency cost.
+
+``CONFIG_DEBUG_BUGVERBOSE``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This improves the debugging capabilities without affecting normal
+operation latency.
+
+``CONFIG_DEBUG_FS``
+^^^^^^^^^^^^^^^^^^^
+
+This is safe to include in real-time kernels, *provided that debugfs is
+not accessed during production runtime*.
+
+``CONFIG_DEBUG_INFO``
+^^^^^^^^^^^^^^^^^^^^^
+
+This increases the kernel image size but has no latency impact.  It is
+also essential for meaningful crash dumps and profiling.
+
+``CONFIG_DEBUG_KERNEL``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Meta-option which allows debug features to be enabled.  This configuration
+option has no runtime impact, but be aware of any debug features that it
+may have allowed to be enabled.
+
+Summary
+=======
+
+There is no "one size fits all" solution for configuring a real-time Linux
+system.  Beginning with the system real-time requirements, integrators
+must consider the features and functions of the system's hardware, kernel,
+and userspace.  All such components must be properly configured in order
+to establish and constrain the system's maximum latency.
+
+With that in mind, any incorrect real-time kernel configuration could cause
+a new maximum latency that shows up at the wrong time and is catastrophic
+for the real-time system's latency.
+
+References
+==========
+
+.. [1] See :doc:`/admin-guide/kernel-parameters`
+
+.. _development work: https://lore.kernel.org/r/20260205115559.1625236-1-bigeasy@linutronix.de
+
+.. _Real-Time and Graphics\: A Contradiction?: https://web.archive.org/web/20221025085614/https://linutronix.de/PDF/Realtime_and_graphics-acontradiction2021.pdf
-- 
2.53.0



* [PATCH v2 0/1] Documentation: Add real-time kernel configuration guide
From: Ahmed S. Darwish @ 2026-04-14 17:41 UTC (permalink / raw)
  To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
  Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
	Derek Barbosa, linux-doc, linux-kernel, Ahmed S. Darwish

Hi,

There is no "one size fits all" solution for configuring a PREEMPT_RT
kernel.  Introduce a PREEMPT_RT kernel configuration guide to better help
system developers and integrators.

Changelog v2
------------

Handle Rostedt remarks:

- Better reword certain paragraphs and statements
- Warn about enabling CONFIG_IRQSOFF_TRACER and CONFIG_PREEMPT_TRACER

Handle Wilcox remarks:

- Remove ToC comment + minor rewording

Changelog v1
------------

https://lore.kernel.org/lkml/20260305205023.361530-1-darwi@linutronix.de

Thanks,

8<-----

 Documentation/core-api/real-time/index.rst    |   1 +
 .../real-time/kernel-configuration.rst        | 313 ++++++++++++++++++
 2 files changed, 314 insertions(+)
 create mode 100644 Documentation/core-api/real-time/kernel-configuration.rst

base-commit: 028ef9c96e96197026887c0f092424679298aae8
-- 
2.53.0



* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:32 UTC (permalink / raw)
  To: Kairui Song
  Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
	bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
	dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
	lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
	lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
	pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
	roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
	surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
	zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com>

On Tue, Apr 14, 2026 at 10:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> * I still think there's a good chance we can *significantly* close the
> gap overall between a design with virtual swap and a design without.
> It's a bit premature to commit to a vswap-optional route (which to be
> completely honest I'm still not confident is possible to satisfy all
> of our requirements).

And to further note - these benchmarks measure, in effect, purely swap
overhead. In a production environment with a lot of non-swap work, as
long as the gap is close enough I think we would be fine, even for a
hostile case like a fast swapfile backend (I assume the bottleneck for
SSD swap will mostly be the IO).

I will stare at your responses to see if there are other benchmarks I
can play with, but it would be very helpful if you can share your full
suite :)


* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:23 UTC (permalink / raw)
  To: Kairui Song
  Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
	bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
	dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
	lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
	lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
	pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
	roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
	surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
	zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com>

On Mon, Mar 23, 2026 at 1:05 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 12:41 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > > This patch series is based on 6.19. There are a couple more
> > > > > swap-related changes in mainline that I would need to coordinate
> > > > > with, but I still want to send this out as an update for the
> > > > > regressions reported by Kairui Song in [15]. It's probably easier
> > > > > to just build this thing rather than dig through that series of
> > > > > emails to get the fix patch :)
> > > > >
> > > > > Changelog:
> > > > > * v4 -> v5:
> > > > >     * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > > > >     * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > > > >       and use guard(rcu) in vswap_cpu_dead
> > > > >       (reported by Peter Zijlstra [17]).
> > > > > * v3 -> v4:
> > > > >     * Fix poor swap free batching behavior to alleviate a regression
> > > > >       (reported by Kairui Song).
> > > >
> > >
> > > Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> > > the regression in this patch series - we can talk more about
> > > directions in another thread :)

Hi Kairui,

My apologies if I missed your response, but could you share with me
your full benchmark suite? It would be hugely useful, not just for
this series, but for all swap contributions in the future :) We should
do as much homework ourselves as possible :P

And apologies for the delayed response. I kept having to go back and
forth between investigating the regression and figuring out what was
going on with the build setups (I missed some of the CONFIGs you had
originally), reducing variance on hosts, etc.

I don't have PMEM, so I have only worked with zram backend so far. I
did manage to reproduce the regressions you showed me (albeit at a
much smaller gap on certain metrics than your cited numbers, which I
suspect is due to zram/pmem difference).

There are two benchmarks that I focused on:

1. Usemem - the exact command I ran is:

   time ./usemem --init-time -O -y -x -n 1 56G

My host is 32GB, 52 processor(s) / x86_64.

Build        real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline     175.6 +/- 3.6     —       121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
vss_v5       184.0 +/- 3.9   +4.8%     130.5 +/- 3.8   376,192 +/- 8,581    8,297 +/- 247

(I hope the formatting works, but let me know if it looks weird).

2. Memhog: time memhog 48G

My host for this one is 16 GB, 52 processors, x86_64 too.

Build        real (s)          vs base   sys (s)
baseline      80.5 +/- 1.9      —         62.7 +/- 2.0
vss_v5        83.0 +/- 1.8    +3.1%       65.7 +/- 1.8

On both benchmarks, I enabled MGLRU, to more closely match the setup you had.

Staring at the run logs (and double-checking against the logs you sent me
to make sure it's not just on my system), there are some common patterns
I noticed across these runs:

1. Kswapd is slower on the vswap side, which shifts work towards
direct reclaim, and makes compaction have to run harder (which has a
weird contention through zsmalloc - I can expand further, but this is
not vswap-specific, just exacerbated by slower kswapd).

2. Higher swap readahead (albeit with higher hit rate) - this is more
of an artifact of the fact that zero swap pages are no longer backed
by zram swapfile, which skipped readahead in certain paths. We can
ignore this for now, but worth assessing this for fast swap backends
in general (zero swap pages, zswap, so on and so forth).

I spent some time perf-ing kswapd, and hacked the usemem binary a bit so
that I can perf the free stage of usemem separately. Most of the
vswap-specific overhead lies in the xarray lookups. Some big offenders
off the top of my head:

1. Right now, in the physical swap allocator, whenever we have an
allocated slot in the range we're checking, we check if that slot is
swap-cache-only (i.e no swap count), and if so we try to free it (if
swapfile is almost full etc.). This check is cheap if all swap entry
metadata live in physical swap layer only, but more expensive when you
have to go through another layer of indirection :)

I fixed that by just taking one bit in the reverse map to track
swap-cache-only state, which eliminates this without extra space
overhead (on top of the existing design).
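
To illustrate the kind of bit-stealing described above, here is a small
Python model. It is not the actual patch: the 64-bit entry width, the
choice of the top bit, and all names are assumptions made up for this
sketch. The point is only that the swap-cache-only state can ride along
in an entry that already exists, so checking it needs no extra lookup
and no extra space.

```python
ENTRY_BITS = 64                          # assumed width of one reverse-map entry
CACHE_ONLY_BIT = 1 << (ENTRY_BITS - 1)   # hypothetical: top bit = swap-cache-only
VSLOT_MASK = CACHE_ONLY_BIT - 1          # remaining bits hold the virtual slot


def pack(vslot: int, cache_only: bool) -> int:
    """Build a reverse-map entry from a virtual slot and the state bit."""
    if not 0 <= vslot <= VSLOT_MASK:
        raise ValueError("virtual slot does not fit next to the state bit")
    return vslot | (CACHE_ONLY_BIT if cache_only else 0)


def vslot_of(entry: int) -> int:
    """Recover the virtual slot, ignoring the state bit."""
    return entry & VSLOT_MASK


def is_cache_only(entry: int) -> bool:
    """Cheap check: one mask on the entry, no trip to another layer."""
    return bool(entry & CACHE_ONLY_BIT)


def set_cache_only(entry: int, cache_only: bool) -> int:
    """Return the entry with the state bit updated and the slot unchanged."""
    return pack(vslot_of(entry), cache_only)


entry = pack(0x1234, cache_only=True)
print(hex(vslot_of(entry)), is_cache_only(entry))
```

The cost of this approach is one bit less for the slot index, which the
sketch enforces in pack().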

2. On the free path, in swap_pte_batch(), we check cgroup to make sure
that the range we pass to free_swap_and_cache_nr() belongs to the same
cgroup, which has a per-PTE overhead for going to the vswap layer. We
can make this check once-per range instead, to reduce overhead. Even
better - we can skip this check in swap_pte_batch() for the free case,
and defer it to later on, where we already enter the vswap
cluster lock context :)

With a bunch of changes like that, I closed the gap considerably:

usemem:
Build        real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline     175.6 +/- 3.6     —       121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
new_opt_v2   179.8 +/- 3.0   +2.4%     126.1 +/- 2.9   382,536 +/- 6,662    7,105 +/- 183

memhog:
Build        real (s)          vs base   sys (s)
baseline      80.5 +/- 1.9      —         62.7 +/- 2.0
new_opt_v2    79.9 +/- 1.7    -0.8%       62.4 +/- 1.7

I would like to also point out that, some of this overhead is specific
to the swapfile backend case, which is why we don't see this in zswap
in the stats I included in V5. Zswap does not require this
swap-cache-only dance, because in virtual swap, zswap only needs the
virtual swap slot as the index (on top of much more negligible space
overhead thanks to zswap tree merging into vswap cluster, no swap
charging, no double allocation, etc.).

Anyway, still a small gap. The next idea that I have is inspired by
the TLB, which caches virtual->physical memory address translations. I
added a per-CPU MRU virtual cluster. The idea is that a lot of
consecutive swap operations operate on the same range of swap entries -
merging these operations of course makes the most sense, but sometimes
it's not convenient to do so. The old, non-vswap design sometimes locks
the physical swap cluster and exposes the swap cluster struct to
callers to pass around, but I would like to avoid that if possible :)
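
To make the MRU idea concrete, here is a minimal userspace sketch. It is
not the code from the series: the names (vcluster, mru_cache,
lookup_cluster, slow_lookup), the 512-entry cluster size, and the
single-threaded struct standing in for a per-CPU slot are all
assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 512 swap entries per virtual cluster. */
#define CLUSTER_SHIFT 9
#define NR_CLUSTERS   64

/* Hypothetical stand-in for a virtual swap cluster. */
struct vcluster {
	uint64_t id;
};

/* One-entry most-recently-used cache, modelling a per-CPU slot. */
struct mru_cache {
	struct vcluster *last;
	unsigned long hits;
	unsigned long misses;
};

static struct vcluster clusters[NR_CLUSTERS];

/* Stand-in for the expensive lookup (e.g. a tree walk in a real kernel). */
static struct vcluster *slow_lookup(uint64_t cluster_id)
{
	if (cluster_id >= NR_CLUSTERS)
		return NULL;
	clusters[cluster_id].id = cluster_id;
	return &clusters[cluster_id];
}

/* Map a swap entry to its cluster, reusing the last cluster when possible. */
static struct vcluster *lookup_cluster(struct mru_cache *c, uint64_t entry)
{
	uint64_t id = entry >> CLUSTER_SHIFT;

	if (c->last && c->last->id == id) {
		c->hits++;
		return c->last;
	}
	c->misses++;
	c->last = slow_lookup(id);
	return c->last;
}
```

Consecutive operations on entries 0..511 would take the slow path once
and hit the cached cluster afterwards, which is the access pattern
described above.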

With this change, we close the gap even further - exceeding the
baseline on average in certain cases, but as you can see it's within
noise, so I wouldn't read too much into it:

usemem:
Build        real (s)         vs base   sys (s)          tput (KB/s)          free_ms
baseline     175.6 +/- 3.6    -         121.9 +/- 3.3    391,941 +/- 8,333    6,992 +/- 204
cc_v2        176.4 +/- 5.3    +0.4%     123.6 +/- 5.4    390,405 +/- 12,792   6,987 +/- 296

memhog:
Build        real (s)          vs base   sys (s)
baseline      80.5 +/- 1.9      —         62.7 +/- 2.0
cc_v2         79.9 +/- 0.9    -0.8%       62.1 +/- 1.5

The reclaim and compaction stats tell a similar story:

Reclaim / Compaction (usemem)
Metric            baseline                 vss_v5                   new_opt_v2               cc_v2
allocstall        167,787 +/- 10,292       170,532 +/- 15,185       169,782 +/- 9,903        168,635 +/- 13,526
pgsteal_kswapd    6,932,143 +/- 186,411    6,965,962 +/- 288,323    6,968,188 +/- 286,383    7,038,513 +/- 202,696
pgsteal_direct    9,759,350 +/- 480,674    9,978,721 +/- 765,543    9,899,698 +/- 480,781    9,845,668 +/- 544,319
swap_ra           82.9 +/- 22.6            5994.8 +/- 2817.5        4976.8 +/- 1484.2        4718.2 +/- 1510.5
pgmigrate         1,029,901 +/- 428,416    1,687,072 +/- 399,505    1,260,451 +/- 202,603    1,144,560 +/- 490,177

Reclaim / Compaction (memhog)
Metric            baseline                 vss_v5                   new_opt_v2               cc_v2
allocstall        101,245 +/- 6,271        109,320 +/- 12,180       100,207 +/- 11,053       99,223 +/- 9,905
pgsteal_kswapd    8,817,264 +/- 432,519    8,436,548 +/- 265,763    8,728,944 +/- 305,101    8,962,443 +/- 589,012
pgsteal_direct    5,408,046 +/- 394,775    5,932,611 +/- 584,873    5,419,891 +/- 551,226    5,349,352 +/- 601,655
swap_ra           66.5 +/- 22.8            8589.5 +/- 3325.1        8954.5 +/- 2661.9        8703.1 +/- 1746.6
pgmigrate         239,410 +/- 46,014       277,193 +/- 71,487       320,672 +/- 59,488       243,989 +/- 136,129

You can see that the later versions gradually restore the behavior
of baseline in terms of reclaim dynamics :)

Some final remarks:
* I still think there's a good chance we can *significantly* close the
gap overall between a design with virtual swap and a design without.
It's a bit premature to commit to a vswap-optional route (which, to be
completely honest, I'm still not confident can satisfy all of our
requirements).

* Regardless of the direction we take, these are all pitfalls that
will be problematic for virtual swap design, and more generally some
of them will affect any dynamic swap design (which has to go through
some sort of indirection or a dynamic data structure like xarray that
will induce some amount of lookup overhead). I hope my work here can
be useful in this sense too, outside of this specific vswap direction
:)

I will clean things up a bit and send you a v6 for further inspection.
Once again, I'd like to express my gratitude for your engagement and
feedback.

^ permalink raw reply

* Re: [PATCH v5] Documentation: Refactored watchdog old doc
From: Guenter Roeck @ 2026-04-14 17:18 UTC (permalink / raw)
  To: Sunny Patel, linux-doc; +Cc: linux-watchdog, linux-kernel, corbet, wim, rdunlap
In-Reply-To: <20260413041215.10362-1-nueralspacetech@gmail.com>

On 4/12/26 21:11, Sunny Patel wrote:
> Mark WDIOC_GETTEMP and WDIOS_TEMPPANIC as deprecated since
> neither is implemented by the watchdog core and both are only
> present in a small number of legacy drivers.
> 
> Add documentation for previously undocumented status bits
> WDIOF_MAGICCLOSE and WDIOF_ALARMONLY in the options field.
> 
> Add documentation for WDIOF_PRETIMEOUT and WDIOF_SETTIMEOUT
> status bits describing their respective ioctls.
> 
> Fix the following issues in existing documentation:
>    - Remove version-specific reference to Linux 2.4.18 from
>      the GETTIMEOUT ioctl description
>    - Fix duplicate "was is" in printf format strings
>    - Replace [FIXME] placeholder with proper descriptions for
>      WDIOS_DISABLECARD, WDIOS_ENABLECARD and WDIOS_TEMPPANIC
> 
> Signed-off-by: Sunny Patel <nueralspacetech@gmail.com>

Reviewed-by: Guenter Roeck <linux@roeck-us.net>

> ---
> 
> Changes in v5:
>    - Fixed WDIOC_GETTIMELEFT printf statement to correctly reference
>      "timeleft" instead of "timeout".
>    
> Changes in v4:
>    - Fixed WDIOS_DISABLECARD description: corrected inverted logic —
>      the ioctl disables the hardware timer entirely rather than
>      stopping pings. Clarified that userspace, not the kernel driver,
>      is primarily responsible for pinging under normal operation.
> 
>   Documentation/watchdog/watchdog-api.rst | 65 +++++++++++++++++++++----
>   1 file changed, 55 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/watchdog/watchdog-api.rst b/Documentation/watchdog/watchdog-api.rst
> index 78e228c272cf..736436a68f65 100644
> --- a/Documentation/watchdog/watchdog-api.rst
> +++ b/Documentation/watchdog/watchdog-api.rst
> @@ -2,7 +2,7 @@
>   The Linux Watchdog driver API
>   =============================
>   
> -Last reviewed: 10/05/2007
> +Last reviewed: 04/08/2026
>   
>   
>   
> @@ -42,7 +42,7 @@ activates as soon as /dev/watchdog is opened and will reboot unless
>   the watchdog is pinged within a certain time, this time is called the
>   timeout or margin.  The simplest way to ping the watchdog is to write
>   some data to the device.  So a very simple watchdog daemon would look
> -like this source file:  see samples/watchdog/watchdog-simple.c
> +like this source file: see samples/watchdog/watchdog-simple.c
>   
>   A more advanced driver could for example check that a HTTP server is
>   still responding before doing the write call to ping the watchdog.
> @@ -106,11 +106,10 @@ the requested one due to limitation of the hardware::
>   This example might actually print "The timeout was set to 60 seconds"
>   if the device has a granularity of minutes for its timeout.
>   
> -Starting with the Linux 2.4.18 kernel, it is possible to query the
> -current timeout using the GETTIMEOUT ioctl::
> +It is also possible to get the current timeout with the GETTIMEOUT ioctl::
>   
>       ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
> -    printf("The timeout was is %d seconds\n", timeout);
> +    printf("The timeout is %d seconds\n", timeout);
>   
>   Pretimeouts
>   ===========
> @@ -133,7 +132,7 @@ seconds.  Setting a pretimeout to zero disables it.
>   There is also a get function for getting the pretimeout::
>   
>       ioctl(fd, WDIOC_GETPRETIMEOUT, &timeout);
> -    printf("The pretimeout was is %d seconds\n", timeout);
> +    printf("The pretimeout is %d seconds\n", timeout);
>   
>   Not all watchdog drivers will support a pretimeout.
>   
> @@ -145,12 +144,12 @@ before the system will reboot. The WDIOC_GETTIMELEFT is the ioctl
>   that returns the number of seconds before reboot::
>   
>       ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);
> -    printf("The timeout was is %d seconds\n", timeleft);
> +    printf("The timeleft is %d seconds\n", timeleft);
>   
>   Environmental monitoring
>   ========================
>   
> -All watchdog drivers are required return more information about the system,
> +All watchdog drivers are required to return more information about the system,
>   some do temperature, fan and power level monitoring, some can tell you
>   the reason for the last reboot of the system.  The GETSUPPORT ioctl is
>   available to ask what the device can do::
> @@ -227,12 +226,33 @@ The watchdog saw a keepalive ping since it was last queried.
>   	WDIOF_SETTIMEOUT	Can set/get the timeout
>   	================	=======================
>   
> -The watchdog can do pretimeouts.
> +The watchdog supports timeout set/get via the WDIOC_SETTIMEOUT and
> +WDIOC_GETTIMEOUT ioctls.
>   
>   	================	================================
>   	WDIOF_PRETIMEOUT	Pretimeout (in seconds), get/set
>   	================	================================
>   
> +The watchdog supports a pretimeout, a warning interrupt that fires before
> +the actual reboot timeout. Use WDIOC_SETPRETIMEOUT and WDIOC_GETPRETIMEOUT
> +to set/get the pretimeout.
> +
> +	================	================================
> +	WDIOF_MAGICCLOSE	Supports magic close char
> +	================	================================
> +
> +The driver supports the Magic Close feature. The watchdog is only disabled
> +if the character 'V' is written to /dev/watchdog before the file descriptor
> +is closed. Without writing 'V' before closing, the watchdog remains active
> +and will trigger a reboot after the timeout expires.
> +
> +	================	================================
> +	WDIOF_ALARMONLY		Not a reboot watchdog
> +	================	================================
> +
> +The watchdog will not reboot the system when it expires. Instead it
> +triggers a management or other external alarm. Userspace should not
> +rely on a system reboot occurring.
>   
>   For those drivers that return any bits set in the option field, the
>   GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current
> @@ -254,6 +274,11 @@ returned value is the temperature in degrees Fahrenheit::
>       int temperature;
>       ioctl(fd, WDIOC_GETTEMP, &temperature);
>   
> +.. note::
> +	``WDIOC_GETTEMP`` is not implemented by the watchdog core and is
> +	considered deprecated. It is only supported by a small number of
> +	legacy drivers. New drivers should not implement it.
> +
>   Finally the SETOPTIONS ioctl can be used to control some aspects of
>   the cards operation::
>   
> @@ -268,4 +293,24 @@ The following options are available:
>   	WDIOS_TEMPPANIC		Kernel panic on temperature trip
>   	=================	================================
>   
> -[FIXME -- better explanations]
> +``WDIOS_DISABLECARD`` disables the hardware watchdog timer entirely,
> +allowing a controlled system shutdown without triggering a reboot.
> +Userspace is responsible for pinging the watchdog under normal
> +operation; this ioctl stops the underlying hardware timer so that
> +the absence of pings no longer causes a system reset.
> +
> +``WDIOS_ENABLECARD`` starts the watchdog timer. If the watchdog was
> +previously stopped via ``WDIOS_DISABLECARD``, this will re-enable it. The
> +hardware watchdog will begin counting down from the configured timeout.
> +
> +``WDIOS_TEMPPANIC`` enables temperature-based kernel panic. When set,
> +the driver will call ``panic()`` (or ``kernel_power_off()`` on some
> +drivers) if the hardware temperature sensor exceeds its threshold,
> +rather than only setting the ``WDIOF_OVERHEAT`` status bit. Support
> +for this option is driver-specific; not all watchdog drivers implement
> +temperature monitoring.
> +
> +.. note::
> +	``WDIOS_TEMPPANIC`` is not implemented by the watchdog core and is
> +	considered deprecated. It is only present in a small number of
> +	legacy drivers. New drivers should not implement it.
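
For readers following the new WDIOF_MAGICCLOSE and WDIOF_ALARMONLY
text, a minimal userspace sketch of how a daemon might act on those
bits. The helper names (wd_has_magic_close, wd_expiry_reboots,
wd_close) are made up for illustration; only the ioctl, the flag
constants and struct watchdog_info come from <linux/watchdog.h>. It
assumes fd was obtained by opening /dev/watchdog.

```c
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return 1 if the option bits advertise Magic Close, 0 otherwise. */
static int wd_has_magic_close(unsigned int options)
{
	return (options & WDIOF_MAGICCLOSE) != 0;
}

/* Return 1 if expiry is expected to reset the system, 0 if alarm-only. */
static int wd_expiry_reboots(unsigned int options)
{
	return (options & WDIOF_ALARMONLY) == 0;
}

/*
 * Close the watchdog device, writing the magic character 'V' first
 * when the driver reports WDIOF_MAGICCLOSE.
 * Returns 0 on success, -1 if the write or the close failed.
 */
static int wd_close(int fd)
{
	struct watchdog_info info;
	int ret = 0;

	if (ioctl(fd, WDIOC_GETSUPPORT, &info) == 0 &&
	    wd_has_magic_close(info.options)) {
		if (write(fd, "V", 1) != 1)
			ret = -1;
	}
	if (close(fd) != 0)
		ret = -1;
	return ret;
}
```

Whether the timer actually stops after the magic close still depends on
the driver and on settings such as nowayout, so the return value here
only reports whether the write and close themselves succeeded.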


^ permalink raw reply

* Re: [PATCH 4/6] hugetlb: drop vma_hugecache_offset() in favor of linear_page_index()
From: jane.chu @ 2026-04-14 17:14 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, david, muchun.song, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, corbet, skhan, hughd, baolin.wang, peterx,
	linux-mm, linux-doc, linux-kernel
In-Reply-To: <ad4Og_719Yq4yshf@localhost.localdomain>



On 4/14/2026 2:53 AM, Oscar Salvador wrote:
> On Thu, Apr 09, 2026 at 05:41:55PM -0600, Jane Chu wrote:
>> vma_hugecache_offset() converts a hugetlb VMA address into a mapping
>> offset in hugepage units. While the helper is small, its name is not very
>> clear, and the resulting code is harder to follow than using the common MM
>> helper directly.
>>
>> Use linear_page_index() instead, with an explicit conversion from
>> PAGE_SIZE units to hugepage units at each call site, and remove
>> vma_hugecache_offset().
>>
>> This makes the code a bit more direct and avoids a hugetlb-specific helper
>> whose behavior is already expressible with existing MM primitives.
>>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> 
> 
> Looks good to me, the only thing is the conversion to hugepage units
> which may not be very clear to the casual reader, but you already
> mentioned that you will add a helper, so all good.
> 
>   
Yes, will do.

thanks!
-jane
> 
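
For the casual reader Oscar mentions, the unit conversion can be
sketched in userspace as follows. This is only a model: struct
model_vma, the model_* helpers and the 4 KiB base page size are
assumptions for illustration, not the helper that will be added to the
series.

```c
/* Assumption: 4 KiB base pages. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-in for the two vm_area_struct fields used here. */
struct model_vma {
	unsigned long vm_start; /* first address of the mapping */
	unsigned long vm_pgoff; /* file offset of vm_start, in base pages */
};

/* Models linear_page_index(): offset into the mapping in base pages. */
static unsigned long model_linear_page_index(const struct model_vma *vma,
					     unsigned long addr)
{
	return ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Convert the base-page index to hugepage units.
 * order is log2(hugepage size / base page size), e.g. 9 for 2 MiB.
 */
static unsigned long model_hugepage_index(const struct model_vma *vma,
					  unsigned long addr,
					  unsigned int order)
{
	return model_linear_page_index(vma, addr) >> order;
}
```

With vm_start and vm_pgoff both hugepage aligned, as they are for
hugetlb mappings, the shifted result matches shifting the address
offset and vm_pgoff separately.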


^ permalink raw reply

* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Kiryl Shutsemau @ 2026-04-14 17:10 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Peter Xu, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm
In-Reply-To: <55019037-4f1c-4d9c-83ee-3a844d8f3d5e@kernel.org>

On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
> On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> > 
> > == Problem ==
> > 
> > VMMs managing guest memory need to:
> > 1. Track which pages are actively used (working set detection)
> > 2. Safely evict cold pages to slower storage
> > 3. Fetch pages back on demand when accessed again
> > 
> > For shmem-backed guest memory, working set tracking partially works
> > today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> > re-access auto-resolves from cache. But safe eviction still requires
> > synchronous fault interception to prevent data loss races.
> > 
> > For anonymous guest memory (needed for KSM cross-VM deduplication),
> > there is no mechanism at all — clearing a PTE loses the page.
> > 
> > == Solution ==
> > 
> > The series introduces a unified userfaultfd interface that works
> > across both anonymous and shmem-backed memory:
> > 
> > UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> > private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> > balancing) to make pages inaccessible without freeing them.
> 
> I would rather tackle this from the other direction: it's another form
> of protection (like WP), not really a "minor" mode.
> 
> Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
> and support it for anon+shmem, avoiding the zapping for shmem completely?

I like this idea.

It should be functionally equivalent, but your interface idea fits
better with the rest.

Thanks! Will give it a try.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Kiryl Shutsemau @ 2026-04-14 17:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	James Houghton, Andrea Arcangeli
In-Reply-To: <ad5dIUpAMs4MuBvV@x1.local>

On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> Hi, Kiryl,
> 
> On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> 
> Thanks for sharing this work, it looks very interesting to me.
> 
> Personally I am also looking at some kind of VMM memtiering issues.  I'm
> not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> facing, it's slightly different but still a bit relevant:
> 
> https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

Thanks, will read up. I didn't follow userfaultfd work until recently.

> Unfortunately, that proposal was rejected upstream.

Sorry about that. We can chat about it in the hallway track, if you are there :)

> > == VMM Workflow ==
> 
> AFAIU, this workflow provides two functionalities:
> 
> > 
> >     UFFDIO_DEACTIVATE(all)            -- async, no vCPU stalls
> >     sleep(interval)
> >     PAGEMAP_SCAN                      -- find cold pages
> 
> Until here it's only about page hotness tracking.  I am curious whether you
> evaluated idle page tracking.  Is it because of perf overheads on rmap?

I didn't give idle page tracking much thought. I needed uffd faults to
serialize reclaim against memory accesses. If we use it for one thing,
we might as well try to use it for tracking too. And it seems to fit
together nicely with sync/async mode flipping.

> To
> me, your solution (until here.. on the hotness sampling) reads more like a
> more efficient way to do idle page tracking but only per-mm, not per-folio.
> 
> That will also be something I would like to benefit if QEMU will decide to
> do full userspace swap.  I think that's our last resort, I'll likely start
> with something that makes QEMU work together with Linux on swapping
> (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> currently uses, as long as efficient) then QEMU only cares about the rest,
> which is what the migration problem is about.
> 
> The other issue about idle page tracking to us is, I believe MGLRU
> currently doesn't work well with it (due to ignoring IDLE bits) where the
> old LRU algo works.  I'm not sure how much you evaluated above, so it'll be
> great to share from that perspective too.  I also mentioned some of these
> challenges in the lsfmm proposal link above.
> 
> >     UFFDIO_SET_MODE(sync)             -- block faults for eviction
> >     pwrite + MADV_DONTNEED cold pages -- safe, faults block
> >     UFFDIO_SET_MODE(async)            -- resume tracking
> 
> These operations are the 2nd function.  It's, IMHO, a full userspace swap
> system based on userfaultfd.

Right. And we want to decide where to put cold pages from userspace. 

> Have you thought about directly relying on userfaultfd-wp to do this work?
> The relevant question is, why do we need to block guest reads on pages
> being evicted by the userapp?  Can we still allow that to happen, which
> seems to be more efficient?  IIUC, only writes / updates matters in such
> swap system.

But we do care about read accesses. We don't want to swap out
pages that got read-touched. And we cannot in practice switch to WP mode
after PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls,
each with a TLB flush.

With my approach, switching between tracking and reclaiming is a single
bit flip under mmap lock.

> Also, I'm not sure if you're aware of LLNL's umap library:
> 
> https://github.com/llnl/umap
> 
> That implemented the swap system using userfaultfd wr-protect mode only, so
> no new kernel API needed.

Will look into it. Thanks.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [GIT PULL] Documentation for 7.1
From: pr-tracker-bot @ 2026-04-14 16:46 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Linus Torvalds, linux-kernel, linux-doc, Shuah Khan
In-Reply-To: <87bjfnzw2x.fsf@trenco.lwn.net>

The pull request you sent on Sun, 12 Apr 2026 15:51:18 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux.git tags/docs-7.1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5181afcdf99527dd92a88f80fc4d0d8013e1b510

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 16:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: YoungJun Park, Liam.Howlett, akpm, apopple, axelrasmussen, baohua,
	baolin.wang, bhe, byungchul, cgroups, chengming.zhou, chrisl,
	corbet, david, dev.jain, gourry, hannes, hughd, jannh,
	joshua.hahnjy, lance.yang, lenb, linux-doc, linux-kernel,
	linux-mm, linux-pm, lorenzo.stoakes, matthew.brost, mhocko,
	muchun.song, npache, pavel, peterx, peterz, pfalcato, rafael,
	rakie.kim, roman.gushchin, rppt, ryan.roberts, shakeel.butt,
	shikemeng, surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed,
	yuanchu, zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAMgjq7BO6SLZPfNXDh1F-7RAOqDAfqMQ4PM=qjAq1mCsWyD0LQ@mail.gmail.com>

On Mon, Apr 13, 2026 at 8:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 14, 2026 at 11:05 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
>
> Hi All,
>
> > On Sat, Apr 11, 2026 at 06:40:44PM -0700, Nhat Pham wrote:
> > > > 1. Modularization
> > > >
> > > > You removed CONFIG_* and went with a unified approach. I recall
> > > > you were also considering a module-based structure at some point.
> > > > What are your thoughts on that direction?
> > > >
> > >
> > > The CONFIG-based approach was a huge mess. It makes me not want to
> > > look at the code, and I'm the author :)
> > >
> > > > If we take that approach, we could extend the recent swap ops
> > > > patchset (https://lore.kernel.org/linux-mm/20260302104016.163542-1-bhe@redhat.com/)
> > > > as follows:
> > > > - Make vswap a swap module
> > > > - Have cluster allocation functions reside in swapops
> > > > - Enable vswap through swapon
> > >
> > > Hmmmmm.
> >
> > I think this would be a happy world, but I wonder what others think.
> > Anyway, I'm looking forward to the future direction.
> >
>
> Yeah, I agree with this.
>
> And I do think swapoff of the virtual space itself is also necessary,
> we really need a failsafe, e.g. a clean way to drop the swap
> cache and data, kind of like drop_caches or shrinker fs are
> commonly used.
>
> > > > 2. Flash-friendly swap integration (for my use case)
> > > >
> > > > I've been thinking about the flash-friendly swap concept that
> > > > I mentioned before and recently proposed:
> > > > (https://lore.kernel.org/linux-mm/aZW0voL4MmnMQlaR@yjaykim-PowerEdge-T330/)
> > > >
> > > > One of its core functions requires buffering RAM-swapped pages
> > > > and writing them sequentially at an appropriate time -- not
> > > > immediately, but in proper block-sized units, sequentially.
> > > >
> > > > This means allocated offsets must essentially be virtual, and
> > > > physical offsets need to be managed separately at the actual
> > > > write time.
> > > >
> > > > If we integrate this into the current vswap, we would either
> > > > need vswap itself to handle the sequential writes (bypassing
> > > > the physical device and receiving pages directly), or swapon
> > > > a swap device and have vswap obtain physical offsets from it.
> > > > But since those offsets cannot be used directly (due to
> > > > buffering and sequential write requirements), they become
> > > > virtual too, resulting in:
> > > >
> > > >   virtual -> virtual -> physical
> > > >
> > > > This triple indirection is not ideal.
> > > >
> > > > However, if the modularization from point 1 is achieved and
> > > > vswap acts as a swap device itself, then we can cleanly
> > > > establish a:
> > > >
> > > >   virtual -> physical
> > >
> > > I read that thread some time ago. Some remarks:
> > >
> > > 1. I think Christoph has a point. Seems like some of your ideas are
> > > broadly applicable to swap in general. Maybe fixing swap infra
> > > generally would make a lot of sense?
> >
> > Broadly speaking, there are two main ideas:
> > 1. Swap I/O buffering (which is also tied to cluster management issues)
> > 2. Deduplication
> >
> > Are you leaning towards the view that these two should be placed in a
> > higher layer?
>
> IMHO the swap infra should be doing less, not more, so we can have
> more flexible design, and different backends can implement their own
> way to manage the data and layer. e.g. Having one backend being
> flash friendly and it can do this without caring or affecting other devices
> or backends.

I think that's what Youngjun already has, unless I misunderstand his
descriptions.

>
> > If it goes into ZSWAP, there would definitely be a clear advantage of
> > seeing dedup benefits across all swap devices. It's a technically
> > interesting area, and I'd like to discuss it in a separate thread if
> > I have more ideas or thoughts.
>
> Just brainstorming... Why don't we just merge these identical pages like
> KSM? Maybe at least zero folios might benefit a lot if we keep them
> mapped as RO instead of recording them in swap, seems better in the
> long term?

That's our preferred approach too. We just didn't manage to get that
to work (yet). :)

^ permalink raw reply

* [PATCH] docs: proc.rst: update description of VmallocUsed and VmallocChunk
From: Herve Vico @ 2026-04-14 15:38 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan
  Cc: Herve Vico, linux-kernel, linux-fsdevel, linux-doc

A long time ago, the behavior of two /proc/meminfo Vmalloc<...> counters
was modified twice without the doc being updated:

- v4.4 removes the expensive 'vmalloc_info' bookkeeping behind VmallocUsed
  and VmallocChunk, and makes both counters return zero:

  commit a5ad88ce8c7f ("mm: get rid of 'vmalloc_info' from /proc/meminfo")

- v5.3 reintroduces VmallocUsed, making it now report the physical memory
  allocated by vmalloc() calls rather than the vmalloc VA space:

  commit 97105f0ab7b8 ("mm: vmalloc: show number of vmalloc pages in
                        /proc/meminfo")

Let's update the doc to reflect the current behavior.

Signed-off-by: Herve Vico <herve.vico@sipearl.com>
---
 Documentation/filesystems/proc.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index b0c0d1b45b99..4a36ac960417 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1233,9 +1233,12 @@ Committed_AS
 VmallocTotal
               total size of vmalloc virtual address space
 VmallocUsed
-              amount of vmalloc area which is used
+              Amount of memory allocated by vmalloc() calls.
+              Note that VmallocTotal is a constant that refers to the size of
+              the vmalloc VA space, while VmallocUsed reports the amount of
+              memory allocated by vmalloc() calls.
 VmallocChunk
-              largest contiguous block of vmalloc area which is free
+              Deprecated, hardcoded to zero.
 Percpu
               Memory allocated to the percpu allocator used to back percpu
               allocations. This stat excludes the cost of metadata.

base-commit: d60bc140158342716e13ff0f8aa65642f43ba053
-- 
2.51.2
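
As a quick way to look at the counters described above, a small
userspace sketch that pulls one field out of /proc/meminfo-style text.
The function name meminfo_kb is made up for illustration, and it
assumes the usual "Key:   value kB" line layout.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find "Key:   value kB" in a /proc/meminfo-style buffer.
 * Returns 0 and stores the value in *kb on success, -1 if the key is
 * absent or its value cannot be parsed.
 */
static int meminfo_kb(const char *text, const char *key, unsigned long *kb)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Require the ':' so "Vmalloc" does not match "VmallocUsed". */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':')
			return sscanf(line + klen + 1, "%lu", kb) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Fed the contents of /proc/meminfo, this should report zero for
VmallocChunk on kernels with the behavior described in the patch, and
the vmalloc()-allocated amount for VmallocUsed.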


^ permalink raw reply related

* htmldocs: Documentation/core-api/mm-api:104: ./include/linux/mm_inline.h:577: WARNING: Inline emphasis start-string without end-string. [docutils]
From: kernel test robot @ 2026-04-14 15:52 UTC (permalink / raw)
  To: Dev Jain; +Cc: oe-kbuild-all, 0day robot, linux-doc

Hi Dev,

FYI, the error/warning was bisected to this commit, please ignore it if it's irrelevant.

tree:   https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-initialize-nr_pages-to-1-at-loop-start-in-try_to_unmap_one/20260414-033035
head:   77dacfde3a6afac7fa3c015671d2452b524b37ad
commit: 1202b576c2e876c8cab1c41f1816a3c15bdf79d3 mm/memory: Batch set uffd-wp markers during zapping
date:   20 hours ago
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260414/202604141701.JY7pVkau-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604141701.JY7pVkau-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Documentation/core-api/kref:328: ./include/linux/kref.h:94: WARNING: Invalid C declaration: Expected end of definition. [error at 92]
   int kref_put_lock (struct kref *kref, void (*release)(struct kref *kref), spinlock_t *lock) __cond_acquires(true# lock)
   --------------------------------------------------------------------------------------------^
   WARNING: ./include/linux/mm_inline.h:591 function parameter 'pte' not described in 'install_uffd_wp_ptes_if_needed'
   WARNING: ./include/linux/mm_inline.h:591 function parameter 'pte' not described in 'install_uffd_wp_ptes_if_needed'
>> Documentation/core-api/mm-api:104: ./include/linux/mm_inline.h:577: WARNING: Inline emphasis start-string without end-string. [docutils]
   WARNING: ./include/crypto/skcipher.h:166 struct member 'SKCIPHER_ALG_COMMON' not described in 'skcipher_alg'
   Documentation/driver-api/basics:42: ./kernel/time/time.c:370: WARNING: Duplicate C declaration, also defined at driver-api/basics:440.
   Declaration is '.. c:function:: unsigned int jiffies_to_msecs (const unsigned long j)'. [duplicate_declaration.c]
   Documentation/driver-api/basics:42: ./kernel/time/time.c:393: WARNING: Duplicate C declaration, also defined at driver-api/basics:457.
   Declaration is '.. c:function:: unsigned int jiffies_to_usecs (const unsigned long j)'. [duplicate_declaration.c]

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: mkf @ 2026-04-14 15:37 UTC (permalink / raw)
  To: Jiayuan Chen, bpf
  Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>

On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
> 
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
> 
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
> 	-> tcp_push_pending_frames()
> 	  -> tcp_current_mss()
> 		-> tcp_established_options()
> 		  -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
> 			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> 
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
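
The guard pattern described here can be modelled in a few lines of
userspace C. This is only a sketch of the idea, not the patch: struct
model_sock and the model_* functions are hypothetical, and returning
-EBUSY from the rejected call is an assumption.

```c
#include <errno.h>

/* Hypothetical stand-in for the one-bit flag added to struct tcp_sock. */
struct model_sock {
	unsigned int hdr_opt_len_cb_inprogress : 1;
	unsigned int nodelay : 1;
	int cb_calls; /* how many times the callback ran */
};

static int model_hdr_opt_len(struct model_sock *sk);

/* Models bpf_setsockopt(TCP_NODELAY): refuse while the callback runs. */
static int model_set_nodelay(struct model_sock *sk)
{
	if (sk->hdr_opt_len_cb_inprogress)
		return -EBUSY; /* assumed error code */
	sk->nodelay = 1;
	/* Pushing pending frames recomputes the header option length. */
	(void)model_hdr_opt_len(sk);
	return 0;
}

/* Models bpf_skops_hdr_opt_len(): set the flag around the callback. */
static int model_hdr_opt_len(struct model_sock *sk)
{
	int ret;

	sk->hdr_opt_len_cb_inprogress = 1;
	sk->cb_calls++;
	/* What a misbehaving BPF program would do inside the callback. */
	ret = model_set_nodelay(sk);
	sk->hdr_opt_len_cb_inprogress = 0;
	return ret;
}
```

Without the flag check in model_set_nodelay(), the two functions would
call each other until the stack overflowed, which mirrors the call
chain shown above.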
> 
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>  	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>  					 * values defined in uapi/linux/tcp.h
>  					 */
> -	u8	bpf_chg_cc_inprogress:1; /* In the middle of
> +	u8	bpf_chg_cc_inprogress:1, /* In the middle of
>  					  * bpf_setsockopt(TCP_CONGESTION),
>  					  * it is to avoid the bpf_tcp_cc->init()
>  					  * to recur itself by calling
>  					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>  					  */
> +		bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +						  * callback so that a nested
> +						  * bpf_setsockopt(TCP_NODELAY) or
> +						  * bpf_setsockopt(TCP_CORK) cannot
> +						  * trigger tcp_push_pending_frames(),
> +						  * which would call tcp_current_mss()
> +						  * -> bpf_skops_hdr_opt_len(), causing
> +						  * infinite recursion.
> +						  */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 78b548158fb0..518699429a7a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5483,6 +5483,10 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
>  	if (sk->sk_protocol != IPPROTO_TCP)
>  		return -EINVAL;
>  
> +	if ((optname == TCP_NODELAY || optname == TCP_CORK) &&
> +	    tcp_sk(sk)->bpf_hdr_opt_len_cb_inprogress)
> +		return -EBUSY;
> +
TCP_CORK is not supported in sol_tcp_sockopt(); it returns -EINVAL by default.
Also, putting the check here would prevent us from calling
getsockopt(TCP_NODELAY) below.
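
A rough user-space model of that placement, for illustration only. It assumes
the check moves into the TCP_NODELAY case and applies only to the set path;
the struct, the MODEL_* constants and model_sol_tcp_sockopt() are hypothetical
stand-ins, not kernel code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the option names; not the kernel headers. */
enum { MODEL_TCP_NODELAY = 1, MODEL_TCP_MAXSEG = 2, MODEL_TCP_CORK = 3 };

/* Models only the one bit of struct tcp_sock that matters here. */
struct model_sockopt_tp {
	unsigned int hdr_opt_len_cb_inprogress:1;
};

static int model_sol_tcp_sockopt(struct model_sockopt_tp *tp, int optname,
				 bool getopt)
{
	switch (optname) {
	case MODEL_TCP_NODELAY:
		/* Reject only the set path; reading the option stays allowed. */
		if (!getopt && tp->hdr_opt_len_cb_inprogress)
			return -EBUSY;
		return 0;
	case MODEL_TCP_MAXSEG:
		return 0;
	default:
		/* Unsupported options such as TCP_CORK end up here. */
		return -EINVAL;
	}
}
```

With the bit set, a set of TCP_NODELAY is refused with -EBUSY, a get of
TCP_NODELAY still succeeds, and TCP_CORK keeps returning -EINVAL without
needing an explicit test.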

>  	switch (optname) {
>  	case TCP_NODELAY:
>  	case TCP_MAXSEG:
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index dafb63b923d0..fb06c464ac16 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -663,6 +663,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>  	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
>  
>  	newtp->bpf_chg_cc_inprogress = 0;
> +	newtp->bpf_hdr_opt_len_cb_inprogress = 0;
>  	tcp_bpf_clone(sk, newsk);
>  
>  	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118..c9654e690e1a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -475,6 +475,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  				  unsigned int *remaining)
>  {
>  	struct bpf_sock_ops_kern sock_ops;
> +	struct tcp_sock *tp = tcp_sk(sk);
>  	int err;
>  
>  	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> @@ -519,7 +520,9 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  	if (skb)
>  		bpf_skops_init_skb(&sock_ops, skb, 0);
>  
> +	tp->bpf_hdr_opt_len_cb_inprogress = 1;
We already check BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG before calling
BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(). Could that flag be used for the same
purpose, so that we don't need to add an extra field?

	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
	    !*remaining)
		return;
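
One possible reading of this idea, sketched as a user-space model rather than
kernel code: hide the flag while the callback runs so that a nested call takes
the early return. Every name below (model_tcp_sock, model_hdr_opt_len(),
MODEL_WRITE_HDR_OPT_CB_FLAG and its value) is a hypothetical stand-in, and the
model ignores a program that changes its cb flags from inside the callback.

```c
/* Hypothetical stand-in for the kernel flag; the value is illustrative. */
#define MODEL_WRITE_HDR_OPT_CB_FLAG (1u << 6)

struct model_tcp_sock {
	unsigned char cb_flags;     /* models tp->bpf_sock_ops_cb_flags */
	unsigned int callback_runs; /* times the modelled BPF program ran */
};

static void model_hdr_opt_len(struct model_tcp_sock *tp);

/* Models tcp_push_pending_frames() -> tcp_current_mss() ->
 * tcp_established_options() -> bpf_skops_hdr_opt_len().
 */
static void model_push_pending_frames(struct model_tcp_sock *tp)
{
	model_hdr_opt_len(tp);
}

/* Models a BPF program that calls bpf_setsockopt(TCP_NODELAY). */
static void model_bpf_callback(struct model_tcp_sock *tp)
{
	tp->callback_runs++;
	model_push_pending_frames(tp);
}

static void model_hdr_opt_len(struct model_tcp_sock *tp)
{
	unsigned char saved;

	/* Same early return as the existing flag test. */
	if (!(tp->cb_flags & MODEL_WRITE_HDR_OPT_CB_FLAG))
		return;

	/* Assumption: clear the flag for the duration of the callback so
	 * that a nested call returns above instead of recursing.
	 */
	saved = tp->cb_flags;
	tp->cb_flags &= (unsigned char)~MODEL_WRITE_HDR_OPT_CB_FLAG;
	model_bpf_callback(tp);
	tp->cb_flags = saved;
}
```

In this model the callback runs exactly once per outer call and the flag is
restored afterwards. Whether restoring a saved value is safe in the kernel
depends on whether the program may update its cb flags during the callback,
which the sketch does not cover.
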
>  	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> +	tp->bpf_hdr_opt_len_cb_inprogress = 0;
>  
>  	if (err || sock_ops.remaining_opt_len == *remaining)
>  		return;

-- 
Thanks,
KaFai

