Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH v5 1/2] serial: qcom-geni: trace: Drop redundant len field from geni_serial_data
From: Konrad Dybcio @ 2026-06-18  8:55 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Greg Kroah-Hartman, Jiri Slaby
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <20260615-add-tracepoints-for-qcom-geni-serial-v5-1-2efa4c97e0e2@oss.qualcomm.com>

On 6/15/26 4:16 PM, Praveen Talari wrote:
> The dynamic array stored in the ring buffer already carries its own
> length in the array metadata. There is no need to also store it as a
> separate scalar field in the entry struct.
> 
> Drop __field(unsigned int, len) and the corresponding __entry->len
> assignment, and use __get_dynamic_array_len(data) in the TP_printk for
> both the len=%u format argument and the __print_hex() size argument.
> This saves 4 bytes per event on the ring buffer.
> 
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad

^ permalink raw reply

* [PATCH] tracing/probes: Remove WARN_ON_ONCE from parse_btf_arg
From: Masami Hiramatsu (Google) @ 2026-06-18  8:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Sashiko found that user can cause this WARN_ON_ONCE() easily
with adding a kprobe event based on a raw address with BTF
parameter.

Since this is not an unexpected condition, remove the
WARN_ON_ONCE().

Link: https://sashiko.dev/#/patchset/178165816303.269421.7302603996990753309.stgit%40devnote2

Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: b576e09701c7 ("tracing/probes: Support function parameters if BTF is available")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/trace_probe.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd1caa1f9723..98532c503d02 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -678,7 +678,7 @@ static int parse_btf_arg(char *varname,
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
+	if (!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);


^ permalink raw reply related

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-18  8:31 UTC (permalink / raw)
  To: Gregory Price, Brendan Jackman
  Cc: Vlastimil Babka (SUSE), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <ajFT235iYsSJ7nbR@gourry-fedora-PF4VCD3F>

On 6/16/26 15:47, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 11:57:42AM +0000, Brendan Jackman wrote:
>> On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
>>>
>>> I think the memalloc approach is dangerous due to unexpected nesting. There
>>> might be nested page allocations in page allocation itself (due to some
>>> debugging option). But also interrupts do not change what "current" points
>>> to. Suddenly those could start requesting folios and/or private nodes and be
>>> surprised, I'm afraid.
>>
>> Minor side-note: couldn't we just define it such that the allocator
>> ignores the context when not in_task() (and warn if you try to enter the
>> context while not currently in_task())?
>>
>> (Don't think this would change the conclusion very much, e.g. doesn't
>> help with the nesting issues. Mostly curious in case I'm missing a
>> detail here).
>>

So I took a look at which nested allocations we could end up having, and I
wonder whether gfp_nested_mask() indicates all these?

If we could reliably identify them, all we'd have to do is safe+restore some
context (activating a "nested" context).

> 
> I looked at this - only solves one issue and oh boy is that an obtuse
> confusing condition to understand.  We still suffer from recursion in
> reclaim.

Right, we'd have to clear the context before calling into reclaim/compaction
that does weird things.

I'm sure BPF hooks could just arbitrarily try to allocate pages with
kmalloc_nolock(). So that would require a context save/restore as well.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Vlastimil Babka (SUSE) @ 2026-06-18  8:30 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Shakeel Butt
  Cc: JP Kobryn, linux-mm, willy, usama.arif, akpm, mhocko, rostedt,
	mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <3149bc84-dd3a-43ba-826e-6364965fdafd@kernel.org>

On 6/18/26 10:21, David Hildenbrand (Arm) wrote:
> On 6/17/26 20:18, Vlastimil Babka (SUSE) wrote:
>> On 6/17/26 17:03, Shakeel Butt wrote:
>>> On Wed, Jun 17, 2026 at 01:11:16PM +0200, David Hildenbrand (Arm) wrote:
>>>>
>>>> Given that trace events can quickly become stable ABI [1], are we really sure we
>>>> want to add this?
>>>
>>> Yes, I think so as this is useful to get insights into lru cache draining.
>>> Trace events being stable or not is secondary IMHO. If in future we rearchitect
>>> the lru page handling where there is no cache draining anymore, we can make
>>> these a noops.
>> 
>> Yeah and I don't recall ever that a change to a mm tracepoint would ever
>> break someone who'd complain and we'd have to revert it.
> Really? :)
> 
> Read the context of the link I posted once more.

Ah, I see. I've only read the single mail from Steven that referred to the
old powertop breakage and didn't notice the context.

But I don't think these worries should stop us from adding easily usable
tracepoints.

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-18  8:21 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <ajAcIwBAnqgEEWSD@gourry-fedora-PF4VCD3F>

On 6/15/26 17:37, Gregory Price wrote:
> On Mon, Jun 15, 2026 at 05:18:55PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>> > 
>> > I think the memalloc approach is dangerous due to unexpected nesting. There
>> > might be nested page allocations in page allocation itself (due to some
>> > debugging option). But also interrupts do not change what "current" points
>> > to. Suddenly those could start requesting folios and/or private nodes and be
>> > surprised, I'm afraid.
>> 
>> Yeah, we'd need some way to distinguish the main allocation from these other
>> (nested) allocations.
>>
>> 
>> > 
>> > The memalloc scopes only work well when they restrict the context wrt
>> > reclaim, and allocations in IRQ have to be already restricted heavily
>> > (atomic) so further memalloc restrictions don't do anything in practice. But
>> > to make them change other aspects of the allocations like this won't work.
>> 
>> I was assuming that memalloc_pin_save() would already violate that, but really
>> it only restricts where movable allocations land, and that doesn't matter for
>> other kernel allocations.
>> 
>> Do you see any other way to make something like an allocation context work, and
>> avoid introducing more GFP flags?
>>
> 
> One thought would be a way to switch what fallback list is used, and
> then have specific fallback lists for certain contexts.
> 
> Right now there is a single example of this: __GFP_THISNODE
>   |= __GFP_THISNODE   =>  NOFALLBACK
>   &= ~__GFP_THISNODE  =>  FALLBACK
> 
> We could add an interface with the desired fallback list based as an
> argument, and let get_page_from_freelist to prefer that over the default
> global lists.

Does it mean a new argument in a number of functions in the page allocator,
or can it be mapped to alloc_flags (at least internally?), because the
number of possible fallback lists is small enough?

> Omit all special nodes from FALLBACK/NOFALLBACK and make the special
> contexts provide the fallback-base that should be used.
> 
> On my current branch i think that would include modifying, in totality:
> 
>    alloc_folio_mpol()
>    alloc_demotion_folio()
>    alloc_migration_target()
> 
> And i'm pretty sure that all just nests nicely.
> 
> We might not even need memalloc... hmmm
> 
> ~Gregory


^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: David Hildenbrand (Arm) @ 2026-06-18  8:21 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Shakeel Butt
  Cc: JP Kobryn, linux-mm, willy, usama.arif, akpm, mhocko, rostedt,
	mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <1136baf3-3967-4202-9eaa-5fd667c235cf@kernel.org>

On 6/17/26 20:18, Vlastimil Babka (SUSE) wrote:
> On 6/17/26 17:03, Shakeel Butt wrote:
>> On Wed, Jun 17, 2026 at 01:11:16PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> Given that trace events can quickly become stable ABI [1], are we really sure we
>>> want to add this?
>>
>> Yes, I think so as this is useful to get insights into lru cache draining.
>> Trace events being stable or not is secondary IMHO. If in future we rearchitect
>> the lru page handling where there is no cache draining anymore, we can make
>> these a noops.
> 
> Yeah and I don't recall ever that a change to a mm tracepoint would ever
> break someone who'd complain and we'd have to revert it.
Really? :)

Read the context of the link I posted once more.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v3 09/13] verification/rvgen: Delete __parse_constraint()
From: Gabriele Monaco @ 2026-06-18  8:13 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <87tsr1mqrj.fsf@yellow.woof>

On Wed, 2026-06-17 at 11:59 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > This function used to validate things we are no longer validating,
> > now it's
> > alright to create a model where a clock is never reset, which
> > doesn't fully
> > make sense. Should we add that check somewhere else?
> 
> Theory does not require clock reset, right?

Yeah, I don't see it explicitly mandated in the theory, but the
description (from the sources) states:

  The value of a clock thus denotes the amount of time that has been  
elapsed since its last reset

But it also says (emphasis added by me):

  Clocks /can/ be reset to zero after which they start increasing ...

Nowhere it says clocks /must/ be reset, their value simply won't make
sense (according to the definition).

Now in our implementation we may have some automatic reset when the
monitor starts (I'm planning that to avoid invalid states), which could
make explicit resets superfluous in some cases.

Let's leave that to the user for now and skip this check.

Thanks,
Gabriele

>  This is not some sort of
> hidden issue that trips up unsuspecting people. It is obvious from
> the
> model that the clock is never reset. So I think it's fine to allow
> people to do that, maybe there will be an actual useful model without
> clock reset, you never know.
> 
> The self.env_types check is enforced by the grammar. We do lose the
> self.env_types check, but that is likely redundant anyway because we
> have this:
> 
>         for transition in self.transitions:
>             [...]
>             if transition.reset:
>                 envs.append(transition.reset.env)
>                 self.env_stored.add(transition.reset.env)
> 
> so it is clear that all envs that are reset do have a storage.
> 
> That said, I am fine with keeping these sanity checks, if you are
> paranoid.
> 
> Nam


^ permalink raw reply

* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Masami Hiramatsu @ 2026-06-18  1:52 UTC (permalink / raw)
  To: Martin Kaiser; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <ajJbkeK0zXb8MtcS@akranes.kaiser.cx>

On Wed, 17 Jun 2026 10:32:17 +0200
Martin Kaiser <martin@kaiser.cx> wrote:

> Hiramatsu-san,
> 
> thank you for reviewing my patch.
> 
> Thus wrote Masami Hiramatsu (mhiramat@kernel.org):
> 
> > Ah, this is a bit complicated. It seems to work with sched_switch event
> > as commit f04dec93466a ("tracing/eprobes: Fix reading of string fields"):
> 
> > echo 'e:sw sched/sched_switch comm=$next_comm:string' > dynamic_events
> 
> > #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
> > #              | |         |   |||||     |         |
> >               sh-162     [002] d..3.    54.027213: sw: (sched.sched_switch) comm="swapper/2"
> >           <idle>-0       [007] d..3.    54.034573: sw: (sched.sched_switch) comm="rcu_preempt"
> >      rcu_preempt-15      [007] d..3.    54.034589: sw: (sched.sched_switch) comm="swapper/7"
> 
> > Maybe comm is stored as a fixed string information in the event record?
> 
> Yes, this example does not execute my change.
> 
> > /sys/kernel/tracing # cat events/sched/sched_switch/format 
> > name: sched_switch
> > ID: 254
> > format:
> > 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
> > 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
> > 	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
> > 	field:int common_pid;	offset:4;	size:4;	signed:1;
> 
> > 	field:char prev_comm[16];	offset:8;	size:16;	signed:0;
> > 	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
> > 	field:int prev_prio;	offset:28;	size:4;	signed:1;
> > 	field:long prev_state;	offset:32;	size:8;	signed:1;
> > 	field:char next_comm[16];	offset:40;	size:16;	signed:0;
> > 	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
> > 	field:int next_prio;	offset:60;	size:4;	signed:1;
> 
> > But the filename is a pointer.
> 
> > /sys/kernel/tracing # cat events/syscalls/sys_enter_openat/format 
> > name: sys_enter_openat
> > ID: 705
> > format:
> > 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
> > 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
> > 	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
> > 	field:int common_pid;	offset:4;	size:4;	signed:1;
> 
> > 	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
> > 	field:int dfd;	offset:16;	size:8;	signed:0;
> > 	field:const char * filename;	offset:24;	size:8;	signed:0;
> > 	field:int flags;	offset:32;	size:8;	signed:0;
> > 	field:umode_t mode;	offset:40;	size:8;	signed:0;
> > 	field:__data_loc char[] __filename_val;	offset:48;	size:4;	signed:0;
> 
> > In this case, the filename field should use __data_loc directly instead of
> > pointing data on the ring buffer.
> 
> > Can you try 
> 
> > echo 'e syscalls.sys_enter_openat $__filename_val:string' > \
> >  		/sys/kernel/tracing/dynamic_events
> 
> > Instead?
> 
> This field is working as expected.
> 
> I still believe that the handling of FILTER_PTR_STRING is not correct. The
> pointer is stored in the ringbuffer as unsigned long and read as a char. This
> gives us a truncated pointer that cannot be dereferenced.

Ah, OK. I understand the problem.

 - ring buffer and its records should be self-contained.
 - In most cases, events use __data_loc/__rel_loc or fixed array to store
   strings.
 - only syscall events exposes the char *, which is not recommended but
   important to debug user space. (not for dereference)

The example usage of FILTER_PTR_STRING is actually using FILTER_STATIC_STRING
now, so FILTER_PTR_STRING is left broken. (hmm, but there are many
 "const char *" are used especially under rcu events...)

OK, can you update your patch description to use rcu events?

BTW, I think those also should be decoded from enum value in the events,
or use __rel_loc. Since it is not self-contained. (it's a TODO item)

> > I think better solution is fixing sycall tracer.
> 
> I would say that syscall trace is doing the right thing. The ringbuffer entry
> is a struct syscall_trace_enter, the syscall arguments are unsigned longs.
> They are written in ftrace_syscall_enter, this looks correct to me.

OK, I thought the filename points the ringbuffer, but it actually points
the user space. (saving a raw parameter values) So it is OK.

For eprobe users, it should not access to the user space data directly
because it can cause page fault in the kernel without fixup. It may work
on x86, but it doesn't work on other architecture which has separated
address space for user space. To avoid such mistake, it saves actual
string in the ringbuffer as __filename_val.

Hmm, this must be documented in eprobe example code...

> 
> A const char * syscall argument is using FILTER_PTR_STRING, the unsigned long
> argument from the ringbuffer is read as a char and then converted to a
> truncated pointer.


Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v4 6/7] Documentation: bootconfig: document build-time cmdline rendering
From: Masami Hiramatsu @ 2026-06-18  0:47 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, kernel-team
In-Reply-To: <ajJu2KlfVyuUH-VA@gmail.com>

On Wed, 17 Jun 2026 02:56:23 -0700
Breno Leitao <leitao@debian.org> wrote:

> On Wed, Jun 10, 2026 at 07:58:10AM -0700, Breno Leitao wrote:
> > On Wed, Jun 10, 2026 at 11:37:20PM +0900, Masami Hiramatsu wrote:
> > > To avoid confusion, when this option is used, shouldn't we treat it
> > > the same way as if embedded command lines were enabled, and either
> > > not display it in /proc/bootconfig (or always display it, by merging
> > > the rendered string)?
> > 
> > You're right that EMBED_CMDLINE breaks it: the embedded kernel.* keys
> > are already in boot_command_line before setup_boot_config() ever sees
> > the initrd bconf, so a user reading /proc/bootconfig would see only
> > the initrd keys while parse_early_param() acted on the embedded ones.
> > That's exactly the split-state Sashiko was circling around.
> > 
> > Both options you suggest work for me, but they pull in opposite
> > directions and I'd rather not guess wrong on the user-facing
> > contract.  Which do you prefer for v5?
> > 
> >   (a) Don't display embedded in /proc/bootconfig -- keep the current
> >       "file shows the active bootconfig source" behavior and document
> >       that with EMBED_CMDLINE=y, the kernel.* subtree may have been
> >       applied separately via the cmdline.
> > 
> >   (b) Always display embedded by merging the rendered string into
> >       /proc/bootconfig when EMBED_CMDLINE=y, so the file reflects
> >       what was actually applied.
> > 
> > Happy to go either way
> 
> Following up on my own mail rather than leaving it fully open: after
> looking at the code more, I'd like to recommend (a).

Agreed. Sorry for replying late.

> 
> The deciding factor is ordering. EMBED_CMDLINE only works because the
> embedded "kernel" keys are folded into boot_command_line in
> setup_arch(), before parse_early_param() -- which is long before
> setup_boot_config() looks at the initrd.

Yes. Unless doing setup_arch() we can not get initrd image, this means
we don't know whether there is bootconfig or not at that point.

> 
> So for early params the embedded values are necessarily applied first, and an
> initrd bootconfig cannot override them no matter how we present
> /proc/bootconfig. That makes the embedded cmdline behave like a build-time
> CONFIG_CMDLINE rather than a bootconfig source, and (a) is the option that
> describes it honestly: it shows in /proc/cmdline, and /proc/bootconfig keeps
> meaning "the bootconfig tree that was parsed".

Indeed. So I think this EMBED_CMDLINE is more like CMDLINE set by bootconfig
file, instead of embedded string. That is useful for reusing the boot options.
We need to change the explanation and clarify it.

Thus we should those configs mutual exclusive. If user already sets the
CONFIG_CMDLINE, EMBED_CMDLINE should not be enabled.

But actually, there is another options we need to mention:

- CONFIG_CMDLINE: default cmdline, could be ignored if bootloader passes
   a cmdline string.
- CONFIG_CMDLINE_FORCE: ignore the other cmdline. (but bootconfig can
  overwrite it, hmm)
- CONFIG_CMDLINE_EXTEND: append the embedded cmdline string to bootloader
  cmdline. (similar to bootconfig current behavior)

- CONFIG_BOOT_CONFIG_EMBED: just an embedded bootconfig. extends the
  existing cmdline, but does not support early parameters. This is ignored
  if user passed bootconfig via initrd.

- CONFIG_BOOT_CONFIG_EMBED_CMDLINE: replacing CONFIG_CMDLINE with bootconfig
  but it will not shown in /proc/bootconfig.

So you can see CONFIG_BOOT_CONFIG_EMBED_CMDLINE is a bit special.
I think it maybe natual that we call it CONFIG_CMDLINE_BOOT_CONFIG.
In this case, we render the cmdline string from bootconfig build-time
and set CONFIG_CMDLINE with the rendered cmdline string.

> 
> (a) is also what the tree already does -- saved_boot_config is built
> only from the XBC tree, the rendered string never enters it -- so it is
> no new code on the /proc side and keeps the series small.

Agreed.

> 
> (b) would pull the flattened cmdline string back into the structured
> tree view and need dedup against the initrd keys, which muddies what
> /proc/bootconfig means for little gain.

I would like to avoid such complexity, just keep it simple as possible.

> 
> So unless you'd rather have (b), I'll take (a) for v5 and extend
> bootconfig.rst to cover the four sources (bootloader cmdline, embedded
> cmdline, initrd bootconfig, embedded bootconfig).

Yes, I agree with you.

> 
> I'll also document the sharp edge -- with both an embedded cmdline and an
> initrd bootconfig, early params reflect the embedded values because the initrd
> is not parsed yet.

My recommendation is to give simpler mind model to users. If it is simply
extend the CONFIG_CMDLINE which can be described by bootconfig file,
that is more managable outside of kernel configuration.

Or, you would like to access cmdline setting via /proc/bootconfig?
In this case, the problem is a bit more limitation of bootconfig side.

Since the kernel cmdline accepts any contradictory settings, if "foo=A foo=B"
are passed, bootconfig will make an error because foo has 2 different
settings.
Typically, this is represented as an array in bootconfig.

 foo = A, B;

But if cmdline bootconfig says:

 foo = A;

and initrd bootconfig says:

 foo := B;

":=" means overriding the previous settings. Thus a contradiction
arises between these two, when rendering /proc/bootconfig. It can not
show 2 different settings for the same key. (it is possible if we
render it twice, but /proc/bootconfig user may not expect it.)

I think it's fine to represent it as an array (foo = A, B) if this
ENBED_CMDLINE is set, but it still seems risky if early parameters
aren't detected. If `early_param = A` is set in the embedded
bootconfig, and accidentally initrd bootconfig sets `early_param = B`
we should ignore latter one (with warning). But maybe it is another
story.

I think we can proceed it without rendering it in /proc/bootconfig
at this point. And later we find the way to detect early parameters
correctly, we can fix it.

(BTW, early parameter problem is a bit complicated. It is not hard
to distinguish early parameters, but kernel accepts the same key
for early parameter and normal parameter. e.g. "console=")

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH 2/2] selftests/ftrace: Account for 8-byte aligned trace_marker_raw events
From: Shuah Khan @ 2026-06-17 23:19 UTC (permalink / raw)
  To: Steven Rostedt, Hui Wang
  Cc: mhiramat, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
	wangfushuai, linux-kselftest, Shuah Khan
In-Reply-To: <20260608125049.092d4543@fedora>

On 6/8/26 10:50, Steven Rostedt wrote:
> On Sun,  7 Jun 2026 15:24:31 +0800
> Hui Wang <hui.wang@canonical.com> wrote:
> 
>> trace_marker_raw.tc assumes that the raw marker payload length
>> reported in trace_pipe is the result of int((id + 3) / 4) * 4, but
>> that is not true on kernels with CONFIG_HAVE_64BIT_ALIGNED_ACCESS
>> enabled.
>>
>> With forced 8-byte alignment, the ring buffer event forces 8-byte
>> alignment. The event length is stored in array[0], the payload data
>> and id are placed in a struct raw_data_entry which is stored starting
>> at array[1]. In this case, the printed payload data length is 8*N+4
>> bytes.
>>
>> To make the testcase pass in this case, add a kconfig_enabled() helper
>> and use it to detect CONFIG_HAVE_64BIT_ALIGNED_ACCESS so
>> trace_marker_raw.tc can calculate the expected length correctly.
>>
>> Assisted-by: Copilot:gpt-5.5
>> Signed-off-by: Hui Wang <hui.wang@canonical.com>
> 
> NACK
> 
> Let's not change the kernel for a broken test. Also this has already
> been fixed but appears not to be applied yet.
> 
> Shuah, can you please apply the below fix.
> 
>    https://lore.kernel.org/all/20260601023251.1916483-1-dtcccc@linux.alibaba.com/

I applied the above to linux-kselftest next - will send it up later
this week to for Linux 7.2-rc1

thanks,
-- Shuah

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: Shuah Khan @ 2026-06-17 22:22 UTC (permalink / raw)
  To: Masami Hiramatsu (Google), Steven Rostedt
  Cc: CaoRuichuang, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest, Shuah Khan
In-Reply-To: <20260601133146.b16b0ad7c2204adcc168c945@kernel.org>

On 5/31/26 22:31, Masami Hiramatsu (Google) wrote:
> On Tue, 7 Apr 2026 20:37:27 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>>
>> Shuah,
>>
>> Care to take this through your tree. Probably could even add:
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 8b55572e51805 ("tracing/selftests: Add tracefs mount options test")
>>
>> As well as:
>>
>> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>>
> 
> Shuah, here is my ack too.
> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 

Thanks - sorry for the delay. I will send this up for Linux 7.2-rc1

thanks,
-- Shuah

^ permalink raw reply

* Re: [GIT PULL v2] RTLA additional fixes for v7.2
From: Steven Rostedt @ 2026-06-17 20:37 UTC (permalink / raw)
  To: Tomas Glozar; +Cc: LKML, linux-trace-kernel
In-Reply-To: <20260617153045.546686-1-tglozar@redhat.com>

On Wed, 17 Jun 2026 17:30:45 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> Steven,
> 
> The following changes since commit 6b5a2b7d9bc156e505f09e698d85d6a1547c1206:
> 
>   Merge tag 'trace-tools-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace (2026-06-16 17:50:34 +0530)
> 
> are available in the Git repository at:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git tags/rtla-v7.2-fixups-v2
> 
> for you to fetch changes up to c35eb77a67515d4201bc91294f40761591f43bbd:
> 
>   rtla/tests: Fix pgrep filter in get_workload_pids.sh (2026-06-17 16:26:44 +0200)

Thanks,

As these are fixes they can still go in later in the merge window. This
week I'm really hurting for free time to work on upstream so this may
have to wait until next week. I don't want to rush and screw up again :-/

-- Steve

^ permalink raw reply

* [PATCH] usb: typec: add trace point for typec_set_mode
From: Ahmad Fatoum @ 2026-06-17 20:03 UTC (permalink / raw)
  To: Heikki Krogerus, Greg Kroah-Hartman, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-kernel, linux-usb, linux-trace-kernel, kernel, Ahmad Fatoum

Some Type-C controllers toggle muxes themselves. Other controllers like
the TUSB320 report the mode to the host, so it can control the muxes.

To improve debuggability of both kinds of drivers, add a trace point that
can be used to keep track of the mode being set inside the Type-C
framework:

  echo 1 > /sys/kernel/debug/tracing/events/typec/typec_mode/enable

Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 MAINTAINERS                  |  1 +
 drivers/usb/typec/class.c    |  9 ++++++++-
 include/trace/events/typec.h | 36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index c8d4b913f26c..ddd59e5e6eaf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27753,6 +27753,7 @@ F:	Documentation/ABI/testing/sysfs-class-typec
 F:	Documentation/driver-api/usb/typec.rst
 F:	drivers/usb/typec/
 F:	include/linux/usb/typec.h
+F:	include/trace/events/typec*.h
 
 USB TYPEC INTEL PMC MUX DRIVER
 M:	Heikki Krogerus <heikki.krogerus@linux.intel.com>
diff --git a/drivers/usb/typec/class.c b/drivers/usb/typec/class.c
index 0977581ad1b6..9316d067f19a 100644
--- a/drivers/usb/typec/class.c
+++ b/drivers/usb/typec/class.c
@@ -20,6 +20,9 @@
 #include "class.h"
 #include "pd.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/typec.h>
+
 static DEFINE_IDA(typec_index_ida);
 
 const struct class typec_class = {
@@ -2427,10 +2430,14 @@ EXPORT_SYMBOL_GPL(typec_get_orientation);
 int typec_set_mode(struct typec_port *port, int mode)
 {
 	struct typec_mux_state state = { };
+	int ret;
 
 	state.mode = mode;
 
-	return typec_mux_set(port->mux, &state);
+	ret = typec_mux_set(port->mux, &state);
+	trace_typec_mode(port, mode, ret);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(typec_set_mode);
 
diff --git a/include/trace/events/typec.h b/include/trace/events/typec.h
new file mode 100644
index 000000000000..a7dcb9f3fd49
--- /dev/null
+++ b/include/trace/events/typec.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM typec
+
+#if !defined(_TRACE_TYPEC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TYPEC_H
+
+#include <linux/usb/typec.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(typec_mode,
+
+	TP_PROTO(struct typec_port *port, int mode, int err),
+
+	TP_ARGS(port, mode, err),
+
+	TP_STRUCT__entry(
+		__string(device, dev_name(&port->dev))
+		__field(int, mode)
+		__field(int, err)
+	),
+
+	TP_fast_assign(
+		__assign_str(device);
+		__entry->mode = mode;
+		__entry->err = err;
+	),
+
+	TP_printk("%s mode=%d (%d)",
+		  __get_str(device), __entry->mode, __entry->err)
+);
+
+#endif /* if !defined(_TRACE_TYPEC_H) || defined(TRACE_HEADER_MULTI_READ) */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

---
base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
change-id: 20260617-typec_set_mode-tracepoint-011fc43feaca

Best regards,
--  
Ahmad Fatoum <a.fatoum@pengutronix.de>


^ permalink raw reply related

* Re: [RFC PATCH 3/3] mm/compaction: respect compact_unevictable_allowed in alloc_contig path
From: Vlastimil Babka (SUSE) @ 2026-06-17 18:57 UTC (permalink / raw)
  To: Wandun Chen, linux-mm, linux-kernel, linux-trace-kernel,
	linux-rt-devel
  Cc: akpm, surenb, mhocko, jackmanb, hannes, ziy, rostedt, mhiramat,
	mathieu.desnoyers, david, ljs, liam, rppt, bigeasy, clrkwllms,
	Alexander.Krabler
In-Reply-To: <20260604023812.3700316-4-chenwandun1@gmail.com>

On 6/4/26 04:38, Wandun Chen wrote:
> From: Wandun Chen <chenwandun@lixiang.com>
> 
> vm.compact_unevictable_allowed=0 is used to prevent compacting
> unevictable pages. However, isolate_migratepages_range() passes
> ISOLATE_UNEVICTABLE regardless of this sysctl, so the setting
> has no effect in the alloc_contig path.
> 
> Fix it by:
>   - Keep ISOLATE_UNEVICTABLE for CMA allocation, discussed in [1].
>   - Honour sysctl_compact_unevictable_allowed for non-CMA allocation.
> 
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
> Link: https://lore.kernel.org/all/25ba0d77-eb61-4efc-b2fc-73878cbd85c1@suse.cz/ [1]

There was also the "Ideally by not having mlock'd pages in CMA areas at
all." part. Is it the case? It was more elaborated here:
https://lore.kernel.org/all/CAPTztWZpnX1j8-7yeppVUsxE=O9hbVeqricDjZt8_pnN7a-kBQ@mail.gmail.com/

> ---
>  include/linux/compaction.h | 6 ++++++
>  mm/compaction.c            | 9 +++++++--
>  mm/internal.h              | 1 +
>  mm/page_alloc.c            | 2 ++
>  4 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index f29ef0653546..04e60f65b976 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -106,6 +106,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>  extern void __meminit kcompactd_run(int nid);
>  extern void __meminit kcompactd_stop(int nid);
>  extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx);
> +extern bool compaction_allow_unevictable(void);
>  
>  #else
>  static inline void reset_isolation_suitable(pg_data_t *pgdat)
> @@ -131,6 +132,11 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat,
>  {
>  }
>  
> +static inline bool compaction_allow_unevictable(void)
> +{
> +	return true;
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>  
>  struct node;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 007d5e00a8ae..a10acb273454 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1341,6 +1341,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
>  							unsigned long end_pfn)
>  {
>  	unsigned long pfn, block_start_pfn, block_end_pfn;
> +	isolate_mode_t mode = cc->allow_unevictable ? ISOLATE_UNEVICTABLE : 0;
>  	int ret = 0;
>  
>  	/* Scan block by block. First and last block may be incomplete */
> @@ -1360,8 +1361,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
>  					block_end_pfn, cc->zone))
>  			continue;
>  
> -		ret = isolate_migratepages_block(cc, pfn, block_end_pfn,
> -						 ISOLATE_UNEVICTABLE);
> +		ret = isolate_migratepages_block(cc, pfn, block_end_pfn, mode);
>  
>  		if (ret)
>  			break;
> @@ -1902,6 +1902,11 @@ typedef enum {
>   * compactable pages.
>   */
>  static int sysctl_compact_unevictable_allowed __read_mostly = CONFIG_COMPACT_UNEVICTABLE_DEFAULT;
> +
> +bool compaction_allow_unevictable(void)
> +{
> +	return sysctl_compact_unevictable_allowed;
> +}
>  /*
>   * Tunable for proactive compaction. It determines how
>   * aggressively the kernel should compact memory in the
> diff --git a/mm/internal.h b/mm/internal.h
> index 181e79f1d6a2..163f9d6b37f3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1052,6 +1052,7 @@ struct compact_control {
>  					 * ensure forward progress.
>  					 */
>  	bool alloc_contig;		/* alloc_contig_range allocation */
> +	bool allow_unevictable;		/* Allow isolation of unevictable folios */
>  };
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 81a9d4d1e6c0..1cf9d4a3b14c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7118,6 +7118,8 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
>  		.ignore_skip_hint = true,
>  		.no_set_skip_hint = true,
>  		.alloc_contig = true,
> +		.allow_unevictable = !!(alloc_flags & ACR_FLAGS_CMA) ||
> +					     compaction_allow_unevictable(),
>  	};
>  	INIT_LIST_HEAD(&cc.migratepages);
>  	enum pb_isolate_mode mode = (alloc_flags & ACR_FLAGS_CMA) ?


^ permalink raw reply

* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Vlastimil Babka (SUSE) @ 2026-06-17 18:52 UTC (permalink / raw)
  To: Wandun Chen, linux-mm, linux-kernel, linux-trace-kernel,
	linux-rt-devel
  Cc: akpm, surenb, mhocko, jackmanb, hannes, ziy, rostedt, mhiramat,
	mathieu.desnoyers, david, ljs, liam, rppt, bigeasy, clrkwllms,
	Alexander.Krabler, Hugh Dickins
In-Reply-To: <20260604023812.3700316-2-chenwandun1@gmail.com>

On 6/4/26 04:38, Wandun Chen wrote:
> From: Wandun Chen <chenwandun@lixiang.com>
> 
> compact_unevictable_allowed is default 0 under PREEMPT_RT,
> isolate_migratepages_block() skips folios with PG_unevictable set.
> However, mlock_folio() sets PG_mlocked immediately but defers
> PG_unevictable to mlock_folio_batch(), result in a folio with
> PG_mlocked=1 but PG_unevictable=0. Compaction will isolate such a
> folio.
> 
> Fix by checking folio_test_mlocked() together with the existing
> folio_test_unevictable() check.
> 
> A similar issue has been reported by Alexander Krabler on a 6.12-rt
> aarch64 system. Vlastimil suggested to check the mlocked flag [1].
> 
> Reported-by: Alexander Krabler <Alexander.Krabler@kuka.com>
> Closes: https://lore.kernel.org/all/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
> Link: https://lore.kernel.org/all/33275585-f2db-4779-89f0-3ae24b455a67@suse.cz/ [1]

Well in that thread, Hugh doubted my suggestion and then it seems we didn't
concluded anything. Did you actually in practice observe the issue that
Alexander had, and that this patch fixed it, or is that theoretical?

> ---
>  mm/compaction.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b776f35ad020..7e07b792bcb5 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1116,7 +1116,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		is_unevictable = folio_test_unevictable(folio);
>  
>  		/* Compaction might skip unevictable pages but CMA takes them */
> -		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
> +		if (!(mode & ISOLATE_UNEVICTABLE) &&
> +		    (is_unevictable || folio_test_mlocked(folio)))
>  			goto isolate_fail_put;
>  
>  		/*


^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Vlastimil Babka (SUSE) @ 2026-06-17 18:18 UTC (permalink / raw)
  To: Shakeel Butt, David Hildenbrand (Arm)
  Cc: JP Kobryn, linux-mm, willy, usama.arif, akpm, mhocko, rostedt,
	mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <ajK1YsIJmD2ImbAk@linux.dev>

On 6/17/26 17:03, Shakeel Butt wrote:
> On Wed, Jun 17, 2026 at 01:11:16PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/10/26 21:52, JP Kobryn wrote:
>> > LRU add batches can be drained before they reach capacity. This can be a
>> > source of LRU lock contention, but it is not currently possible to
>> > attribute these drains to callers with existing tracepoints.
>> > 
>> > Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>> > lru_add batch is drained. This allows tracing to distinguish full drains
>> > from partial drains and attribute them to the calling stack.
>> > 
>> > Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
>> > whether they set the force flag for all CPUs. The tracepoint resembles
>> > the signature of the enclosing function, but is needed because of
>> > potential inlining.
>> > 
>> > Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>> > ---
>> >  include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
>> >  mm/swap.c                      |  7 ++++++-
>> >  2 files changed, 43 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>> > index 171524d3526d..ff3da07ccb40 100644
>> > --- a/include/trace/events/pagemap.h
>> > +++ b/include/trace/events/pagemap.h
>> > @@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
>> >  	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>> >  );
>> >  
>> > +TRACE_EVENT(mm_lru_add_drain,
>> > +
>> > +	TP_PROTO(int cpu, unsigned int nr),
>> > +
>> > +	TP_ARGS(cpu, nr),
>> > +
>> > +	TP_STRUCT__entry(
>> > +		__field(int,		cpu	)
>> > +		__field(unsigned int,	nr	)
>> > +	),
>> > +
>> > +	TP_fast_assign(
>> > +		__entry->cpu	= cpu;
>> > +		__entry->nr	= nr;
>> > +	),
>> > +
>> > +	TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>> > +);
>> > +
>> > +TRACE_EVENT(mm_lru_add_drain_all,
>> > +
>> > +	TP_PROTO(bool force_all_cpus),
>> > +
>> > +	TP_ARGS(force_all_cpus),
>> > +
>> > +	TP_STRUCT__entry(
>> > +		__field(bool,	force_all_cpus	)
>> > +	),
>> > +
>> > +	TP_fast_assign(
>> > +		__entry->force_all_cpus	= force_all_cpus;
>> > +	),
>> > +
>> > +	TP_printk("force_all_cpus=%s",
>> > +		__entry->force_all_cpus ? "true" : "false")
>> > +);
>> > +
>> >  #endif /* _TRACE_PAGEMAP_H */
>> >  
>> >  /* This part must be outside protection */
>> > diff --git a/mm/swap.c b/mm/swap.c
>> > index 588f50d8f1a8..e14b7612f896 100644
>> > --- a/mm/swap.c
>> > +++ b/mm/swap.c
>> > @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>> >  {
>> >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>> >  	struct folio_batch *fbatch = &fbatches->lru_add;
>> > +	unsigned int nr_folios_add = folio_batch_count(fbatch);
>> >  
>> > -	if (folio_batch_count(fbatch))
>> > +	if (nr_folios_add) {
>> >  		folio_batch_move_lru(fbatch, lru_add);
>> > +		trace_mm_lru_add_drain(cpu, nr_folios_add);
>> > +	}
>> >  
>> >  	fbatch = &fbatches->lru_move_tail;
>> >  	/* Disabling interrupts below acts as a compiler barrier. */
>> > @@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>> >  	if (WARN_ON(!mm_percpu_wq))
>> >  		return;
>> >  
>> > +	trace_mm_lru_add_drain_all(force_all_cpus);
>> > +
>> >  	/*
>> >  	 * Guarantee folio_batch counter stores visible by this CPU
>> >  	 * are visible to other CPUs before loading the current drain
>> 
>> Given that trace events can quickly become stable ABI [1], are we really sure we
>> want to add this?
> 
> Yes, I think so as this is useful to get insights into lru cache draining.
> Trace events being stable or not is secondary IMHO. If in future we rearchitect
> the lru page handling where there is no cache draining anymore, we can make
> these a noops.

Yeah and I don't recall ever that a change to a mm tracepoint would ever
break someone who'd complain and we'd have to revert it. These are niche
enough. So I think the risk is low.

>> 
>> [1] https://lore.kernel.org/r/20260603130006.7d2c4a62@gandalf.local.home
>> 
>> -- 
>> Cheers,
>> 
>> David


^ permalink raw reply

* Re: [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: Wen Yang @ 2026-06-17 17:11 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel
In-Reply-To: <2bcfa0bda551c0e1ba137b728dbe7886ff5c2579.camel@redhat.com>



On 6/17/26 23:41, Gabriele Monaco wrote:
> On Tue, 2026-06-16 at 00:44 +0800, wen.yang@linux.dev wrote:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> We occasionally hit a lockdep "Invalid wait context" warning in
>> production
>> environments when rv_react() callbacks are interrupted.
>>
>> The bug is intermittent in production. KUnit tests with busy-wait
>> callbacks
>> can reproduce it by holding the CPU long enough for a timer interrupt
>> to fire
>> during rv_react(), exposing the lockdep constraint violation:
>>
>> [   44.820913] =============================
>> [   44.820923] [ BUG: Invalid wait context ]
>> [   44.821137] 7.1.0-rc7-next-20260612-virtme #6 Tainted:
>> G                 N
>> [   44.821203] -----------------------------
> 
> It's nice to have reactors kunit coverage, I need to go through them
> more carefully but I like the idea.
> 
> Are those tests supposed to trigger this issue though? Under what
> configuration?
> 
> I reverted the lockdep fix and run the tests in vng on both x86_64 and
> arm64, both preempt_rt and not but I see no splat.
> Repeating the tests multiple times from debugfs also didn't seem to
> help. Both machines were relatively large (128 and 48 CPUs).
> 
> The config was the bare vng one with kunit built-in, lockdep and the
> reactors tests.
> 
> What am I missing?
> 

Thank you for your feedback.
I am using a WSL dev environment with 12 cores and 16GB. The config of 
the tested kernel code is as follows:


$ make savedefconfig

$ cat defconfig
CONFIG_WERROR=y
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_LOG_BUF_SHIFT=18
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_MISC=y
CONFIG_CGROUP_DEBUG=y
CONFIG_NAMESPACES=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_EXPERT=y
CONFIG_PROFILING=y
CONFIG_KEXEC=y
CONFIG_SMP=y
CONFIG_IOSF_MBI=y
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_NUMA=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_MIXED=y
CONFIG_HZ_1000=y
CONFIG_HIBERNATION=y
CONFIG_PM_DEBUG=y
CONFIG_PM_TRACE_RTC=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_BGRT=y
CONFIG_IA32_EMULATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
# CONFIG_SCHED_MC is not set
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_BLK_CGROUP_IOLATENCY=y
CONFIG_BLK_CGROUP_IOCOST=y
CONFIG_BLK_CGROUP_IOPRIO=y
CONFIG_BINFMT_MISC=y
# CONFIG_COMPAT_BRK is not set
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_ZONE_DEVICE=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_IPV6 is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_SCHED=y
CONFIG_NET_CLS_CGROUP=y
CONFIG_NET_EMATCH=y
CONFIG_NET_CLS_ACT=y
CONFIG_DNS_RESOLVER=y
CONFIG_CGROUP_NET_PRIO=y
# CONFIG_WIRELESS is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI=y
CONFIG_PCCARD=y
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_DEBUG_DEVRES=y
CONFIG_CONNECTOR=y
CONFIG_FW_CFG_SYSFS=y
CONFIG_FW_CFG_SYSFS_CMDLINE=y
# CONFIG_EFI_DISABLE_RUNTIME is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_VIRTIO_BLK=y
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_SG=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_VIRTIO=y
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
CONFIG_ATA_PIIX=y
CONFIG_PATA_AMD=y
CONFIG_PATA_OLDPIIX=y
CONFIG_PATA_SCH=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_BLK_DEV_DM=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_NETCONSOLE=y
CONFIG_VIRTIO_NET=y
# CONFIG_ETHERNET is not set
CONFIG_PHYLIB=y
CONFIG_REALTEK_PHY=y
# CONFIG_WLAN is not set
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_EVDEV=y
CONFIG_INPUT_JOYSTICK=y
CONFIG_INPUT_TABLET=y
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_INPUT_MISC=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
CONFIG_NVRAM=y
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
CONFIG_I2C_I801=y
CONFIG_PTP_1588_CLOCK=y
CONFIG_WATCHDOG=y
CONFIG_I6300ESB_WDT=y
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_DRM=y
# CONFIG_DRM_FBDEV_EMULATION is not set
CONFIG_DRM_BOCHS=y
CONFIG_DRM_VIRTIO_GPU=y
CONFIG_FB=y
CONFIG_FB_VESA=y
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_HRTIMER=y
CONFIG_SND_SEQUENCER=y
CONFIG_SND_SEQ_DUMMY=y
# CONFIG_SND_DRIVERS is not set
CONFIG_SND_INTEL8X0=y
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_INTEL=y
CONFIG_SND_HDA_CODEC_REALTEK=y
# CONFIG_SND_PCMCIA is not set
# CONFIG_SND_X86 is not set
# CONFIG_HID is not set
CONFIG_RTC_CLASS=y
CONFIG_DMADEVICES=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_VIRTIO_INPUT=y
CONFIG_VIRTIO_MMIO=y
CONFIG_EEEPC_LAPTOP=y
CONFIG_ACPI_WMI=y
CONFIG_MAILBOX=y
CONFIG_PCC=y
CONFIG_AMD_IOMMU=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_IRQ_REMAP=y
CONFIG_VIRTIO_IOMMU=y
CONFIG_FS_DAX=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_QFMT_V2=y
CONFIG_FUSE_FS=y
CONFIG_VIRTIO_FS=y
CONFIG_OVERLAY_FS=y
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_PROC_KCORE=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_SQUASHFS=y
CONFIG_SQUASHFS_XZ=y
CONFIG_SQUASHFS_ZSTD=y
CONFIG_9P_FS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
CONFIG_NLS_UTF8=y
CONFIG_KEYS=y
CONFIG_SECURITYFS=y
CONFIG_CRYPTO_AUTHENC=y
CONFIG_CRYPTO_RSA=y
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_SHA256=y
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_X509_CERTIFICATE_PARSER=y
CONFIG_PKCS7_MESSAGE_PARSER=y
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_PRINTK_TIME=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_WX=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_ATOMIC=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_CSD_LOCK_WAIT_DEBUG=y
CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT=y
CONFIG_DEBUG_KOBJECT=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_RV=y
CONFIG_RV_MON_WWNR=y
CONFIG_RV_MON_RTAPP=y
CONFIG_RV_MON_STALL=y
CONFIG_RV_MON_DEADLINE=y
CONFIG_RV_REACT_PRINTK_KUNIT=y
CONFIG_RV_REACT_PANIC_KUNIT=y
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_DEBUG_BOOT_PARAMS=y
CONFIG_DEBUG_ENTRY=y
CONFIG_KUNIT=y
# CONFIG_KUNIT_DEBUGFS is not set


And then, using vng to build and run kselftests (since kunit is already 
built-in) can reproduce this issue:

$ vng --build

$ vng -v --run arch/x86/boot/bzImage --user root -- 
tools/testing/selftests/verification/verificationtest-ktap


--
Best wishes,
Wen


> Thanks,
> Gabriele
> 
>> [   44.821211] kunit_try_catch/209 is trying to lock:
>> [   44.821244] ffff8a743ed3e8a0 (&rq->__lock){-...}-{2:2}, at:
>> __schedule+0x102/0x13d0
>> [   44.821688] other info that might help us debug this:
>> [   44.821708] context-{5:5}
>> [   44.821730] 1 lock held by kunit_try_catch/209:
>> [   44.821745]  #0: ffffffffb6ba62c0 (rv_react_map-wait-type-
>> override){+.+.}-{1:1}, at: rv_react+0x9d/0xf0
>> [   44.821803] stack backtrace:
>> [   44.822110] CPU: 10 UID: 0 PID: 209 Comm: kunit_try_catch Tainted:
>> G                 N  7.1.0-rc7-next-20260612-virtme #6
>> PREEMPT_{RT,(full)}
>> [   44.822197] Tainted: [N]=TEST
>> [   44.822210] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX,
>> arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [   44.822328] Call Trace:
>> [   44.822377]  <TASK>
>> [   44.822806]  dump_stack_lvl+0x78/0xe0
>> [   44.822860]  __lock_acquire+0x926/0x1c90
>> [   44.822888]  lock_acquire+0xd3/0x310
>> [   44.822901]  ? __schedule+0x102/0x13d0
>> [   44.822919]  ? rcu_qs+0x2d/0x1a0
>> [   44.822954]  _raw_spin_lock_nested+0x36/0x50
>> [   44.822966]  ? __schedule+0x102/0x13d0
>> [   44.822979]  __schedule+0x102/0x13d0
>> [   44.822993]  ? mark_held_locks+0x40/0x70
>> [   44.823009]  preempt_schedule_irq+0x37/0x70
>> [   44.823018]  irqentry_exit+0x1da/0x8c0
>> [   44.823032]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [   44.823093] RIP: 0010:mock_printk_react+0x2a/0x50
>> [   44.823250] Code: f3 0f 1e fa 0f 1f 44 00 00 41 54 49 89 f4 55 48
>> 89 fd 53 e8 18 8b db ff 4c 89 e6 48 89 ef 48 89 c3 e8 fa 8e ed ff eb
>> 02 f3 90 <e8> 01 8b db ff 48 29 d8 48 3d 3f 4b 4c 00 76 ee 5b 5d 41
>> 5c c3 cc
>> [   44.823303] RSP: 0018:ffffd1c3c0733d38 EFLAGS: 00000297
>> [   44.823332] RAX: 00000000000119f3 RBX: 0000000a74e60d1c RCX:
>> 000000000000001f
>> [   44.823342] RDX: 0000000000000000 RSI: 000000003348c8a2 RDI:
>> ffffffffc1abbfd9
>> [   44.823351] RBP: ffffffffb671b613 R08: 0000000000000002 R09:
>> 0000000000000000
>> [   44.823359] R10: 0000000000000001 R11: 0000000000000000 R12:
>> ffffd1c3c0733d60
>> [   44.823367] R13: ffffffffb575a5fd R14: ffffd1c3c0017be8 R15:
>> ffffd1c3c00179f8
>> [   44.823397]  ? rv_react+0x9d/0xf0
>> [   44.823437]  ? mock_printk_react+0x2f/0x50
>> [   44.823448]  rv_react+0xb4/0xf0
>> [   44.823455]  ? rv_react+0x9d/0xf0
>> [   44.823476]  test_printk_react_called+0x83/0xb0
>> [   44.823486]  ? __pfx_mock_printk_react+0x10/0x10
>> [   44.823502]  ? __pfx_mock_printk_react+0x10/0x10
>> [   44.823513]  kunit_try_run_case+0x97/0x190
>> [   44.823534]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
>> [   44.823544]  kunit_generic_run_threadfn_adapter+0x21/0x40
>> [   44.823551]  kthread+0x124/0x160
>> [   44.823562]  ? __pfx_kthread+0x10/0x10
>> [   44.823574]  ret_from_fork+0x291/0x3b0
>> [   44.823585]  ? __pfx_kthread+0x10/0x10
>> [   44.823595]  ret_from_fork_asm+0x1a/0x30
>> [   44.823641]  </TASK>
>>
>>
>> Patch 1 fixes the lockdep bug by correcting rv_react()'s
>> wait_type_inner
>> from LD_WAIT_CONFIG (which inherits the outer context) to
>> LD_WAIT_SPIN
>> (the tightest constraint callbacks must satisfy).
>>
>> Patch 2 adds KUnit tests for reactor_printk. The busy-wait in the
>> mock
>> callback reproduces the timer interrupt scenario that exposes the
>> bug.
>>
>> Patch 3 adds KUnit tests for reactor_panic, exercising the panic
>> notifier
>> chain without halting the system.
>>
>> Tested with CONFIG_PROVE_LOCKING=y and CONFIG_KUNIT=y.
>>
>>
>> Wen Yang (3):
>>    rv/reactors: fix lockdep "Invalid wait context" in rv_react()
>>    rv/reactors: add KUnit tests for reactor_printk
>>    rv/reactors: add KUnit tests for reactor_panic
>>
>>   kernel/trace/rv/Kconfig                |  20 ++++
>>   kernel/trace/rv/Makefile               |   2 +
>>   kernel/trace/rv/reactor_panic_kunit.c  | 106 +++++++++++++++++++++
>>   kernel/trace/rv/reactor_printk_kunit.c | 123
>> +++++++++++++++++++++++++
>>   kernel/trace/rv/rv_reactors.c          |   8 +-
>>   5 files changed, 258 insertions(+), 1 deletion(-)
>>   create mode 100644 kernel/trace/rv/reactor_panic_kunit.c
>>   create mode 100644 kernel/trace/rv/reactor_printk_kunit.c
> 

^ permalink raw reply

* Re: [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: Gabriele Monaco @ 2026-06-17 16:14 UTC (permalink / raw)
  To: Nam Cao, wen.yang; +Cc: linux-trace-kernel, linux-kernel
In-Reply-To: <874ij16u6i.fsf@yellow.woof>

On Wed, 2026-06-17 at 17:52 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > Are those tests supposed to trigger this issue though? Under what
> > configuration?
> > 
> > I reverted the lockdep fix and run the tests in vng on both x86_64
> > and arm64, both preempt_rt and not but I see no splat.
> > Repeating the tests multiple times from debugfs also didn't seem to
> > help. Both machines were relatively large (128 and 48 CPUs).
> > 
> > The config was the bare vng one with kunit built-in, lockdep and
> > the reactors tests.
> > 
> > What am I missing?
> 
> I haven't tried to reproduce it, but seems quite rare. From the look
> of it, adding some delay into the reactor function should make the
> issue more easily reproducible.

Yeah the tests should be doing that, but even increasing the delay
didn't help. I should probably try on physical machines to have more
likely interrupts but at least the tick should be running.


^ permalink raw reply

* Re: [PATCH 1/3] rv/reactors: fix lockdep "Invalid wait context" in rv_react()
From: Nam Cao @ 2026-06-17 15:58 UTC (permalink / raw)
  To: wen.yang, Gabriele Monaco
  Cc: linux-trace-kernel, linux-kernel, Wen Yang, Thomas Weißschuh
In-Reply-To: <bc01343ae74acf6bdf142434aeaa4e6b40aa72a9.1781541556.git.wen.yang@linux.dev>

wen.yang@linux.dev writes:
>  void rv_react(struct rv_monitor *monitor, const char *msg, ...)
>  {
> -	static DEFINE_WAIT_OVERRIDE_MAP(rv_react_map, LD_WAIT_FREE);
> +#ifdef CONFIG_LOCKDEP
> +	static struct lockdep_map rv_react_map = {
> +		.name = "rv_react",
> +		.wait_type_outer = LD_WAIT_FREE,
> +		.wait_type_inner = LD_WAIT_SPIN,
> +	};
> +#endif
>  	va_list args;
>  
>  	if (!rv_reacting_on() || !monitor->react)

From my limited understanding of lockdep, this looks fine to me. It now
will not warn us if reactor takes a raw_spin_lock, but I think it's fine.

But I would wait for Thomas's thought on this. He will be back next
week.

Nam

^ permalink raw reply

* Re: [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: Nam Cao @ 2026-06-17 15:52 UTC (permalink / raw)
  To: Gabriele Monaco, wen.yang; +Cc: linux-trace-kernel, linux-kernel
In-Reply-To: <2bcfa0bda551c0e1ba137b728dbe7886ff5c2579.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Are those tests supposed to trigger this issue though? Under what
> configuration?
>
> I reverted the lockdep fix and run the tests in vng on both x86_64 and
> arm64, both preempt_rt and not but I see no splat.
> Repeating the tests multiple times from debugfs also didn't seem to
> help. Both machines were relatively large (128 and 48 CPUs).
>
> The config was the bare vng one with kunit built-in, lockdep and the
> reactors tests.
>
> What am I missing?

I haven't tried to reproduce it, but seems quite rare. From the look of
it, adding some delay into the reactor function should make the issue
more easily reproducible.

Nam

^ permalink raw reply

* Re: [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: Gabriele Monaco @ 2026-06-17 15:41 UTC (permalink / raw)
  To: wen.yang; +Cc: Nam Cao, linux-trace-kernel, linux-kernel
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>

On Tue, 2026-06-16 at 00:44 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> We occasionally hit a lockdep "Invalid wait context" warning in
> production
> environments when rv_react() callbacks are interrupted.
> 
> The bug is intermittent in production. KUnit tests with busy-wait
> callbacks
> can reproduce it by holding the CPU long enough for a timer interrupt
> to fire
> during rv_react(), exposing the lockdep constraint violation:
> 
> [   44.820913] =============================
> [   44.820923] [ BUG: Invalid wait context ]
> [   44.821137] 7.1.0-rc7-next-20260612-virtme #6 Tainted:
> G                 N
> [   44.821203] -----------------------------

It's nice to have reactors kunit coverage, I need to go through them
more carefully but I like the idea.

Are those tests supposed to trigger this issue though? Under what
configuration?

I reverted the lockdep fix and run the tests in vng on both x86_64 and
arm64, both preempt_rt and not but I see no splat.
Repeating the tests multiple times from debugfs also didn't seem to
help. Both machines were relatively large (128 and 48 CPUs).

The config was the bare vng one with kunit built-in, lockdep and the
reactors tests.

What am I missing?

Thanks,
Gabriele

> [   44.821211] kunit_try_catch/209 is trying to lock:
> [   44.821244] ffff8a743ed3e8a0 (&rq->__lock){-...}-{2:2}, at:
> __schedule+0x102/0x13d0
> [   44.821688] other info that might help us debug this:
> [   44.821708] context-{5:5}
> [   44.821730] 1 lock held by kunit_try_catch/209:
> [   44.821745]  #0: ffffffffb6ba62c0 (rv_react_map-wait-type-
> override){+.+.}-{1:1}, at: rv_react+0x9d/0xf0
> [   44.821803] stack backtrace:
> [   44.822110] CPU: 10 UID: 0 PID: 209 Comm: kunit_try_catch Tainted:
> G                 N  7.1.0-rc7-next-20260612-virtme #6
> PREEMPT_{RT,(full)}
> [   44.822197] Tainted: [N]=TEST
> [   44.822210] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX,
> arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   44.822328] Call Trace:
> [   44.822377]  <TASK>
> [   44.822806]  dump_stack_lvl+0x78/0xe0
> [   44.822860]  __lock_acquire+0x926/0x1c90
> [   44.822888]  lock_acquire+0xd3/0x310
> [   44.822901]  ? __schedule+0x102/0x13d0
> [   44.822919]  ? rcu_qs+0x2d/0x1a0
> [   44.822954]  _raw_spin_lock_nested+0x36/0x50
> [   44.822966]  ? __schedule+0x102/0x13d0
> [   44.822979]  __schedule+0x102/0x13d0
> [   44.822993]  ? mark_held_locks+0x40/0x70
> [   44.823009]  preempt_schedule_irq+0x37/0x70
> [   44.823018]  irqentry_exit+0x1da/0x8c0
> [   44.823032]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [   44.823093] RIP: 0010:mock_printk_react+0x2a/0x50
> [   44.823250] Code: f3 0f 1e fa 0f 1f 44 00 00 41 54 49 89 f4 55 48
> 89 fd 53 e8 18 8b db ff 4c 89 e6 48 89 ef 48 89 c3 e8 fa 8e ed ff eb
> 02 f3 90 <e8> 01 8b db ff 48 29 d8 48 3d 3f 4b 4c 00 76 ee 5b 5d 41
> 5c c3 cc
> [   44.823303] RSP: 0018:ffffd1c3c0733d38 EFLAGS: 00000297
> [   44.823332] RAX: 00000000000119f3 RBX: 0000000a74e60d1c RCX:
> 000000000000001f
> [   44.823342] RDX: 0000000000000000 RSI: 000000003348c8a2 RDI:
> ffffffffc1abbfd9
> [   44.823351] RBP: ffffffffb671b613 R08: 0000000000000002 R09:
> 0000000000000000
> [   44.823359] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffd1c3c0733d60
> [   44.823367] R13: ffffffffb575a5fd R14: ffffd1c3c0017be8 R15:
> ffffd1c3c00179f8
> [   44.823397]  ? rv_react+0x9d/0xf0
> [   44.823437]  ? mock_printk_react+0x2f/0x50
> [   44.823448]  rv_react+0xb4/0xf0
> [   44.823455]  ? rv_react+0x9d/0xf0
> [   44.823476]  test_printk_react_called+0x83/0xb0
> [   44.823486]  ? __pfx_mock_printk_react+0x10/0x10
> [   44.823502]  ? __pfx_mock_printk_react+0x10/0x10
> [   44.823513]  kunit_try_run_case+0x97/0x190
> [   44.823534]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
> [   44.823544]  kunit_generic_run_threadfn_adapter+0x21/0x40
> [   44.823551]  kthread+0x124/0x160
> [   44.823562]  ? __pfx_kthread+0x10/0x10
> [   44.823574]  ret_from_fork+0x291/0x3b0
> [   44.823585]  ? __pfx_kthread+0x10/0x10
> [   44.823595]  ret_from_fork_asm+0x1a/0x30
> [   44.823641]  </TASK>
> 
> 
> Patch 1 fixes the lockdep bug by correcting rv_react()'s
> wait_type_inner
> from LD_WAIT_CONFIG (which inherits the outer context) to
> LD_WAIT_SPIN
> (the tightest constraint callbacks must satisfy).
> 
> Patch 2 adds KUnit tests for reactor_printk. The busy-wait in the
> mock
> callback reproduces the timer interrupt scenario that exposes the
> bug.
> 
> Patch 3 adds KUnit tests for reactor_panic, exercising the panic
> notifier
> chain without halting the system.
> 
> Tested with CONFIG_PROVE_LOCKING=y and CONFIG_KUNIT=y.
> 
> 
> Wen Yang (3):
>   rv/reactors: fix lockdep "Invalid wait context" in rv_react()
>   rv/reactors: add KUnit tests for reactor_printk
>   rv/reactors: add KUnit tests for reactor_panic
> 
>  kernel/trace/rv/Kconfig                |  20 ++++
>  kernel/trace/rv/Makefile               |   2 +
>  kernel/trace/rv/reactor_panic_kunit.c  | 106 +++++++++++++++++++++
>  kernel/trace/rv/reactor_printk_kunit.c | 123
> +++++++++++++++++++++++++
>  kernel/trace/rv/rv_reactors.c          |   8 +-
>  5 files changed, 258 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/trace/rv/reactor_panic_kunit.c
>  create mode 100644 kernel/trace/rv/reactor_printk_kunit.c


^ permalink raw reply

* [GIT PULL v2] RTLA additional fixes for v7.2
From: Tomas Glozar @ 2026-06-17 15:30 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, linux-trace-kernel, Tomas Glozar

Steven,

The following changes since commit 6b5a2b7d9bc156e505f09e698d85d6a1547c1206:

  Merge tag 'trace-tools-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace (2026-06-16 17:50:34 +0530)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git tags/rtla-v7.2-fixups-v2

for you to fetch changes up to c35eb77a67515d4201bc91294f40761591f43bbd:

  rtla/tests: Fix pgrep filter in get_workload_pids.sh (2026-06-17 16:26:44 +0200)

----------------------------------------------------------------
RTLA additional fixes for v7.2

- Fix and clean up .gitignore

Narrow match range of entries in .gitignore to only what is needed,
fixing "lib/" matching tools/tracing/rtla/tests/scripts/lib/*.

- Fix pgrep filter in runtime tests

Make the pgrep filter used by runtime tests to get workload PIDs work
on both older and newer versions of pgrep, regardless of whether
square brackets are counted as part of kthread comm or not.

Build, runtime tests, unit tests pass.

v2:
- Rebase onto 6b5a2b7d9bc156e505f09e698d85d6a1547c1206 to avoid merge
  conflicts.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>

----------------------------------------------------------------
Tomas Glozar (2):
      rtla: Fix and clean up .gitignore
      rtla/tests: Fix pgrep filter in get_workload_pids.sh

 tools/tracing/rtla/.gitignore                             | 13 ++++---------
 tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh |  2 +-
 2 files changed, 5 insertions(+), 10 deletions(-)


^ permalink raw reply

* Re: [PATCH v3 9/9] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-06-17 15:09 UTC (permalink / raw)
  To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <4aeb668c8446a9f6366d92e218df386bef7bc965.1780847473.git.wen.yang@linux.dev>

On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add selftest coverage for the tlob uprobe monitoring interface under
> tools/testing/selftests/verification/.
> 
> test.d/tlob/ contains both the helper sources (tlob_target, tlob_sym)
> and the seven test scripts so the test suite is self-contained.
> tlob_target provides busy-spin, sleep, and preempt workloads;
> tlob_sym
> resolves ELF symbol offsets for uprobe registration.
> 
> Seven test scripts exercise uprobe binding management, budget
> violation
> detection, and per-state time accounting (running_ns, waiting_ns,
> sleeping_ns).
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Tests look fine and coverage is good, thanks!

Minor comments follow.

> ---
>  .../testing/selftests/verification/.gitignore |   2 +
>  tools/testing/selftests/verification/Makefile |  19 +-
>  .../verification/test.d/tlob/Makefile         |  20 ++
>  .../verification/test.d/tlob/test.d/functions |   1 +
>  .../verification/test.d/tlob/tlob_sym.c       | 189
> ++++++++++++++++++
>  .../verification/test.d/tlob/tlob_target.c    | 138 +++++++++++++
>  .../verification/test.d/tlob/uprobe_bind.tc   |  37 ++++

>  .../test.d/tlob/uprobe_detail_running.tc      |  51 +++++
>  .../test.d/tlob/uprobe_detail_sleeping.tc     |  50 +++++
>  .../test.d/tlob/uprobe_detail_waiting.tc      |  66 ++++++

Not sure if this would work, but just to lower the maintenance burden,
couldn't we put these 3 in the same test case? You could define a bash
function and pass "running", "sleeping" or "waiting" and whether to launch the
hog to that.

Only waiting uses a taskset and a slightly different ordering, but wouldn't
they all work fine like that?

...

> a/tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
> b/tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
> new file mode 100644
> index 000000000000..1b7ba1c6d95b
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_sym.c - ELF symbol-to-file-offset utility for tlob selftests
> + *
> + * Usage: tlob_sym sym_offset <binary> <symbol>
> + *
> + *   Prints the ELF file offset of <symbol> in <binary> to stdout.
> + *
> + * Exit: 0 = found, 1 = error / not found.
> + */

I wonder if instead of maintaining a pure C solution we couldn't live
with something like:

  sym_offset() { # target symbol
    readelf -W -S -s $1 | awk -v symbol="$2" '
      { gsub(/\[ /, "[") }  # normalise section markers
      $1 ~ /^\[[0-9]+\]$/ { sections[$1]="0x"$4; offsets[$1]="0x"$5 }
      $1 ~ /^[0-9]+:$/ && $NF == symbol { addr="0x"$2; sec="["$7"]" }
      END { printf "printf \"0x%%x\\n\" $((%s - %s + %s))\n", addr, sections[sec], offsets[sec] }
    ' | sh
  }

...

> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/tlob_target.c
> b/tools/testing/selftests/verification/test.d/tlob/tlob_target.c
> new file mode 100644
> index 000000000000..0fdbc575d71d
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/tlob_target.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides three start/stop probe pairs, each designed to exercise
> a
> + * different dominant component of the detail_env_tlob ns breakdown:
> + *
> + *   tlob_busy_work    / tlob_busy_work_done    - busy-spin:
> running_ns dominates
> + *   tlob_sleep_work   / tlob_sleep_work_done   - nanosleep:
> sleeping_ns dominates
> + *   tlob_preempt_work / tlob_preempt_work_done - busy-spin:
> waiting_ns dominates
> + *                                                (needs an RT

In short tlob_preempt_work is the same as tlob_busy_work, isn't it? Do we need
them both? Cannot you just have a hog in the test and keep using the same
function?

...

> +
> +	do {
> +		if (strcmp(mode, "sleep") == 0)
> +			tlob_sleep_work(200);
> +		else if (strcmp(mode, "preempt") == 0)
> +			tlob_preempt_work(200);
> +		else
> +			tlob_busy_work(200 * 1000000UL);

The only difference I see is that you multiply by 1000000UL here for busy and
in the function for preempt.
Cannot we make them all consistent (call with 200 and do the math inside)?


> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> new file mode 100644
> index 000000000000..1ac3db6ca7bb
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> @@ -0,0 +1,37 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob monitor uprobe binding (visible in monitor
> file, removable, duplicate rejected)
> +# requires: tlob:monitor
> +
> +RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
> +UPROBE_TARGET="${RV_BINDIR}/tlob_target"
> +TLOB_SYM="${RV_BINDIR}/tlob_sym"
> +[ -x "$UPROBE_TARGET" ] || exit_unsupported
> +[ -x "$TLOB_SYM" ]      || exit_unsupported

If those aren't ready, the build system didn't work, I don't think we need to
check here, it's just a clear error.

> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET"
> tlob_busy_work_done 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported

Kind of the same here, the rest of the test should probably fail (EINVAL in the
monitor or whatever). The script will print everything with set -x and it
should be clear what was missing.

> +command -v chrt    > /dev/null || exit_unsupported
> +command -v taskset > /dev/null || exit_unsupported

Not sure how common it is not to have those, but this is exactly what the
:program under requires: is for (see rv_wwnr_printk with stress-ng).

Thanks,
Gabriele


^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Shakeel Butt @ 2026-06-17 15:03 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: JP Kobryn, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <06122cae-e28b-4ded-a9dd-d380d31c5230@kernel.org>

On Wed, Jun 17, 2026 at 01:11:16PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 21:52, JP Kobryn wrote:
> > LRU add batches can be drained before they reach capacity. This can be a
> > source of LRU lock contention, but it is not currently possible to
> > attribute these drains to callers with existing tracepoints.
> > 
> > Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> > lru_add batch is drained. This allows tracing to distinguish full drains
> > from partial drains and attribute them to the calling stack.
> > 
> > Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
> > whether they set the force flag for all CPUs. The tracepoint resembles
> > the signature of the enclosing function, but is needed because of
> > potential inlining.
> > 
> > Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
> > ---
> >  include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
> >  mm/swap.c                      |  7 ++++++-
> >  2 files changed, 43 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
> > index 171524d3526d..ff3da07ccb40 100644
> > --- a/include/trace/events/pagemap.h
> > +++ b/include/trace/events/pagemap.h
> > @@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
> >  	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
> >  );
> >  
> > +TRACE_EVENT(mm_lru_add_drain,
> > +
> > +	TP_PROTO(int cpu, unsigned int nr),
> > +
> > +	TP_ARGS(cpu, nr),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(int,		cpu	)
> > +		__field(unsigned int,	nr	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->cpu	= cpu;
> > +		__entry->nr	= nr;
> > +	),
> > +
> > +	TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
> > +);
> > +
> > +TRACE_EVENT(mm_lru_add_drain_all,
> > +
> > +	TP_PROTO(bool force_all_cpus),
> > +
> > +	TP_ARGS(force_all_cpus),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(bool,	force_all_cpus	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->force_all_cpus	= force_all_cpus;
> > +	),
> > +
> > +	TP_printk("force_all_cpus=%s",
> > +		__entry->force_all_cpus ? "true" : "false")
> > +);
> > +
> >  #endif /* _TRACE_PAGEMAP_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 588f50d8f1a8..e14b7612f896 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
> >  {
> >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> >  	struct folio_batch *fbatch = &fbatches->lru_add;
> > +	unsigned int nr_folios_add = folio_batch_count(fbatch);
> >  
> > -	if (folio_batch_count(fbatch))
> > +	if (nr_folios_add) {
> >  		folio_batch_move_lru(fbatch, lru_add);
> > +		trace_mm_lru_add_drain(cpu, nr_folios_add);
> > +	}
> >  
> >  	fbatch = &fbatches->lru_move_tail;
> >  	/* Disabling interrupts below acts as a compiler barrier. */
> > @@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
> >  	if (WARN_ON(!mm_percpu_wq))
> >  		return;
> >  
> > +	trace_mm_lru_add_drain_all(force_all_cpus);
> > +
> >  	/*
> >  	 * Guarantee folio_batch counter stores visible by this CPU
> >  	 * are visible to other CPUs before loading the current drain
> 
> Given that trace events can quickly become stable ABI [1], are we really sure we
> want to add this?

Yes, I think so as this is useful to get insights into lru cache draining.
Trace events being stable or not is secondary IMHO. If in future we rearchitect
the lru page handling where there is no cache draining anymore, we can make
these a noops.

> 
> [1] https://lore.kernel.org/r/20260603130006.7d2c4a62@gandalf.local.home
> 
> -- 
> Cheers,
> 
> David

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-17 14:03 UTC (permalink / raw)
  To: Balbir Singh
  Cc: David Hildenbrand (Arm), lsf-pc, linux-kernel, linux-cxl, cgroups,
	linux-mm, linux-trace-kernel, damon, kernel-team, gregkh, rafael,
	dakr, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ajIb4DJdLGPbMB4V@parvat>

On Wed, Jun 17, 2026 at 02:02:47PM +1000, Balbir Singh wrote:
> On Wed, Jun 10, 2026 at 12:37:34PM -0400, Gregory Price wrote:
> > On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> > > On 6/10/26 12:41, Gregory Price wrote:
> > > > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > > 
> > 
> > For mm/slub.c we can choose to do one of thwo things
> > 
> >   1) 100% refuse slab allocations on private nodes, i.e.:
> > 
> >      kmalloc_node(..., private_nid, __GFP_THISNODE)
> > 
> >      And will fail (return NULL).
> > 
> 
> Doesn't this iterate through N_MEMORY only? N_MEMORY_PRIVATE should not
> be in the regular for_each(...) loops
> 

If a node is in neither FALLBACK nor NOFALLBACK - it is *completely*
unreachable in the current page allocator.

Next RFC I've reduced this to create a ZONELIST_PRIVATE separate from
the ZONELIST_FALLBACK and ZONELIST_NOFALLBACK, and an explicit folio
allocation interface that selects which fallback list to use.

the feedback in the past week has been helpful in honing in on a
solution that I think is generalizable.  Have just been taking the time
to test various behaviors to make sure I haven't been regressing any
userland API/ABIs (mbind, mempolicy, etc).

~Gregory

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox