* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-01 16:07 UTC
To: David Hildenbrand (Arm), Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <0bb49c47-8c41-478c-847e-b9154c75e59c@kernel.org>
On 2026/6/1 23:05, David Hildenbrand (Arm) wrote:
> On 6/1/26 17:00, Nico Pache wrote:
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>> On 6/1/26 12:47, Lance Yang wrote:
>>>>
>>>>
>>>>
>>>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>>>
>>>> So yeah, no extra smp_wmb() needed here!
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
>>
>> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
>> before walking a PTE table?
>
> But how would they update the cache then correctly?
>
> I'm too non-MIPS to know the answer :)
Right, that's my concern as well ...
If MIPS sees pmd_none(), it has no PTE table to walk, so it also has
no way to do the cache update it wanted to do, I guess :)
But, the PTE table is not really gone there. khugepaged only cleared
the PMD temporarily while still using the old PTE table through _pmd.
So I'd go with David's suggestion:
"
Best to make sure the page table is already installed when updating
the entries.
"
Cheers, Lance
* Re: [PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 16:21 UTC
To: Masami Hiramatsu (Google)
Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
Jiri Olsa
In-Reply-To: <20260531101458.c8ee22f6222a3fc224cc5328@kernel.org>
On Sun, 31 May 2026 10:14:58 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > Does this prematurely release the BTF struct reference?
> > > If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> > > later steps in traceprobe_parse_probe_arg_body() (like
> > > find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> > > When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> > > will evaluate to ctx->btf (which is NULL for eprobes).
> > > Could this potentially lead to silent defaults, such as 64-bit reads for
> > > smaller fields, or fail to inject pointer dereferences for string fields,
> > > while also leaving ctx->last_type pointing to a prematurely released BTF
> > > object?
> >
> > Does this mean we need to set ctx->last_type to NULL here too?
>
> No, since the member we refer can be different from unsigned long.
> When we don't have ":type" suffix, we use BTF type information to
> decide appropriate type.
>
> >
> > Because everything above is pretty much the expected behavior. The put is
> > *not* premature. The last_struct and struct_btf are both set to NULL. I
> > guess the only thing missing is to reset last_type as well.
>
> No, as I explained, the last_type is used to determine the member type
> when user does not specify the ":type" suffix.
>
> So, what we need to do is deferring the btf_put(struct_btf) as below:
> (no build test yet.)
OK, but I don't think we want the struct_btf to exist beyond a single
arg like the btf descriptor does. How about this (on top of this change),
where it clears the struct_btf at the end of traceprobe_parse_probe_arg_body()?
Also, I see the flag as redundant, so this uses the existence of
struct_btf to denote that it's parsing a typecast to a struct.
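As a rough illustration of that idea (a minimal userspace sketch, not the kernel code: the `mock_` types and functions are hypothetical stand-ins for `struct btf`, `btf_put()` and the parse context), the pointer itself can act as the "typecast in progress" flag, and it is dropped once per argument:

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's refcounted struct btf. */
struct mock_btf {
	int refcnt;
};

/* Hypothetical, trimmed-down parse context holding only the BTF pointers. */
struct mock_ctx {
	struct mock_btf *btf;        /* lives across all arguments */
	struct mock_btf *struct_btf; /* set only while a typecast arg is parsed */
};

/* Stand-in for btf_put(): drop one reference. */
void mock_btf_put(struct mock_btf *btf)
{
	btf->refcnt--;
}

/* Prefer the typecast BTF when present; no separate flag is needed. */
struct mock_btf *mock_ctx_btf(struct mock_ctx *ctx)
{
	return ctx->struct_btf ? ctx->struct_btf : ctx->btf;
}

/* Called once at the end of parsing a single argument. */
void mock_clear_struct_btf(struct mock_ctx *ctx)
{
	if (ctx->struct_btf) {
		mock_btf_put(ctx->struct_btf);
		ctx->struct_btf = NULL;
	}
}
```

With this shape, anything that runs before the per-argument clear still sees the typecast BTF, and the next argument starts from the plain `btf` again.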
-- Steve
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9246e9c3d066..56b7dc406ca1 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -397,8 +397,7 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
{
- return ctx->flags & TPARG_FL_TYPECAST ?
- ctx->struct_btf : ctx->btf;
+ return ctx->struct_btf ? : ctx->btf;
}
static int check_prepare_btf_string_fetch(char *typename,
@@ -531,6 +530,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
return 0;
}
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ if (ctx->struct_btf) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ ctx->last_struct = NULL;
+ }
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
if (ctx->btf) {
@@ -579,7 +587,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct fetch_insn *code = *pcode;
const struct btf_member *field;
u32 bitoffs, anon_offs;
- bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
+ bool is_struct = ctx->struct_btf != NULL;
struct btf *btf = ctx_btf(ctx);
char *next;
int is_ptr;
@@ -690,7 +698,7 @@ static int parse_btf_arg(char *varname,
ret = parse_trace_event(varname, code, ctx);
if (ret < 0)
return ret;
- if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
+ if (WARN_ON_ONCE(ctx->struct_btf == NULL))
return -EINVAL;
type = ctx->last_struct;
goto found_type;
@@ -804,21 +812,19 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
{
+ struct btf *btf = NULL;
int id;
- if (!ctx->struct_btf) {
- struct btf *btf;
-
- id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
- if (id < 0)
- return id;
- ctx->struct_btf = btf;
- } else {
- id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
- if (id < 0)
- return id;
+ /* Could be for a structure in a different module */
+ if (ctx->struct_btf) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
}
+ id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+ if (id < 0)
+ return id;
+ ctx->struct_btf = btf;
ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
return 0;
}
@@ -848,25 +854,23 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
if (ret < 0) {
trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
- ret = -EINVAL;
- goto out_put;
+ return -EINVAL;
}
- ctx->flags |= TPARG_FL_TYPECAST;
tmp++;
ctx->offset += tmp - arg;
ret = parse_btf_arg(tmp, pcode, end, ctx);
- ctx->flags &= ~TPARG_FL_TYPECAST;
- ctx->last_struct = NULL;
-out_put:
- btf_put(ctx->struct_btf);
- ctx->struct_btf = NULL;
return ret;
}
#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ ctx->struct_btf = NULL;
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
ctx->btf = NULL;
@@ -1673,6 +1677,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
}
kfree(tmp);
+ /* struct_btf should not be passed to other arguments */
+ clear_struct_btf(ctx);
+
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 952e3d7582b8..83565f1634db 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,7 +394,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
* TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
* TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
* TPARG_FL_KERNEL.
- * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
*/
#define TPARG_FL_RETURN BIT(0)
#define TPARG_FL_KERNEL BIT(1)
@@ -403,7 +402,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
#define TPARG_FL_USER BIT(4)
#define TPARG_FL_FPROBE BIT(5)
#define TPARG_FL_TPOINT BIT(6)
-#define TPARG_FL_TYPECAST BIT(7)
#define TPARG_FL_LOC_MASK GENMASK(4, 0)
static inline bool tparg_is_function_entry(unsigned int flags)
* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Nico Pache @ 2026-06-01 17:05 UTC
To: Alexander Gordeev
Cc: Andrew Morton, Gerald Schaefer, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
linux-s390, linux-next
In-Reply-To: <20260601155808.2755103A59-agordeev@linux.ibm.com>
On Mon, Jun 1, 2026 at 9:58 AM Alexander Gordeev <agordeev@linux.ibm.com> wrote:
>
> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>
> Hi Andrew et al,
>
> > On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Thanks, I've update mm.git's mm-unstable branch to this version.
> >
> > It sounds like I might be dropping it soon, haven't started looking at
> > that yet. But let's at least eyeball the latest version at this time.
> >
> > Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> > well, thanks. The AI checking made a few allegations:
>
> This series appears to cause hangs on s390 in linux-next.
> The issue is not easily reproducible, so it is not yet confirmed.
> Any ideas for a reliable reproducer that exercises the code path below?
Hi,
Thanks for the report!
Was this caught by syzbot? If so, can you provide a link?
Also, can you tell us whether any of the mTHP sysfs settings were enabled?
Based on the report, it looks like we are dealing with more
lock contention (due to holding the write lock longer). We could
switch to a trylock, but that might cause us to lose some collapse
attempts (which will be retried later, so probably fine). I'm OK with
that approach if it prevents these potential regressions.
Cheers,
-- Nico
>
> [ 2749.385719] sysrq: Show Blocked State
> [ 2749.385730] task:khugepaged state:D stack:0 pid:209 tgid:209 ppid:2 task_flags:0x200040 flags:0x00000000
> [ 2749.385735] Call Trace:
> [ 2749.385736] [<0000017f63c8b226>] __schedule+0x316/0x890
> [ 2749.385740] [<0000017f63c8b7dc>] schedule+0x3c/0xc0
> [ 2749.385743] [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
> [ 2749.385746] [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
> [ 2749.385749] [<0000017f63c90910>] down_write+0x70/0x80
> [ 2749.385752] [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
> [ 2749.385755] [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
> [ 2749.385757] [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
> [ 2749.385760] [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
> [ 2749.385762] [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
> [ 2749.385765] [<0000017f63137cb6>] khugepaged+0x226/0x240
> [ 2749.385768] [<0000017f62db3128>] kthread+0x148/0x170
> [ 2749.385770] [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
> [ 2749.385772] [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>
> Thanks!
>
* [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 17:07 UTC
To: LKML, Linux Trace Kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
From: Steven Rostedt <rostedt@goodmis.org>
Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.
Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.
But an event probe that records a field that happens to be a pointer to
a structure cannot dereference these values with BTF naming, and
must use numerical offsets.
For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:
(gdb) p &((struct sk_buff *)0)->dev
$1 = (struct net_device **) 0x10
(gdb) p &((struct net_device *)0)->name
$2 = (char (*)[16]) 0x118
And then use the raw numbers to dereference:
# echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
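For readers unfamiliar with the gdb idiom above, it computes the same thing as C's `offsetof()`. The sketch below shows that, plus the manual two-step dereference the raw-offset syntax performs. The `mock_` structures are hypothetical and deliberately do not match the real `sk_buff` or `net_device` layouts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical structures for illustration; not the real kernel layouts. */
struct mock_net_device {
	long pad[3];
	char name[16];
};

struct mock_sk_buff {
	void *next;
	void *prev;
	struct mock_net_device *dev;
};

/* Same value gdb prints for: p &((struct mock_sk_buff *)0)->dev */
size_t mock_dev_offset(void)
{
	return offsetof(struct mock_sk_buff, dev);
}

/* Same value gdb prints for: p &((struct mock_net_device *)0)->name */
size_t mock_name_offset(void)
{
	return offsetof(struct mock_net_device, name);
}

/* Follow the two offsets by hand, as +NAME_OFF(+DEV_OFF(ptr)) would. */
const char *mock_deref_name(const void *skb)
{
	struct mock_net_device *dev;

	/* Load the pointer stored at skb + DEV_OFF. */
	memcpy(&dev, (const char *)skb + mock_dev_offset(), sizeof(dev));
	/* The name array starts at dev + NAME_OFF. */
	return (const char *)dev + mock_name_offset();
}
```

The offsets change whenever the structure layout changes, which is why hard-coding them in a probe definition is fragile.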
If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.
# echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
# echo 1 > events/eprobes/xmit/enable
# cat trace
[..]
sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
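A minimal userspace sketch of how such an argument could be split into the struct name and the remaining FIELD->MEMBER chain is shown here. `mock_split_typecast()` is a hypothetical helper written for illustration; it is not the kernel's parser and its error handling is an assumption.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "(STRUCT)FIELD->MEMBER..." into the struct
 * name (copied into sname) and the remainder after the closing ')'.
 * Returns 0 on success, -1 on malformed input or a too-small buffer.
 */
int mock_split_typecast(const char *arg, char *sname, size_t sname_len,
			const char **rest)
{
	const char *close;
	size_t len;

	if (arg[0] != '(')
		return -1;

	close = strchr(arg, ')');
	if (!close)
		return -1;

	/* Length of the text between '(' and ')'. */
	len = (size_t)(close - (arg + 1));
	if (len == 0 || len >= sname_len)
		return -1;

	memcpy(sname, arg + 1, len);
	sname[len] = '\0';
	*rest = close + 1;
	return 0;
}
```

For the example above, the struct name would be "sk_buff" and the remainder "skbaddr->dev->name".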
Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
- Add error message in parse_btf_arg() for failed parsing of TEVENT.
(Sashiko)
- Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
The flag was redundant and added unnecessary complexity.
- Restructure to keep the lifetime of the TYPECAST to the end of
traceprobe_parse_probe_arg_body(). This allows the last_type to stay
around in case there's not a type parameter and then btf can still be
used.
(Sashiko and Masami Hiramatsu)
Documentation/trace/eprobetrace.rst | 4 +
kernel/trace/trace_probe.c | 173 +++++++++++++++++++++++-----
kernel/trace/trace_probe.h | 5 +-
3 files changed, 154 insertions(+), 28 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and "bitfield" are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then dereference the pointer defined by
+ ->MEMBER. Note that when this is used, the FIELD name does not
+ need to be prefixed with a '$'.
Types
-----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 695310571b08..fd1caa1f9723 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
return -ENOENT;
}
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+ struct traceprobe_parse_context *ctx)
+{
+ int ret;
+
+ if (code->data)
+ return -EFAULT;
+ ret = parse_trace_event_arg(arg, code, ctx);
+ if (!ret)
+ return 0;
+ if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+ code->op = FETCH_OP_COMM;
+ return 0;
+ }
+ return -EINVAL;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
&& BTF_INT_BITS(intdata) == 8;
}
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+ return ctx->struct_btf ? : ctx->btf;
+}
+
static int check_prepare_btf_string_fetch(char *typename,
struct fetch_insn **pcode,
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
if (!btf || !ctx->last_type)
return 0;
@@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
return 0;
}
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ if (ctx->struct_btf) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ ctx->last_struct = NULL;
+ }
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
if (ctx->btf) {
@@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct fetch_insn *code = *pcode;
const struct btf_member *field;
u32 bitoffs, anon_offs;
+ bool is_struct = ctx->struct_btf != NULL;
+ struct btf *btf = ctx_btf(ctx);
char *next;
int is_ptr;
s32 tid;
do {
- /* Outer loop for solving arrow operator ('->') */
- if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
- trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
- return -EINVAL;
- }
- /* Convert a struct pointer type to a struct type */
- type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
+ if (!is_struct) {
+ /* Outer loop for solving arrow operator ('->') */
+ if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ /* Convert a struct pointer type to a struct type */
+ type = btf_type_skip_modifiers(btf, type->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
}
+ /* Only the first type can skip being a pointer */
+ is_struct = false;
bitoffs = 0;
do {
@@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return is_ptr;
anon_offs = 0;
- field = btf_find_struct_member(ctx->btf, type, fieldname,
+ field = btf_find_struct_member(btf, type, fieldname,
&anon_offs);
if (IS_ERR(field)) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
ctx->last_bitsize = 0;
}
- type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
int i, is_ptr, ret;
u32 tid;
- if (WARN_ON_ONCE(!ctx->funcname))
+ if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
return -EINVAL;
is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (ctx->flags & TPARG_FL_TEVENT) {
+ ret = parse_trace_event(varname, code, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
+ return ret;
+ }
+ /* TEVENT is only here via a typecast */
+ if (WARN_ON_ONCE(ctx->struct_btf == NULL))
+ return -EINVAL;
+ type = ctx->last_struct;
+ goto found_type;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void */
@@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
found:
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
static const struct fetch_type *find_fetch_type_from_btf_type(
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
const char *typestr = NULL;
if (btf && ctx->last_type)
@@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
return 0;
}
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+ struct btf *btf = NULL;
+ int id;
+
+ /* A struct_btf should only be used by a single argument */
+ if (WARN_ON_ONCE(ctx->struct_btf)) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ }
+
+ id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+ if (id < 0)
+ return id;
+ ctx->struct_btf = btf;
+ ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+ return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ char *tmp;
+ int ret;
+
+ /* Currently this only works for eprobes */
+ if (!(ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+ return -EINVAL;
+ }
+
+ tmp = strchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ *tmp = '\0';
+ ret = query_btf_struct(arg + 1, ctx);
+ *tmp = ')';
+
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ tmp++;
+
+ ctx->offset += tmp - arg;
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
+ return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ ctx->struct_btf = NULL;
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
ctx->btf = NULL;
@@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
return 0;
}
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
@@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
int len;
if (ctx->flags & TPARG_FL_TEVENT) {
- if (code->data)
- return -EFAULT;
- ret = parse_trace_event_arg(arg, code, ctx);
- if (!ret)
- return 0;
- if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
- code->op = FETCH_OP_COMM;
- return 0;
- }
- goto inval;
+ if (parse_trace_event(arg, code, ctx) < 0)
+ goto inval;
+ return 0;
}
if (str_has_prefix(arg, "retval")) {
@@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
code->op = FETCH_OP_IMM;
}
break;
+ case '(':
+ ret = handle_typecast(arg, pcode, end, ctx);
+ break;
default:
if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
@@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
}
kfree(tmp);
+ /* struct_btf should not be passed to other arguments */
+ clear_struct_btf(ctx);
+
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 1076f1df347b..15758cc11fc6 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -422,7 +422,9 @@ struct traceprobe_parse_context {
const struct btf_param *params; /* Parameter of the function */
s32 nr_params; /* The number of the parameters */
struct btf *btf; /* The BTF to be used */
+ struct btf *struct_btf; /* The BTF to be used for structs */
const struct btf_type *last_type; /* Saved type */
+ const struct btf_type *last_struct; /* Saved structure */
u32 last_bitoffs; /* Saved bitoffs */
u32 last_bitsize; /* Saved bitsize */
struct trace_probe *tp;
@@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
- C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
+ C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
#undef C
#define C(a, b) TP_ERR_##a
--
2.53.0
* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-01 17:08 UTC
To: Alexander Gordeev
Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, aarcange,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, linux-s390, linux-next
In-Reply-To: <20260601155808.2755103A59-agordeev@linux.ibm.com>
On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>
> Hi Andrew et al,
>
> > On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Thanks, I've update mm.git's mm-unstable branch to this version.
> >
> > It sounds like I might be dropping it soon, haven't started looking at
> > that yet. But let's at least eyeball the latest version at this time.
> >
> > Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> > well, thanks. The AI checking made a few allegations:
>
> This series appears to cause hangs on s390 in linux-next.
> The issue is not easily reproducible, so it is not yet confirmed.
> Any ideas for a reliable reproducer that exercises the code path below?
>
> [ 2749.385719] sysrq: Show Blocked State
> [ 2749.385730] task:khugepaged state:D stack:0 pid:209 tgid:209 ppid:2 task_flags:0x200040 flags:0x00000000
> [ 2749.385735] Call Trace:
> [ 2749.385736] [<0000017f63c8b226>] __schedule+0x316/0x890
> [ 2749.385740] [<0000017f63c8b7dc>] schedule+0x3c/0xc0
> [ 2749.385743] [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
> [ 2749.385746] [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
> [ 2749.385749] [<0000017f63c90910>] down_write+0x70/0x80
> [ 2749.385752] [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
> [ 2749.385755] [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
> [ 2749.385757] [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
> [ 2749.385760] [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
> [ 2749.385762] [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
> [ 2749.385765] [<0000017f63137cb6>] khugepaged+0x226/0x240
> [ 2749.385768] [<0000017f62db3128>] kthread+0x148/0x170
> [ 2749.385770] [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
> [ 2749.385772] [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>
> Thanks!
Hi Alexander,
Thanks for the report.
It's a pity it's not reproducible. I had Claude have a look at it and it couldn't find
a definite issue with the code at v18; all the locks seem balanced internally.
Things it highlighted FWIW:
- Far more mmap_write_lock()'s being taken - the stack-based approach calls
collapse_huge_page() multiple times per-PMD each of which entails an mmap read
lock/unlock and mmap write lock.
- anon_vma write lock held for a much longer period over partial collapse.
So maybe these are triggering issues rather than being the cause of them per se?
If you happen to see it again could you give the output for:
'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
get more details on it?
Also the .config would be useful.
I'm guessing you've also not enabled mTHP in any way on the system?
Repro-wise you could also:
# echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
# echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
To get khugepaged going more aggressively:
$ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done
Then maybe some stress-ng like sudo stress-ng --vm 4 --vm-bytes 2G --vm-method
all --timeout 5m (or maybe something more refined :)?
Maybe some of this will help repro more reliably?
Cheers, Lorenzo
* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 17:31 UTC
To: LKML, Linux Trace Kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260601130746.2139d926@gandalf.local.home>
On Mon, 1 Jun 2026 13:07:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
>
> - Add error message in parse_btf_args() for failed parsing of TEVENT.
> (Sashiko)
>
> - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> The flag was redundant and added unnecessary complexity.
>
> - Restructure to keep the lifetime of the TYPECAST to the end of
> traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> around in case there's not a type parameter and then btf can still be
> used.
> (Sashiko and Masami Hiramatsu)
And I rebased onto probes/for-next
-- Steve
* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-01 17:56 UTC
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260529001519.14ca9dbe92fb2622249137c6@kernel.org>
On Fri, May 29, 2026 at 12:15:19AM +0900, Masami Hiramatsu wrote:
> On Wed, 27 May 2026 09:41:33 -0700
> Breno Leitao <leitao@debian.org> wrote:
>
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> >
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> >
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> >
>
> Thanks Breno, yes, that is what I think about.
> Let me check it. And could you also check Sashiko's comments?
Yes, I've spent some time on them, and it did report some good points.
I will fix those and resend.
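For reference, the soft-error overflow behaviour described in the quoted cover letter can be pictured with a small userspace sketch. `mock_prepend_cmdline()` is a hypothetical helper, not the code from the series. It only illustrates "prepend if it fits, otherwise leave the buffer untouched"; the logging on failure is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: prepend "prefix" plus a separating space to the
 * NUL-terminated string held in cmdline[size].
 * Returns 0 on success, or -1 if the result would not fit, in which case
 * cmdline is left exactly as it was.
 */
int mock_prepend_cmdline(char *cmdline, size_t size, const char *prefix)
{
	size_t plen = strlen(prefix);
	size_t clen = strlen(cmdline);

	if (plen == 0)
		return 0;

	/* prefix + ' ' + existing contents + NUL must fit in the buffer. */
	if (plen + 1 + clen + 1 > size)
		return -1;

	/* Shift the existing contents (including the NUL) to make room. */
	memmove(cmdline + plen + 1, cmdline, clen + 1);
	memcpy(cmdline, prefix, plen);
	cmdline[plen] = ' ';
	return 0;
}
```

Checking the size before touching the buffer is what makes the failure path harmless: the original command line survives intact.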
Thanks!
--breno
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Andrii Nakryiko @ 2026-06-01 17:57 UTC
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260528222051.60b38433@fedora>
On Thu, May 28, 2026 at 7:20 PM Steven Rostedt <rostedt@kernel.org> wrote:
>
> On Thu, 28 May 2026 16:01:06 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> >
> > [...]
> >
> > > * Architecture-specific system calls
> > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > > index a627acc8fb5f..17042d7e5e87 100644
> > > --- a/include/uapi/asm-generic/unistd.h
> > > +++ b/include/uapi/asm-generic/unistd.h
> > > @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> > > #define __NR_rseq_slice_yield 471
> > > __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
> > >
> > > +#define __NR_sframe_register 472
> > > +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> > > +#define __NR_sframe_unregister 473
> > > +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> > > +
> > > #undef __NR_syscalls
> > > -#define __NR_syscalls 472
> > > +#define __NR_syscalls 474
> > >
> > > /*
> > > * 32 bit systems traditionally used different
> > > diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> > > new file mode 100644
> > > index 000000000000..d3c9f88b024b
> > > --- /dev/null
> > > +++ b/include/uapi/linux/sframe.h
> > > @@ -0,0 +1,12 @@
> > > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > > +#ifndef _UAPI_LINUX_SFRAME_H
> > > +#define _UAPI_LINUX_SFRAME_H
> > > +
> > > +struct sframe_setup {
> >
> > I'd add `u64 flags;` field for easier and nicer extensibility. Check
> > in the kernel that it is set to zero, future kernels will allow some
> > of the bits to be set.
>
> That sounds reasonable.
>
> >
> > And I still think that prctl() instead of a separate sframe-specific
> > syscall is the way to go. I see no reason for sframe-specific set of
> > syscalls just to set a bit of extra metadata for the entire process.
> > That seems to be the job of prctl().
>
> I personally do not have a preference. I've just heard a lot from
> others where they want to avoid extending an ioctl() like system call
> or even create a new multiplexer syscall.
>
> If we can get a consensus of using prctl() or adding a separate system
> call, I'll go with whatever that is.
prctl() is an already existing multiplexing syscall used to provide
some per-process (or sometimes per-thread, it seems) hints and
options. Please consider sending a prctl() extension and CC me, and
let's see what arguments people have against extending an already
existing syscall.
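
For reference, a minimal sketch of how an existing prctl() option pair is
used today. It relies only on PR_SET_NAME and PR_GET_NAME, which already
exist (and are per-thread); no sframe option is defined anywhere yet, so
none is shown. The helper name demo_set_and_get_name() is made up for
illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * Set the calling thread's name through prctl(), then read it back.
 * "out" must hold at least 16 bytes, the size PR_GET_NAME writes.
 * Returns 0 on success, -1 if either prctl() call fails.
 */
static int demo_set_and_get_name(const char *name, char *out, size_t len)
{
	assert(len >= 16);

	/* The first argument selects the operation; the rest are its operands. */
	if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0))
		return -1;
	if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0))
		return -1;
	return 0;
}
```

A new option would follow the same shape: one more operation number plus
its operands, rather than a new syscall entry.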
>
> >
> > > + __u64 sframe_start;
> > > + __u64 sframe_size;
> > > + __u64 text_start;
> > > + __u64 text_size;
> > > +};
> > > +
> >
> > [...]
> >
> > > +
> > > +/**
> > > + * sys_sframe_register - register an address for user space stacktrace walking.
> > > + * @data: Structure of sframe data used to register the sframe section
> > > + * @size: The size of the given structure.
> > > + *
> > > + * This system call is used by dynamic library utilities to inform the kernel
> > > + * of meta data that it loaded that can be used by the kernel to know how
> > > + * to stack walk the given text locations.
> > > + *
> > > + * Return: 0 if successful, otherwise a negative error.
> > > + */
> > > +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> > > +{
> > > + struct sframe_setup sframe;
> > > +
> > > + if (sizeof(sframe) != size)
> > > + return -EINVAL;
> >
> > This seems overly aggressive. It seems like the pattern is to allow
> > sizes both smaller and bigger:
> > - if user-provided size is smaller than what kernel knows about,
> > treat missing fields as zeroes
>
> Well, that could work with unregister, but for register that isn't
> quite useful, as all fields should be filled (well, if we add flags,
> that may not be 100% true).
>
This is a question of API design. If newly added fields are optional
by default, this works great. And even if you are adding some fields
that in the future will be mandatory (or it could be mandatory based
on flags), then it's super easy to error out if they are not set.
We've been doing this for years now in bpf() syscall and it works
pretty well overall, while also keeping user-space (libbpf, for
instance) side *much* simpler. I don't want to imagine bpf() syscall
which in each kernel version enforces a different size of bpf_attr
union...
> > - if user-provided size is bigger, then check that space after
> > fields that kernel recognizes are all zeroes.
>
> That is dangerous. A zero with greater size could mean something. If
> the size is greater than expected it should simply fail and let user
> space call it again with the older version.
>
Could, but it shouldn't if we extend API reasonably. And if it so
happens that zero will be meaningful, then you add a new flag that has
to be set if that field is present. This is a solved problem.
Requiring user space to use differently-sized structs for different
kernel versions is much-much worse.
> >
> > This allows extensibility without having to change user space code all
> > the time. Old code will provide smaller struct without new (presumably
> > optional) fields, while newer code can use newer and larger struct
> > size, but as long as it clears extra fields old kernel will be fine
> > with that.
>
> The old size will always work, thus old code will always continue to
> work. If we extend the system call, then it must handle both the older
> size as well as the newer size. User space would not need to change. It
> would only change if it wanted to use a new feature, and if it wants to
> work with older kernels it would need to try the bigger size first and
> if that fails, it knows the kernel doesn't support that new feature and
> then user space can figure out what to do. Either use the old system
> call or abort.
See above, many added features are typically optional (e.g., imagine
some extra bits of information that go along with currently existing
mandatory sframe data). And it's easy to code user space code that can
automatically and gracefully "downgrade" by detecting that kernel
doesn't support some feature and thus just not setting the field,
leaving it zero. But you won't have to track what should be the right
size of the struct which in your API headers is already larger because
you compiled something on newer kernel headers.
Believe me, this is the right way to go with this kind of extendable binary API.
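
A minimal user-space sketch of the size rule described above, under
stated assumptions: struct demo_sframe_setup (the four fields from the
patch plus the suggested flags field), demo_copy_struct() and
demo_sframe_register() are hypothetical names, not code from the series.
In the kernel, copy_struct_from_user() is the existing helper that
applies this rule to a user pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the patch's four fields plus the suggested flags. */
struct demo_sframe_setup {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
	uint64_t flags;
};

/*
 * Copy a caller struct of "usize" bytes into a "ksize"-byte struct.
 * Shorter input: the missing tail reads as zero.
 * Longer input: accepted only if every extra byte is zero.
 */
static int demo_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t copy = usize < ksize ? usize : ksize;

	assert(dst && src);

	if (usize > ksize) {
		const unsigned char *extra = (const unsigned char *)src + ksize;

		for (size_t i = 0; i < usize - ksize; i++)
			if (extra[i])
				return -E2BIG;
	}

	memset(dst, 0, ksize);
	memcpy(dst, src, copy);
	return 0;
}

/* Validate a registration request of any size. */
static int demo_sframe_register(const void *data, size_t size)
{
	struct demo_sframe_setup s;
	int ret = demo_copy_struct(&s, sizeof(s), data, size);

	if (ret)
		return ret;
	if (s.flags)		/* no flag bits are defined yet */
		return -EINVAL;
	if (!s.sframe_size || !s.text_size)	/* mandatory fields */
		return -EINVAL;
	return 0;
}
```

With this shape, a caller built against older headers passes a shorter
struct and still succeeds, while a caller built against newer headers
only fails when it actually sets something this version does not know.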
>
> -- Steve
>
> >
> > > +
> > > + if (copy_from_user(&sframe, data, size))
> > > + return -EFAULT;
> > > +
> > > + return sframe_add_section(sframe.sframe_start,
> > > + sframe.sframe_start + sframe.sframe_size,
> > > + sframe.text_start,
> > > + sframe.text_start + sframe.text_size);
> > > +}
> > > +
> >
> > [...]
>
^ permalink raw reply
* [PATCH 1/2] rtla/timerlat: Fix parsing of short options with attached arguments
From: John Kacur @ 2026-06-01 21:15 UTC (permalink / raw)
To: Steven Rostedt, Tomas Glozar, linux-trace-kernel
Cc: Costa Shulyupin, Wander Lairson Costa, Crystal Wood,
Luis Claudio R . Goncalves, linux-kernel
The timerlat hist command fails to parse short options with attached
numeric arguments (e.g., -p100) due to conflicts between digit characters
used as option values and numeric arguments to other options.
This issue was discovered when testing rtla 7.1.0-rc6 with rteval,
which passes arguments in the compact -p100 format. The rteval tests
failed with the confusing error "no-irq and no-thread set, there is
nothing to do here" even though neither option was specified.
The root cause is two-fold:
1. Digit characters ('0'-'9') were used as short option values for
long-only options like --no-irq, --no-thread, etc. This caused
getopt_auto() to generate an option string like 'a:b:...:u0123456:7:8:9'
which made getopt treat digits as valid option characters.
2. The two-phase option parsing approach (alternating calls between
common_parse_options() and local option parsing) confused getopt's
internal state when encountering arguments like -p100.
When a user passed -p100, getopt would incorrectly parse it as a run of
separate options (-p, -1, -0, and -0), silently setting the no_irq and
no_thread flags instead of recognizing "100" as the period argument.
The two-phase parsing was introduced in commit 850cd24cb6d6 ("tools/rtla:
Add common_parse_options()") which first appeared in v7.0-rc1. Prior to
that commit, -p100 worked correctly. The digit characters as option
values existed since the original timerlat implementation, but only
became problematic when combined with the two-phase parsing approach.
Fix this by:
1. Eliminating digit characters from the option string by filtering them
out in getopt_auto(). This prevents conflicts with numeric arguments.
2. Refactoring timerlat_hist_parse_args() to use single-pass option
parsing. Instead of alternating between common_parse_options() and
local parsing, merge all options (common and local) into a single
option table and parse them in one pass. This matches the approach
used by cyclictest and other tools.
With these changes, all argument formats work correctly:
-p 100 (short with space)
-p100 (short without space)
--period=100 (long with =)
--period 100 (long with space)
This maintains compatibility with existing usage while enabling the
compact -p100 format that users expect from similar tools.
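
A standalone sketch of the two ideas above, not the rtla code itself:
build the short-option string from a long-option table while skipping
digit values, then parse everything in a single getopt_long() pass. The
table demo_opts and the helpers build_optstring() and parse_period() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table: 'p' takes an argument, '0' marks a long-only flag. */
static const struct option demo_opts[] = {
	{"period", required_argument, 0, 'p'},
	{"no-irq", no_argument,       0, '0'},
	{0, 0, 0, 0}
};

/*
 * Build a short-option string from a long-option table, skipping values
 * outside the printable range and values that are digits.
 */
static int build_optstring(const struct option *long_opts, char *buf, size_t len)
{
	size_t n = 0;

	for (int i = 0; long_opts[i].name; i++) {
		int val = long_opts[i].val;

		if (val < 32 || val > 127)
			continue;
		if (val >= '0' && val <= '9')
			continue;
		if (n + 4 >= len)
			return -1;

		buf[n++] = (char)val;
		if (long_opts[i].has_arg == required_argument) {
			buf[n++] = ':';
		} else if (long_opts[i].has_arg == optional_argument) {
			buf[n++] = ':';
			buf[n++] = ':';
		}
	}
	buf[n] = '\0';
	return 0;
}

/* Parse argv in one pass; return the period, or -1 if absent or invalid. */
static long parse_period(int argc, char **argv)
{
	char optstring[64];
	long period = -1;
	int c;

	assert(argc >= 1);
	if (build_optstring(demo_opts, optstring, sizeof(optstring)))
		return -1;

	optind = 1;	/* restart scanning so the helper can be called again */
	opterr = 0;	/* keep getopt quiet; errors are reported via -1 */
	while ((c = getopt_long(argc, argv, optstring, demo_opts, NULL)) != -1) {
		switch (c) {
		case 'p':
			period = strtol(optarg, NULL, 10);
			break;
		case '0':
			/* Reachable only through --no-irq, never through -0. */
			break;
		default:
			return -1;
		}
	}
	return period;
}
```

Because no digit appears in the option string, the characters after -p
can only be read as the argument to -p.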
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
tools/tracing/rtla/src/common.c | 4 ++
tools/tracing/rtla/src/timerlat_hist.c | 55 ++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..c2fd051c562c 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -65,6 +65,10 @@ int getopt_auto(int argc, char **argv, const struct option *long_opts)
if (long_opts[i].val < 32 || long_opts[i].val > 127)
continue;
+ /* Skip digit characters to avoid conflicts with numeric arguments */
+ if (long_opts[i].val >= '0' && long_opts[i].val <= '9')
+ continue;
+
if (n + 4 >= sizeof(opts))
fatal("optstring buffer overflow");
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 79142af4f566..c0b6d7c30114 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -787,11 +787,24 @@ static struct common_params
static struct option long_options[] = {
{"auto", required_argument, 0, 'a'},
{"bucket-size", required_argument, 0, 'b'},
+ /* Common options */
+ {"cpus", required_argument, 0, 'c'},
+ {"cgroup", optional_argument, 0, 'C'},
+ {"debug", no_argument, 0, 'D'},
+ {"duration", required_argument, 0, 'd'},
+ {"event", required_argument, 0, 'e'},
+ /* End common options */
{"entries", required_argument, 0, 'E'},
{"help", no_argument, 0, 'h'},
+ /* Common option */
+ {"house-keeping", required_argument, 0, 'H'},
+ /* End common option */
{"irq", required_argument, 0, 'i'},
{"nano", no_argument, 0, 'n'},
{"period", required_argument, 0, 'p'},
+ /* Common option */
+ {"priority", required_argument, 0, 'P'},
+ /* End common option */
{"stack", required_argument, 0, 's'},
{"thread", required_argument, 0, 'T'},
{"trace", optional_argument, 0, 't'},
@@ -819,9 +832,6 @@ static struct common_params
{0, 0, 0, 0}
};
- if (common_parse_options(argc, argv, &params->common))
- continue;
-
c = getopt_auto(argc, argv, long_options);
/* detect the end of the options. */
@@ -850,6 +860,35 @@ static struct common_params
params->common.hist.bucket_size >= 1000000)
fatal("Bucket size needs to be > 0 and <= 1000000");
break;
+ case 'c':
+ if (parse_cpu_set(optarg, &params->common.monitored_cpus))
+ fatal("Invalid -c cpu list");
+ params->common.cpus = optarg;
+ break;
+ case 'C':
+ params->common.cgroup = 1;
+ params->common.cgroup_name = parse_optional_arg(argc, argv);
+ break;
+ case 'D':
+ config_debug = 1;
+ break;
+ case 'd':
+ params->common.duration = parse_seconds_duration(optarg);
+ if (!params->common.duration)
+ fatal("Invalid -d duration");
+ break;
+ case 'e':
+ {
+ struct trace_events *tevent;
+ tevent = trace_event_alloc(optarg);
+ if (!tevent)
+ fatal("Error alloc trace event");
+
+ if (params->common.events)
+ tevent->next = params->common.events;
+ params->common.events = tevent;
+ }
+ break;
case 'E':
params->common.hist.entries = get_llong_from_str(optarg);
if (params->common.hist.entries < 10 ||
@@ -860,6 +899,11 @@ static struct common_params
case '?':
timerlat_hist_usage();
break;
+ case 'H':
+ params->common.hk_cpus = 1;
+ if (parse_cpu_set(optarg, &params->common.hk_cpu_set))
+ fatal("Error parsing house keeping CPUs");
+ break;
case 'i':
params->common.stop_us = get_llong_from_str(optarg);
break;
@@ -874,6 +918,11 @@ static struct common_params
if (params->timerlat_period_us > 1000000)
fatal("Period longer than 1 s");
break;
+ case 'P':
+ if (parse_prio(optarg, &params->common.sched_param) == -1)
+ fatal("Invalid -P priority");
+ params->common.set_sched = 1;
+ break;
case 's':
params->print_stack = get_llong_from_str(optarg);
break;
--
2.54.0
^ permalink raw reply related
* [PATCH 2/2] rtla/timerlat: Add tests for option parsing with attached arguments
From: John Kacur @ 2026-06-01 21:15 UTC (permalink / raw)
To: Steven Rostedt, Tomas Glozar, linux-trace-kernel
Cc: Costa Shulyupin, Wander Lairson Costa, Crystal Wood,
Luis Claudio R . Goncalves, linux-kernel
In-Reply-To: <20260601211538.381649-1-jkacur@redhat.com>
Add tests to verify that numeric arguments work correctly with both
attached and detached formats:
-p 100 (short with space)
-p100 (short without space)
--period=100 (long with =)
--period 100 (long with space)
These tests prevent regression of the bug fixed in commit eefa8af46ff7
("rtla/timerlat: Fix parsing of short options with attached arguments")
where -p100 was incorrectly parsed as multiple separate options.
The tests verify that:
1. All four argument formats succeed (exit code 0)
2. None trigger the "no-irq and no-thread" error that occurred when
the bug was present
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
tools/tracing/rtla/tests/timerlat.t | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index fd4935fd7b49..1a63301f5d70 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -42,6 +42,16 @@ check "verify -c/--cpus" \
check "hist test in nanoseconds" \
"timerlat hist -i 2 -c 0 -n -d 10s" 2 "ns"
+# Option parsing tests - verify attached numeric arguments work correctly
+check "verify -p with space" \
+ "timerlat hist -p 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify -p without space (attached argument)" \
+ "timerlat hist -p100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with equals" \
+ "timerlat hist --period=100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with space" \
+ "timerlat hist --period 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+
# Actions tests
check "trace output through -t" \
"timerlat hist -T 2 -t" 2 "^ Saving trace to timerlat_trace.txt$"
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v7 09/42] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-06-01 23:14 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-9-2f0fae496530@google.com>
On Fri, May 22, 2026 at 05:17:51PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
>
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
>
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
>
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
>
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>
Typo on the "person".
(Sent this earlier but looks like some of my emails never hit the
list so re-sending. Apologies if this is a dupe).
Thanks,
Mike
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
^ permalink raw reply
* [PATCH] tracing/events: Expand ring buffer for in-kernel event enables
From: Manjunath Patil @ 2026-06-01 23:24 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel,
Manjunath Patil
Ftrace keeps trace arrays at a boot-minimum ring-buffer size until
tracing is used. Tracefs event-enable paths already call
tracing_update_buffers() before enabling events, but the exported
in-kernel helpers trace_set_clr_event() and trace_array_set_clr_event()
directly enable events through __ftrace_set_clr_event().
This can leave events enabled by in-kernel users recording into the tiny
boot-minimum buffer instead of the configured default-sized buffer. Any
caller that enables events through these exported helpers observes
different buffer-expansion behavior than a userspace tracefs event enable.
Expand the relevant trace array before enabling events through the
exported in-kernel helpers, matching the tracefs event-enable behavior.
Disabling events remains unchanged.
Assisted-by: Codex:gpt-5
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
---
kernel/trace/trace_events.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..3ce5b0121c5c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1479,10 +1479,22 @@ int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set)
int trace_set_clr_event(const char *system, const char *event, int set)
{
struct trace_array *tr = top_trace_array();
+ int ret;
if (!tr)
return -ENODEV;
+ /*
+ * Keep in-kernel event enabling consistent with tracefs event
+ * enabling: once an event is being enabled, expand the boot-minimum
+ * ring buffer to the configured default size before records arrive.
+ */
+ if (set) {
+ ret = tracing_update_buffers(tr);
+ if (ret < 0)
+ return ret;
+ }
+
return __ftrace_set_clr_event(tr, NULL, system, event, set, NULL);
}
EXPORT_SYMBOL_GPL(trace_set_clr_event);
@@ -1504,11 +1516,24 @@ int trace_array_set_clr_event(struct trace_array *tr, const char *system,
const char *event, bool enable)
{
int set;
+ int ret;
if (!tr)
return -ENOENT;
set = (enable == true) ? 1 : 0;
+
+ /*
+ * Keep in-kernel event enabling consistent with tracefs event
+ * enabling: once an event is being enabled, expand the boot-minimum
+ * ring buffer to the configured default size before records arrive.
+ */
+ if (set) {
+ ret = tracing_update_buffers(tr);
+ if (ret < 0)
+ return ret;
+ }
+
return __ftrace_set_clr_event(tr, NULL, system, event, set, NULL);
}
EXPORT_SYMBOL_GPL(trace_array_set_clr_event);
base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
--
2.47.3
^ permalink raw reply related
* Re: [PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02 0:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
Jiri Olsa
In-Reply-To: <20260601122126.5ebbd7e7@fedora>
On Mon, 1 Jun 2026 12:21:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sun, 31 May 2026 10:14:58 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > > Does this prematurely release the BTF struct reference?
> > > > If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> > > > later steps in traceprobe_parse_probe_arg_body() (like
> > > > find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> > > > When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> > > > will evaluate to ctx->btf (which is NULL for eprobes).
> > > > Could this potentially lead to silent defaults, such as 64-bit reads for
> > > > smaller fields, or fail to inject pointer dereferences for string fields,
> > > > while also leaving ctx->last_type pointing to a prematurely released BTF
> > > > object?
> > >
> > > Does this mean we need to set ctx->last_type to NULL here too?
> >
> > No, since the member we refer can be different from unsigned long.
> > When we don't have ":type" suffix, we use BTF type information to
> > decide appropriate type.
> >
> > >
> > > Because everything above is pretty much the expected behavior. The put is
> > > *not* premature. The last_struct and struct_btf are both set to NULL. I
> > > guess the only thing missing is to reset last_type as well.
> >
> > No, as I explained, the last_type is used to determine the member type
> > when user does not specify the ":type" suffix.
> >
> > So, what we need to do is deferring the btf_put(struct_btf) as below:
> > (no build test yet.)
>
> OK, but I don't think we want the struct_btf to exist beyond a single
> arg like the btf descriptor does. How about this (on top of this change),
> where it clears the struct_btf at the end of traceprobe_parse_probe_arg_body()?
>
> Also, I see the flag as being redundant and use the existence of
> struct_btf to denote that it's parsing a typedef struct.
Ah, indeed. OK, let me check v8 patch.
Thanks!
>
> -- Steve
>
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 9246e9c3d066..56b7dc406ca1 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -397,8 +397,7 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
>
> static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
> {
> - return ctx->flags & TPARG_FL_TYPECAST ?
> - ctx->struct_btf : ctx->btf;
> + return ctx->struct_btf ? : ctx->btf;
> }
>
> static int check_prepare_btf_string_fetch(char *typename,
> @@ -531,6 +530,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
> return 0;
> }
>
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + if (ctx->struct_btf) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> + ctx->last_struct = NULL;
> + }
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> if (ctx->btf) {
> @@ -579,7 +587,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> struct fetch_insn *code = *pcode;
> const struct btf_member *field;
> u32 bitoffs, anon_offs;
> - bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
> + bool is_struct = ctx->struct_btf != NULL;
> struct btf *btf = ctx_btf(ctx);
> char *next;
> int is_ptr;
> @@ -690,7 +698,7 @@ static int parse_btf_arg(char *varname,
> ret = parse_trace_event(varname, code, ctx);
> if (ret < 0)
> return ret;
> - if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
> + if (WARN_ON_ONCE(ctx->struct_btf == NULL))
> return -EINVAL;
> type = ctx->last_struct;
> goto found_type;
> @@ -804,21 +812,19 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
>
> static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> {
> + struct btf *btf = NULL;
> int id;
>
> - if (!ctx->struct_btf) {
> - struct btf *btf;
> -
> - id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> - if (id < 0)
> - return id;
> - ctx->struct_btf = btf;
> - } else {
> - id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
> - if (id < 0)
> - return id;
> + /* Could be for a structure in a different module */
> + if (ctx->struct_btf) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> }
>
> + id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> + if (id < 0)
> + return id;
> + ctx->struct_btf = btf;
> ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> return 0;
> }
> @@ -848,25 +854,23 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
>
> if (ret < 0) {
> trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> - ret = -EINVAL;
> - goto out_put;
> + return -EINVAL;
> }
>
> - ctx->flags |= TPARG_FL_TYPECAST;
> tmp++;
>
> ctx->offset += tmp - arg;
> ret = parse_btf_arg(tmp, pcode, end, ctx);
> - ctx->flags &= ~TPARG_FL_TYPECAST;
> - ctx->last_struct = NULL;
> -out_put:
> - btf_put(ctx->struct_btf);
> - ctx->struct_btf = NULL;
> return ret;
> }
>
> #else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
>
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + ctx->struct_btf = NULL;
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> ctx->btf = NULL;
> @@ -1673,6 +1677,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
> }
> kfree(tmp);
>
> + /* struct_btf should not be passed to other arguments */
> + clear_struct_btf(ctx);
> +
> return ret;
> }
>
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 952e3d7582b8..83565f1634db 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -394,7 +394,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
> * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
> * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
> * TPARG_FL_KERNEL.
> - * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
> */
> #define TPARG_FL_RETURN BIT(0)
> #define TPARG_FL_KERNEL BIT(1)
> @@ -403,7 +402,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
> #define TPARG_FL_USER BIT(4)
> #define TPARG_FL_FPROBE BIT(5)
> #define TPARG_FL_TPOINT BIT(6)
> -#define TPARG_FL_TYPECAST BIT(7)
> #define TPARG_FL_LOC_MASK GENMASK(4, 0)
>
> static inline bool tparg_is_function_entry(unsigned int flags)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02 0:06 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
Ian Rogers, Jiri Olsa
In-Reply-To: <20260601133129.4a1e9dec@gandalf.local.home>
On Mon, 1 Jun 2026 13:31:29 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 1 Jun 2026 13:07:46 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
> >
> > - Add error message in parse_btf_args() for failed parsing of TEVENT.
> > (Sashiko)
> >
> > - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> > The flag was redundant and added unnecessary complexity.
> >
> > - Restructure to keep the lifetime of the TYPECAST to the end of
> > traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> > around in case there's not a type parameter and then btf can still be
> > used.
> > (Sashiko and Masami Hiramatsu)
>
> And I rebased onto probes/for-next
>
Thanks, but it seems Sashiko failed to apply it (because it is using the
linux-trace/HEAD branch?). Hmm, we may always need a "base-id" tag.
Thanks,
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [RESEND][PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-02 0:25 UTC (permalink / raw)
To: LKML, Linux Trace Kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
From: Steven Rostedt <rostedt@goodmis.org>
Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.
Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.
But for event probes that record a field that happens to be a pointer to
a structure, these values cannot be dereferenced with BTF naming;
numerical offsets must be used instead.
For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:
(gdb) p &((struct sk_buff *)0)->dev
$1 = (struct net_device **) 0x10
(gdb) p &((struct net_device *)0)->name
$2 = (char (*)[16]) 0x118
And then use the raw numbers to dereference:
# echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
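
As a rough illustration of what the nested +OFFS(ADDR) form computes,
here is a user-space sketch. The structures demo_sk_buff and
demo_net_device are toy stand-ins with made-up layouts (the real
structures and their offsets differ), and demo_deref() and
demo_fetch_dev_name() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; the real sk_buff and net_device layouts differ. */
struct demo_net_device {
	long pad[4];
	char name[16];
};

struct demo_sk_buff {
	void *next;
	void *prev;
	struct demo_net_device *dev;
};

/* Equivalent of +OFFS(ADDR): read a pointer-sized value at addr + offs. */
static const void *demo_deref(const void *addr, size_t offs)
{
	const void *val;

	assert(addr != NULL);
	memcpy(&val, (const char *)addr + offs, sizeof(val));
	return val;
}

/*
 * Inner step: load skb->dev from its offset.
 * Outer step: add the offset of name; a string fetch reads from there.
 */
static const char *demo_fetch_dev_name(const void *skbaddr)
{
	const void *dev = demo_deref(skbaddr, offsetof(struct demo_sk_buff, dev));

	return (const char *)dev + offsetof(struct demo_net_device, name);
}
```

The offsetof() values play the role of the numbers gdb printed; with BTF
the kernel can look them up by member name instead.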
If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.
# echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
# echo 1 > events/eprobes/xmit/enable
# cat trace
[..]
sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
[ Resend with base-id below, maybe Sashiko will apply it to the correct tree! ]
base-id: 585abc02be3d3ab82fbcc4dbcbbf0ceb61a02129
Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
- Add error message in parse_btf_arg() for failed parsing of TEVENT.
(Sashiko)
- Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
The flag was redundant and added unnecessary complexity.
- Restructure to keep the lifetime of the TYPECAST to the end of
traceprobe_parse_probe_arg_body(). This allows the last_type to stay
around in case there's not a type parameter and then btf can still be
used.
(Sashiko and Masami Hiramatsu)
Documentation/trace/eprobetrace.rst | 4 +
kernel/trace/trace_probe.c | 173 +++++++++++++++++++++++-----
kernel/trace/trace_probe.h | 5 +-
3 files changed, 154 insertions(+), 28 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and "bitfield" are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then dereference the pointer defined by
+ ->MEMBER. Note that when this is used, the FIELD name does not
+ need to be prefixed with a '$'.
Types
-----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 695310571b08..fd1caa1f9723 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
return -ENOENT;
}
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+ struct traceprobe_parse_context *ctx)
+{
+ int ret;
+
+ if (code->data)
+ return -EFAULT;
+ ret = parse_trace_event_arg(arg, code, ctx);
+ if (!ret)
+ return 0;
+ if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+ code->op = FETCH_OP_COMM;
+ return 0;
+ }
+ return -EINVAL;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
&& BTF_INT_BITS(intdata) == 8;
}
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+ return ctx->struct_btf ? : ctx->btf;
+}
+
static int check_prepare_btf_string_fetch(char *typename,
struct fetch_insn **pcode,
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
if (!btf || !ctx->last_type)
return 0;
@@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
return 0;
}
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ if (ctx->struct_btf) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ ctx->last_struct = NULL;
+ }
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
if (ctx->btf) {
@@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct fetch_insn *code = *pcode;
const struct btf_member *field;
u32 bitoffs, anon_offs;
+ bool is_struct = ctx->struct_btf != NULL;
+ struct btf *btf = ctx_btf(ctx);
char *next;
int is_ptr;
s32 tid;
do {
- /* Outer loop for solving arrow operator ('->') */
- if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
- trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
- return -EINVAL;
- }
- /* Convert a struct pointer type to a struct type */
- type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
+ if (!is_struct) {
+ /* Outer loop for solving arrow operator ('->') */
+ if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ /* Convert a struct pointer type to a struct type */
+ type = btf_type_skip_modifiers(btf, type->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
}
+ /* Only the first type can skip being a pointer */
+ is_struct = false;
bitoffs = 0;
do {
@@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return is_ptr;
anon_offs = 0;
- field = btf_find_struct_member(ctx->btf, type, fieldname,
+ field = btf_find_struct_member(btf, type, fieldname,
&anon_offs);
if (IS_ERR(field)) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
ctx->last_bitsize = 0;
}
- type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
int i, is_ptr, ret;
u32 tid;
- if (WARN_ON_ONCE(!ctx->funcname))
+ if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
return -EINVAL;
is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (ctx->flags & TPARG_FL_TEVENT) {
+ ret = parse_trace_event(varname, code, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
+ return ret;
+ }
+ /* TEVENT is only here via a typecast */
+ if (WARN_ON_ONCE(ctx->struct_btf == NULL))
+ return -EINVAL;
+ type = ctx->last_struct;
+ goto found_type;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void */
@@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
found:
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
static const struct fetch_type *find_fetch_type_from_btf_type(
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
const char *typestr = NULL;
if (btf && ctx->last_type)
@@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
return 0;
}
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+ struct btf *btf = NULL;
+ int id;
+
+ /* A struct_btf should only be used by a single argument */
+ if (WARN_ON_ONCE(ctx->struct_btf)) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ }
+
+ id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+ if (id < 0)
+ return id;
+ ctx->struct_btf = btf;
+ ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+ return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ char *tmp;
+ int ret;
+
+ /* Currently this only works for eprobes */
+ if (!(ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+ return -EINVAL;
+ }
+
+ tmp = strchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ *tmp = '\0';
+ ret = query_btf_struct(arg + 1, ctx);
+ *tmp = ')';
+
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ tmp++;
+
+ ctx->offset += tmp - arg;
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
+ return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ ctx->struct_btf = NULL;
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
ctx->btf = NULL;
@@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
return 0;
}
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
@@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
int len;
if (ctx->flags & TPARG_FL_TEVENT) {
- if (code->data)
- return -EFAULT;
- ret = parse_trace_event_arg(arg, code, ctx);
- if (!ret)
- return 0;
- if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
- code->op = FETCH_OP_COMM;
- return 0;
- }
- goto inval;
+ if (parse_trace_event(arg, code, ctx) < 0)
+ goto inval;
+ return 0;
}
if (str_has_prefix(arg, "retval")) {
@@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
code->op = FETCH_OP_IMM;
}
break;
+ case '(':
+ ret = handle_typecast(arg, pcode, end, ctx);
+ break;
default:
if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
@@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
}
kfree(tmp);
+ /* struct_btf should not be passed to other arguments */
+ clear_struct_btf(ctx);
+
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 1076f1df347b..15758cc11fc6 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -422,7 +422,9 @@ struct traceprobe_parse_context {
const struct btf_param *params; /* Parameter of the function */
s32 nr_params; /* The number of the parameters */
struct btf *btf; /* The BTF to be used */
+ struct btf *struct_btf; /* The BTF to be used for structs */
const struct btf_type *last_type; /* Saved type */
+ const struct btf_type *last_struct; /* Saved structure */
u32 last_bitoffs; /* Saved bitoffs */
u32 last_bitsize; /* Saved bitsize */
struct trace_probe *tp;
@@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
- C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
+ C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
#undef C
#define C(a, b) TP_ERR_##a
--
2.53.0
* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lance Yang @ 2026-06-02 1:53 UTC (permalink / raw)
To: Lorenzo Stoakes, Alexander Gordeev
Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, aarcange,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
linux-s390, linux-next
In-Reply-To: <ah2z26OzPktchVeT@lucifer>
On 2026/6/2 01:08, Lorenzo Stoakes wrote:
> On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
>> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>>
>> Hi Andrew et al,
>>
>>> On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
>>>
>>>> The following series provides khugepaged with the capability to collapse
>>>> anonymous memory regions to mTHPs.
>>>
>>> Thanks, I've updated mm.git's mm-unstable branch to this version.
>>>
>>> It sounds like I might be dropping it soon, haven't started looking at
>>> that yet. But let's at least eyeball the latest version at this time.
>>>
>>> Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
>>> well, thanks. The AI checking made a few allegations:
>>
>> This series appears to cause hangs on s390 in linux-next.
>> The issue is not easily reproducible, so it is not yet confirmed.
>> Any ideas for a reliable reproducer that exercises the code path below?
>>
>> [ 2749.385719] sysrq: Show Blocked State
>> [ 2749.385730] task:khugepaged state:D stack:0 pid:209 tgid:209 ppid:2 task_flags:0x200040 flags:0x00000000
>> [ 2749.385735] Call Trace:
>> [ 2749.385736] [<0000017f63c8b226>] __schedule+0x316/0x890
>> [ 2749.385740] [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>> [ 2749.385743] [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>> [ 2749.385746] [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>> [ 2749.385749] [<0000017f63c90910>] down_write+0x70/0x80
>> [ 2749.385752] [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>> [ 2749.385755] [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>> [ 2749.385757] [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>> [ 2749.385760] [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>> [ 2749.385762] [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>> [ 2749.385765] [<0000017f63137cb6>] khugepaged+0x226/0x240
>> [ 2749.385768] [<0000017f62db3128>] kthread+0x148/0x170
>> [ 2749.385770] [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>> [ 2749.385772] [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>>
>> Thanks!
>
> Hi Alexander,
>
> Thanks for the report.
>
> It's a pity it's non-repro, I had Claude have a look at it and it couldn't find
> a definite issue with the code at v18, all the locks seem balanced internally.
>
> Things it highlighted FWIW:
>
> - Far more mmap_write_lock()'s being taken - the stack-based approach calls
> collapse_huge_page() multiple times per-PMD each of which entails an mmap read
> lock/unlock and mmap write lock.
>
> - anon_vma write lock held for a much longer period over partial collapse.
>
> So maybe these are triggering issues rather than being the cause of them per-se?
>
> If you happen to see it again could you give the output for:
>
> 'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
> get more details on it?
>
> Also the .config would be useful.
>
> I'm guessing you've also not enabled mTHP in any way on the system?
>
> Repro-wise you could also:
>
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>
> To get khugepaged going more aggressively:
>
> $ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done
>
> Then maybe some stress-ng like sudo stress-ng --vm 4 --vm-bytes 2G --vm-method
> all --timeout 5m (or maybe something more refined :)?
>
> Maybe some of this will help repro more reliably?
>
Cool!
Maybe also worth trying with CONFIG_DETECT_HUNG_TASK=y and
CONFIG_DETECT_HUNG_TASK_BLOCKER=y.
# detect after 10s in D state instead of default 120s
echo 10 > /proc/sys/kernel/hung_task_timeout_secs
# optional: check more often; 0 means same as timeout
echo 0 > /proc/sys/kernel/hung_task_check_interval_secs
With that enabled, the kernel should hopefully tell us which task likely
owns the rwsem. If it is writer-owned, I would expect that to be fairly
reliable.
Cheers, Lance
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-02 2:16 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ahOqzpzAua96HVkn@gourry-fedora-PF4VCD3F>
On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> On Thu, May 21, 2026 at 04:23:28PM +1000, Balbir Singh wrote:
> > On Sun, Feb 22, 2026 at 03:48:15AM -0500, Gregory Price wrote:
> > > Topic type: MM
> > >
> > > Presenter: Gregory Price <gourry@gourry.net>
> > >
> > > This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> > > managed by the buddy allocator but excluded from normal allocations.
> > >
> > > I present it with an end-to-end Compressed RAM service (mm/cram.c)
> > > that would otherwise not be possible (or would be considerably more
> > > difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
> > >
> >
> > Do we have updates/notes from the meeting?
> >
>
> I have been on leave since LSF, but I do have some notes posted:
>
> https://lore.kernel.org/linux-mm/af9i7dkNvGGxPHzu@gourry-fedora-PF4VCD3F/
> https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/
>
> I will be trying to post an updated set stripped down without the GFP
> flag as a first pass w/o RFC tags and no UAPI implications so that
> device folks can play with this upstream.
>
> I'm debating on whether to include OPS_MEMPOLICY in the initial version
> if only because it's not intuitive how it interacts with pagecache. That
> needs more time to bake.
>
It makes sense to look at it first and then decide whether to include it.
> > >
> > > page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
> >
> > Do we want to provide kernel level control over allocation of private
> > pages, I assumed that only user space applications? I would assume
> > node affinity would be the way to do so, unless we have multiple
> >
>
> alloc_pages_node() is the kernel interface
I was thinking we wouldn't need explicit flags and that allocations would
happen from user space using __GFP_THISNODE to the node or via a nodemask
based on nodes of interest. Is there a reason to add this flag, a system
might have more than one source of N_MEMORY_PRIVATE?
>
> > >
> > > /* Ok but I want to do something useful with it */
> > > static const struct node_private_ops ops = {
> > > .migrate_to = my_migrate_to,
> > > .folio_migrate = my_folio_migrate,
> > > .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> > > };
> > > node_private_set_ops(nid, &ops);
> > >
> >
> > Could you explain this further? Why does OPS_MIGRATION
> > and OPS_MEMPOLICY need to be set explicitly?
> >
>
> Both of these have been removed from the upcoming version, but in this
> RFC version i was testing OPS_MIGRATION as an explicit flag that meant
> "migrate.c can touch the folios" while OPS_MEMPOLICY meant "mempolicy.c
> can touch the folios".
>
> As it turns out, OPS_MIGRATION is not a useful filter, as it doesn't
> actually filter anything (anything using OPS_MIGRATION would also need
> its own filter flag, so better to just drop it and do per-server
> opt-ins).
>
Thanks,
Balbir
* Re: [RESEND][PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02 2:28 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
Ian Rogers, Jiri Olsa
In-Reply-To: <20260601202546.564e867b@gandalf.local.home>
On Mon, 1 Jun 2026 20:25:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add syntax to the parsing of eprobes to be able to typecast a trace event
> field that is a pointer to a structure.
>
> Currently, a dereference must be a number, where the user has to figure
> out manually the offset of a member of a structure that they want to
> dereference.
>
> But when an event probe records a field that happens to be a pointer to
> a structure, it cannot dereference that value with BTF naming and
> must use numerical offsets.
>
> For example, to find out what device a sk_buff is pointing to in the
> net_dev_xmit trace event, one must first use gdb to find the offsets of the
> members of the structures:
>
> (gdb) p &((struct sk_buff *)0)->dev
> $1 = (struct net_device **) 0x10
> (gdb) p &((struct net_device *)0)->name
> $2 = (char (*)[16]) 0x118
>
> And then use the raw numbers to dereference:
>
> # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
>
> If BTF is available in the kernel, skbaddr can instead be typecast to
> sk_buff and dereferenced with the normal member dereference logic.
>
> # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
> # echo 1 > events/eprobes/xmit/enable
> # cat trace
> [..]
> sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
>
> The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
>
> Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> to know what they are for.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>
> [ Resend with base-id below, maybe Sashiko will apply it to the correct tree! ]
Sashiko still failed to apply this... Not sure why.
https://sashiko.dev/#/message/20260601202546.564e867b%40gandalf.local.home
Maybe better to configure Sashiko via github or sashiko-ml?
https://github.com/sashiko-dev/sashiko/blob/main/MAINTAINERS_GUIDE.md
Anyway, at least for me, this looks good.
Thanks,
>
> base-id: 585abc02be3d3ab82fbcc4dbcbbf0ceb61a02129
>
> Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
>
> - Add error message in parse_btf_arg() for failed parsing of TEVENT.
> (Sashiko)
>
> - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> The flag was redundant and added unnecessary complexity.
>
> - Restructure to keep the lifetime of the TYPECAST to the end of
> traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> around in case there's not a type parameter and then btf can still be
> used.
> (Sashiko and Masami Hiramatsu)
>
> Documentation/trace/eprobetrace.rst | 4 +
> kernel/trace/trace_probe.c | 173 +++++++++++++++++++++++-----
> kernel/trace/trace_probe.h | 5 +-
> 3 files changed, 154 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
> index 89b5157cfab8..fe3602540569 100644
> --- a/Documentation/trace/eprobetrace.rst
> +++ b/Documentation/trace/eprobetrace.rst
> @@ -46,6 +46,10 @@ Synopsis of eprobe_events
> (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
> "string", "ustring", "symbol", "symstr" and "bitfield" are
> supported.
> + (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
> + a pointer to STRUCT and then dereference the pointer defined by
> + ->MEMBER. Note that when this is used, the FIELD name does not
> + need to be prefixed with a '$'.
>
> Types
> -----
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 695310571b08..fd1caa1f9723 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
> return -ENOENT;
> }
>
> +static int parse_trace_event(char *arg, struct fetch_insn *code,
> + struct traceprobe_parse_context *ctx)
> +{
> + int ret;
> +
> + if (code->data)
> + return -EFAULT;
> + ret = parse_trace_event_arg(arg, code, ctx);
> + if (!ret)
> + return 0;
> + if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
> + code->op = FETCH_OP_COMM;
> + return 0;
> + }
> + return -EINVAL;
> +}
> +
> #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
>
> static u32 btf_type_int(const struct btf_type *t)
> @@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
> && BTF_INT_BITS(intdata) == 8;
> }
>
> +static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
> +{
> + return ctx->struct_btf ? : ctx->btf;
> +}
> +
> static int check_prepare_btf_string_fetch(char *typename,
> struct fetch_insn **pcode,
> struct traceprobe_parse_context *ctx)
> {
> - struct btf *btf = ctx->btf;
> + struct btf *btf = ctx_btf(ctx);
>
> if (!btf || !ctx->last_type)
> return 0;
> @@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
> return 0;
> }
>
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + if (ctx->struct_btf) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> + ctx->last_struct = NULL;
> + }
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> if (ctx->btf) {
> @@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> struct fetch_insn *code = *pcode;
> const struct btf_member *field;
> u32 bitoffs, anon_offs;
> + bool is_struct = ctx->struct_btf != NULL;
> + struct btf *btf = ctx_btf(ctx);
> char *next;
> int is_ptr;
> s32 tid;
>
> do {
> - /* Outer loop for solving arrow operator ('->') */
> - if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
> - trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
> - return -EINVAL;
> - }
> - /* Convert a struct pointer type to a struct type */
> - type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
> - if (!type) {
> - trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> - return -EINVAL;
> + if (!is_struct) {
> + /* Outer loop for solving arrow operator ('->') */
> + if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
> + trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
> + return -EINVAL;
> + }
> +
> + /* Convert a struct pointer type to a struct type */
> + type = btf_type_skip_modifiers(btf, type->type, &tid);
> + if (!type) {
> + trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> + return -EINVAL;
> + }
> }
> + /* Only the first type can skip being a pointer */
> + is_struct = false;
>
> bitoffs = 0;
> do {
> @@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> return is_ptr;
>
> anon_offs = 0;
> - field = btf_find_struct_member(ctx->btf, type, fieldname,
> + field = btf_find_struct_member(btf, type, fieldname,
> &anon_offs);
> if (IS_ERR(field)) {
> trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> @@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> ctx->last_bitsize = 0;
> }
>
> - type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
> + type = btf_type_skip_modifiers(btf, field->type, &tid);
> if (!type) {
> trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> return -EINVAL;
> @@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
> int i, is_ptr, ret;
> u32 tid;
>
> - if (WARN_ON_ONCE(!ctx->funcname))
> + if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
> return -EINVAL;
>
> is_ptr = split_next_field(varname, &field, ctx);
> @@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
> return -EOPNOTSUPP;
> }
>
> + if (ctx->flags & TPARG_FL_TEVENT) {
> + ret = parse_trace_event(varname, code, ctx);
> + if (ret < 0) {
> + trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
> + return ret;
> + }
> + /* TEVENT is only here via a typecast */
> + if (WARN_ON_ONCE(ctx->struct_btf == NULL))
> + return -EINVAL;
> + type = ctx->last_struct;
> + goto found_type;
> + }
> +
> if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
> code->op = FETCH_OP_RETVAL;
> /* Check whether the function return type is not void */
> @@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
>
> found:
> type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
> +found_type:
> if (!type) {
> trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> return -EINVAL;
> @@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
> static const struct fetch_type *find_fetch_type_from_btf_type(
> struct traceprobe_parse_context *ctx)
> {
> - struct btf *btf = ctx->btf;
> + struct btf *btf = ctx_btf(ctx);
> const char *typestr = NULL;
>
> if (btf && ctx->last_type)
> @@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
> return 0;
> }
>
> -#else
> +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> +{
> + struct btf *btf = NULL;
> + int id;
> +
> + /* A struct_btf should only be used by a single argument */
> + if (WARN_ON_ONCE(ctx->struct_btf)) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> + }
> +
> + id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> + if (id < 0)
> + return id;
> + ctx->struct_btf = btf;
> + ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> + return 0;
> +}
> +
> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> + struct fetch_insn *end,
> + struct traceprobe_parse_context *ctx)
> +{
> + char *tmp;
> + int ret;
> +
> + /* Currently this only works for eprobes */
> + if (!(ctx->flags & TPARG_FL_TEVENT)) {
> + trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> + return -EINVAL;
> + }
> +
> + tmp = strchr(arg, ')');
> + if (!tmp) {
> + trace_probe_log_err(ctx->offset + strlen(arg),
> + DEREF_OPEN_BRACE);
> + return -EINVAL;
> + }
> + *tmp = '\0';
> + ret = query_btf_struct(arg + 1, ctx);
> + *tmp = ')';
> +
> + if (ret < 0) {
> + trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> + return -EINVAL;
> + }
> +
> + tmp++;
> +
> + ctx->offset += tmp - arg;
> + ret = parse_btf_arg(tmp, pcode, end, ctx);
> + return ret;
> +}
> +
> +#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
> +
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + ctx->struct_btf = NULL;
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> ctx->btf = NULL;
> @@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
> return 0;
> }
>
> -#endif
> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> + struct fetch_insn *end,
> + struct traceprobe_parse_context *ctx)
> +{
> + trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
> + return -EOPNOTSUPP;
> +}
> +
> +#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
>
> #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
>
> @@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
> int len;
>
> if (ctx->flags & TPARG_FL_TEVENT) {
> - if (code->data)
> - return -EFAULT;
> - ret = parse_trace_event_arg(arg, code, ctx);
> - if (!ret)
> - return 0;
> - if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
> - code->op = FETCH_OP_COMM;
> - return 0;
> - }
> - goto inval;
> + if (parse_trace_event(arg, code, ctx) < 0)
> + goto inval;
> + return 0;
> }
>
> if (str_has_prefix(arg, "retval")) {
> @@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
> code->op = FETCH_OP_IMM;
> }
> break;
> + case '(':
> + ret = handle_typecast(arg, pcode, end, ctx);
> + break;
> default:
> if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
> if (!tparg_is_function_entry(ctx->flags) &&
> @@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
> }
> kfree(tmp);
>
> + /* struct_btf should not be passed to other arguments */
> + clear_struct_btf(ctx);
> +
> return ret;
> }
>
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 1076f1df347b..15758cc11fc6 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -422,7 +422,9 @@ struct traceprobe_parse_context {
> const struct btf_param *params; /* Parameter of the function */
> s32 nr_params; /* The number of the parameters */
> struct btf *btf; /* The BTF to be used */
> + struct btf *struct_btf; /* The BTF to be used for structs */
> const struct btf_type *last_type; /* Saved type */
> + const struct btf_type *last_struct; /* Saved structure */
> u32 last_bitoffs; /* Saved bitoffs */
> u32 last_bitsize; /* Saved bitsize */
> struct trace_probe *tp;
> @@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
> C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
> C(TOO_MANY_ARGS, "Too many arguments are specified"), \
> C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
> - C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
> + C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
> + C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
>
> #undef C
> #define C(a, b) TP_ERR_##a
> --
> 2.53.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-06-02 3:08 UTC (permalink / raw)
To: David Hildenbrand (Arm), Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <e3d023f1-ab6e-4424-b304-55f1294480c3@kernel.org>
On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
> On 6/1/26 14:28, Miaohe Lin wrote:
>> On 2026/5/27 22:06, Breno Leitao wrote:
>>> get_any_page() collapses every HWPoisonHandlable() rejection into a
>>> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
>>> -> retry path. That is correct for the transient case (a userspace
>>> folio briefly off LRU during migration or compaction, which a later
>>> shake can drag back), but wrong for stable kernel-owned pages: slab,
>>> page-table, large-kmalloc and PG_reserved pages will never become
>>> HWPoisonHandlable(), so the retry loop is wasted work and the final
>>> -EIO loses the "this is structurally unrecoverable" information.
>>> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
>>> panic-on-unrecoverable sysctl deliberately does not act on.
>>>
>>> Introduce HWPoisonKernelOwned(), a small predicate that positively
>>> identifies pages the hwpoison handler cannot recover from:
>>>
>>> HWPoisonKernelOwned(p, flags) :=
>>> !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>>> (PageReserved(p) || PageSlab(p) ||
>>> PageTable(p) || PageLargeKmalloc(p))
>>>
>>> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
>>> same exception in HWPoisonHandlable(): soft-offline is allowed to
>>> migrate movable_ops pages even though they are not on the LRU, and
>>> we must not pre-empt that with an unrecoverable verdict.
>>>
>>> The list is intentionally not exhaustive. vmalloc and kernel-stack
>>> pages, for example, do not carry a page_type bit and would need a
>>> different oracle; they keep going through the existing retry path
>>> unchanged. This is the smallest set we can identify with certainty
>>> by page type.
>>>
>>> Wire the helper into the top of get_any_page() to short-circuit
>>> those pages before the retry loop runs. On a hit, drop the caller's
>>> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
>>> straight away. Pages outside the helper's positive list still take
>>> the existing retry path and return -EIO, leaving operator-visible
>>> behaviour for those cases unchanged.
>>>
>>> Extend the unhandlable-page pr_err() to fire for either errno and
>>> update the get_hwpoison_page() kerneldoc to document the new return.
>>>
>>> memory_failure() still folds every negative return into
>>> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>>> this patch on its own only changes the errno that soft_offline_page()
>>> can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
>>> through memory_failure() and reports MF_MSG_KERNEL for the
>>> unrecoverable cases, which is what the
>>> panic_on_unrecoverable_memory_failure sysctl observes.
>>
>> Thanks for your patch.
>>
>>>
>>> Suggested-by: David Hildenbrand <david@kernel.org>
>>> Suggested-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index f4d3e6e20e13..8f63bdfeff8f 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1325,6 +1325,28 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>> return PageLRU(page) || is_free_buddy_page(page);
>>> }
>>>
>>> +/*
>>> + * Positive identification of pages the hwpoison handler cannot recover.
>>> + * These page types are owned by kernel internals (no userspace mapping
>>> + * to unmap, no file mapping to invalidate, no migration target), so the
>>> + * shake_page() / retry loop in get_any_page() can never turn them into
>>> + * something HWPoisonHandlable() will accept. Short-circuit them to
>>> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
>>> + * spinning through retries that exit as a transient-looking -EIO.
>>> + *
>>> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
>>> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
>>> + * pages even though they are not on the LRU.
>>> + */
>>> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
>>> +{
>>> + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
>>> + return false;
>>> +
>>> + return PageReserved(page) || PageSlab(page) ||
>>
>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
>>
>>> + PageTable(page) || PageLargeKmalloc(page);
>>
>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
>> PageTable and PageLargeKmalloc without extra page refcnt?
>
> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
> PageLargeKmalloc).
Got it. Thanks.
> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
> allow checking it on compound pages.
It seems PageReserved is not intended to be set on compound pages. I see there is PF_NO_COMPOUND
in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
>
> For PageLargeKmalloc, we would want to check the head page, though. The page
> type is only stored for the head page.
Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
sets PageSlab on the head page, and __pagetable_ctor uses __folio_set_pgtable to set PageTable
on the folio.
>
> So maybe we want to lookup the compound head (if any) and perform the type
> checks against that?
Maybe we should, or we might miss some pages that could have been handled. And
if the compound head is required, should we hold an extra page refcnt to guard against
a possible folio split race?
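To illustrate what I mean by checking against the head, here is a rough
userspace model. Everything in it is hypothetical (struct mock_page and the
helper names are stand-ins, not kernel API), it leaves out the
MF_SOFT_OFFLINE / page_has_movable_ops() opt-out, and it says nothing about
the refcount / folio split question above:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct mock_page {
	struct mock_page *head;	/* NULL for a head or order-0 page */
	bool reserved;		/* per-page flag */
	bool slab;		/* page type, only recorded on the head */
	bool table;		/* page type, only recorded on the head */
	bool large_kmalloc;	/* page type, only recorded on the head */
};

/* Model of a compound head lookup: tail pages point at their head. */
static struct mock_page *mock_compound_head(struct mock_page *page)
{
	return page->head ? page->head : page;
}

/* Checks the page that was passed in, so a tail page is missed. */
static bool kernel_owned_naive(struct mock_page *page)
{
	return page->reserved || page->slab ||
	       page->table || page->large_kmalloc;
}

/* Checks the reserved flag on the page itself and the type on the head. */
static bool kernel_owned_head(struct mock_page *page)
{
	struct mock_page *head = mock_compound_head(page);

	return page->reserved || head->slab ||
	       head->table || head->large_kmalloc;
}
```

With a slab head and one of its tail pages, the naive variant reports the
tail as not kernel-owned while the head-based variant catches it.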
Thanks.
.
^ permalink raw reply
* Re: [PATCH v8 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Miaohe Lin @ 2026-06-02 3:31 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-3-9ea0cfa16bb0@debian.org>
On 2026/5/27 22:06, Breno Leitao wrote:
> The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
> for stable unhandlable kernel pages (PG_reserved, slab, page tables,
> large-kmalloc). memory_failure() still folds every negative return
> into MF_MSG_GET_HWPOISON, so callers that want to react to the
> unrecoverable cases (a panic option, smarter logging) cannot tell
> them apart from transient page-allocator races.
>
> Turn the post-call branch into a switch over the get_hwpoison_page()
> return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
> negative return to MF_MSG_GET_HWPOISON. case 0 keeps the existing
> free-buddy / kernel-high-order handling and case 1 falls through to
> the rest of memory_failure() unchanged.
>
> The MF_MSG_KERNEL label and tracepoint string are kept as
> "reserved kernel page" to avoid breaking userspace tools that match
> on those literals; the enum value still adequately tags the failure
> even though it now also covers slab, page tables and large-kmalloc
> pages.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.
^ permalink raw reply
* Re: [PATCH v8 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Miaohe Lin @ 2026-06-02 7:05 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-4-9ea0cfa16bb0@debian.org>
On 2026/5/27 22:06, Breno Leitao wrote:
> Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
> default) that triggers a kernel panic when memory_failure()
> encounters pages that cannot be recovered. This provides a clean
> crash with useful debug information rather than allowing silent
> data corruption or a delayed crash at an unrelated code path.
>
> Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
> result == MF_IGNORED panics. After the previous patch, MF_MSG_KERNEL
> covers PG_reserved pages and the kernel-owned pages promoted from
> get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
> large-kmalloc).
>
> All other action types are excluded:
>
> - MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
> transient refcount races with the page allocator (an in-flight buddy
> allocation has refcount 0 and is no longer on the buddy free list,
> briefly), and panicking on them would risk killing the box for what
> is actually a recoverable userspace page.
>
> - MF_MSG_UNKNOWN means identify_page_state() could not classify the
> page; that is precisely the wrong basis for a panic decision.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> mm/memory-failure.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 14c0a958638c..dcd53dbc6aec 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
>
> static int sysctl_enable_soft_offline __read_mostly = 1;
>
> +static int sysctl_panic_on_unrecoverable_mf __read_mostly;
> +
> atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> static bool hw_memory_failure __read_mostly = false;
> @@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> .extra2 = SYSCTL_ONE,
> + },
> + {
> + .procname = "panic_on_unrecoverable_memory_failure",
> + .data = &sysctl_panic_on_unrecoverable_mf,
> + .maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> }
> };
>
> @@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
> ++mf_stats->total;
> }
>
> +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> + enum mf_result result)
> +{
> + if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> + return false;
> +
> + return type == MF_MSG_KERNEL;
Would it be more straightforward to write as something like:
	if (!sysctl_panic_on_unrecoverable_mf)
		return false;

	return (type == MF_MSG_KERNEL && result == MF_IGNORED);
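This is purely a readability suggestion; the two forms should behave the
same. A minimal userspace sketch of both, with the enums and the sysctl
variable stubbed out (the enum lists below are trimmed stand-ins, not the
full kernel definitions, and the function names are made up for the
comparison):

```c
#include <stdbool.h>

/* Trimmed stand-ins for the kernel enums; illustrative only. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };
enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
};

static int sysctl_panic_on_unrecoverable_mf;

/* Form used in the patch. */
static bool panic_check_patch(enum mf_action_page_type type,
			      enum mf_result result)
{
	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
		return false;

	return type == MF_MSG_KERNEL;
}

/* Suggested form: sysctl gate first, then the eligibility condition. */
static bool panic_check_suggested(enum mf_action_page_type type,
				  enum mf_result result)
{
	if (!sysctl_panic_on_unrecoverable_mf)
		return false;

	return type == MF_MSG_KERNEL && result == MF_IGNORED;
}
```

Both return true only when the sysctl is set, the type is MF_MSG_KERNEL and
the result is MF_IGNORED.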
Thanks.
.
^ permalink raw reply
* Re: [PATCH v4 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-02 8:52 UTC (permalink / raw)
To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260601153840.124372-7-gmonaco@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> Hybrid Automata monitors hook into the DA implementation when doing
> da_monitor_reset(). This function is called both on initialisation and
> teardown, HA monitors try to cancel a timer only when it's initialised
> relying on the da_mon->monitoring flag. This flag could however be
> corrupted during initialisation. This happens for instance on per-task
> monitors that share the same storage with different type of monitors
> like LTL or in case of races during a previous teardown.
>
> Stop relying on the monitoring flag during initialisation, assume that
> can have any value, so use a separate da_reset_state() skiping timer
> cancellation.
> New monitors (e.g. new tasks) are always zero-initialised so it is safe
> to rely on the monitoring flag for those.
>
> Reported-by: Wen Yang <wen.yang@linux.dev>
> Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
> Suggested-by: Nam Cao <namcao@linutronix.de>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Reviewed-by: Wen Yang <wen.yang@linux.dev>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
^ permalink raw reply
* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02 8:55 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-7-2f0fae496530@google.com>
On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
nit: s/non-CoCo/CoCo ?
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
nit: Missing Co-Developed-by: ?
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> virt/kvm/guest_memfd.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 78e5435967341..adf57a3a1f5dd 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> int *max_order)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct inode *inode;
> struct folio *folio;
> int r = 0;
>
> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> + inode = file_inode(file);
> + filemap_invalidate_lock_shared(inode->i_mapping);
>
> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> if (IS_ERR(folio)) {
> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_mark_uptodate(folio);
> }
>
> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> + if (kvm_gmem_is_private_mem(inode, index))
Don't we need to make sure the entire folio is private ? Not just the
page at the index ?
if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
Suzuki
> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
> folio_unlock(folio);
>
> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_put(folio);
>
> out:
> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> + filemap_invalidate_unlock_shared(inode->i_mapping);
> return r;
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-02 8:57 UTC (permalink / raw)
To: Balbir Singh
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ah47NNhuiClgGCdn@parvat>
On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> >
> > I'm debating on whether to include OPS_MEMPOLICY in the initial version
> > if only because it's not intuitive how it interacts with pagecache. That
> > needs more time to bake.
> >
>
> It makes sense to look at it and then decide if it makes sense.
>
I am thinking I will ship without any OPS flags at all for now and then
have the introduction of ops as a separate series.
> > alloc_pages_node() is the kernel interface
>
> I was think we wouldn't need explicit flags and that allocations would
> happen from user space using __GFP_THISNODE to the node or via a nodemask
> based on nodes of interest. Is there a reason to add this flag, a system
> might have more than one source of N_MEMORY_PRIVATE?
>
There's a few things to unpack here. I discussed this many times on
list and at LSF, but to reiterate.
1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
not particularly useful. Additionally, from userland, it's not
something you can actually set.
   for node in possible_nodes:
       alloc_pages_node(private_node, __GFP_THISNODE)
In fact it's the opposite semantic of what we want.
THISNODE says: "Do not fall back to OTHER nodes".
The semantic we want is "Do not allow allocations from private
nodes UNLESS we specifically request" (__GFP_PRIVATE).
__GFP_THISNODE does not actually buy you anything here, AND it's
worse, in the scenario where a private node makes its way into the
preferred slot (via possible_nodes or some other nodemask), the
allocator cannot fall back to a node it can access.
__GFP_THISNODE cannot be overloaded to do anything useful here.
2) We're trying not to expose *ANY* userland APIs for this, at all.
The ultimate goal here should be one of two things:
   1) fd = open(/dev/xxx, ...);
      mem = mmap(fd, ...);
      mem[0] = 0xDEADBEEF; /* Fault device page into page table */

      In this case, the driver is responsible for doing the
      alloc_pages_node() call.

   or

   2) mem = mmap(NULL, ..., ANON);
      mbind(mem, ..., private_node);
      mem[0] = 0xDEADBEEF; /* Fault device page into page table */

      In this case, mempolicy.c is responsible for doing the
      alloc_pages_node() call via the _mpol() alloc variants.
Additional OPT flags (reclaim, compaction, whatever), would
(optionally) allow mm/ to operate on the device memory with, for
example, mmu_notifier callbacks to tell the device to invalidate
whatever it's caching about that page.
This would all be relatively transparent to userland; all userland
"knows" is that it's getting memory from a device (/dev/xxx) or a
node it's otherwise aware of hosting device memory somehow.
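To make the THISNODE vs PRIVATE difference from point 1 concrete, here is a
toy userspace model. It is not the page allocator: MODEL_GFP_THISNODE and
MODEL_GFP_PRIVATE are made-up stand-ins for __GFP_THISNODE and the proposed
__GFP_PRIVATE, and struct model_node / model_pick_node() exist only for this
illustration.

```c
#include <stdbool.h>

#define MODEL_GFP_THISNODE	0x1u	/* stand-in: no fallback past the first node */
#define MODEL_GFP_PRIVATE	0x2u	/* stand-in: private nodes are allowed */
#define MODEL_NO_NODE		(-1)

struct model_node {
	bool is_private;	/* node hosts private (device) memory */
	bool has_free;		/* node can satisfy the request */
};

/*
 * Walk the preference order and return the first usable node id, or
 * MODEL_NO_NODE. Private nodes are skipped unless explicitly requested;
 * THISNODE stops the walk after the preferred node.
 */
static int model_pick_node(const struct model_node *nodes, const int *order,
			   int nr, unsigned int gfp)
{
	for (int i = 0; i < nr; i++) {
		const struct model_node *node = &nodes[order[i]];
		bool allowed = !node->is_private || (gfp & MODEL_GFP_PRIVATE);

		if (allowed && node->has_free)
			return order[i];
		if (gfp & MODEL_GFP_THISNODE)
			break;
	}
	return MODEL_NO_NODE;
}
```

In this model, a private node sitting in the preferred slot is skipped and
the request falls back to a normal node, unless THISNODE is set, in which
case the request simply fails. Only the explicit private flag ever yields
the private node.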
~Gregory
^ permalink raw reply
* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02 9:10 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>
On 02/06/2026 09:55, Suzuki K Poulose wrote:
> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
>
> nit: s/non-CoCo/CoCo ?
>
>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn()
>> on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>
> nit: Missing Co-Developed-by: ?
>
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> virt/kvm/guest_memfd.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 78e5435967341..adf57a3a1f5dd 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> int *max_order)
>> {
>> pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> + struct inode *inode;
>> struct folio *folio;
>> int r = 0;
>> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> if (!file)
>> return -EFAULT;
>> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> + inode = file_inode(file);
>> + filemap_invalidate_lock_shared(inode->i_mapping);
>> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>> if (IS_ERR(folio)) {
>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> folio_mark_uptodate(folio);
>> }
>> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> + if (kvm_gmem_is_private_mem(inode, index))
>
> Don't we need to make sure the entire folio is private ? Not just the
> page at the index ?
> if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
Or rather, should we go through the individual pages and apply the
preparation only to the ones that are private ?
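Roughly what I have in mind, as a userspace model only. struct model_folio
and model_prepare_private_pages() are hypothetical; the real
kvm_gmem_prepare_folio() operates on the folio, and whether preparation can
be done at this granularity is exactly the open question:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a folio with per-page attributes. */
struct model_folio {
	size_t nr_pages;
	const bool *is_private;	/* per-page private/shared attribute */
	bool *prepared;		/* per-page "already prepared" state */
};

/*
 * Prepare only the private pages that are not prepared yet and
 * return how many pages were prepared by this call.
 */
static size_t model_prepare_private_pages(struct model_folio *folio)
{
	size_t count = 0;

	for (size_t i = 0; i < folio->nr_pages; i++) {
		if (!folio->is_private[i] || folio->prepared[i])
			continue;
		folio->prepared[i] = true;
		count++;
	}
	return count;
}
```

Shared pages in a mixed folio are left untouched, and calling it a second
time prepares nothing further.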
Suzuki
>
> Suzuki
>
>> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> folio_unlock(folio);
>> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> folio_put(folio);
>> out:
>> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> + filemap_invalidate_unlock_shared(inode->i_mapping);
>> return r;
>> }
>> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
>
^ permalink raw reply