Linux Trace Kernel


* Re: (subset) [PATCH v3 00/28] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Chuck Lever @ 2026-05-18 16:05 UTC (permalink / raw)
  To: Christian Brauner, Jeff Layton, Chuck Lever
  Cc: Alexander Viro, Jan Kara, Alexander Aring, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Trond Myklebust, Anna Schumaker, Amir Goldstein, Calum Mackay,
	linux-fsdevel, linux-kernel, linux-trace-kernel, linux-doc,
	linux-nfs
In-Reply-To: <20260515-weltschmerz-folgen-68ca0db1ef84@brauner>



On Fri, May 15, 2026, at 1:26 PM, Christian Brauner wrote:
> On Tue, 28 Apr 2026 08:09:44 +0100, Jeff Layton wrote:
>> Re-posting the set per Christian's request. The only difference in this
>> version is a small error handling fix in alloc_init_dir_deleg(). The old
>> version could crash since release_pages() can't handle an array with
>> NULL pointers in it.
>> 
>> ---------------------------------8<------------------------------------
>> 
>> [...]
>
> @Chuck, @Jeff, I've only merged the vfs-specific changes into a stable branch.
> You can pull it; I won't touch it again. You can pull in the nfsd work in
> whatever form you like. It's the same procedure I use with io_uring et al.
>
> Let me know if that works for you.
>
> ---
>
> Applied to the vfs-7.2.directory.delegations branch of the vfs/vfs.git 
> tree.
> Patches in the vfs-7.2.directory.delegations branch should appear in 
> linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs-7.2.directory.delegations
>
> [01/28] filelock: pass current blocking lease to 
> trace_break_lease_block() rather than "new_fl"
>         https://git.kernel.org/vfs/vfs/c/89330d3a60f7
> [02/28] filelock: add support for ignoring deleg breaks for dir change 
> events
>         https://git.kernel.org/vfs/vfs/c/24cbf43337f4
> [03/28] filelock: add a tracepoint to start of break_lease()
>         https://git.kernel.org/vfs/vfs/c/e39026a86b48
> [04/28] filelock: add an inode_lease_ignore_mask helper
>         https://git.kernel.org/vfs/vfs/c/95825fdcc0b0
> [05/28] fsnotify: new tracepoint in fsnotify()
>         https://git.kernel.org/vfs/vfs/c/ad4489dcd08d
> [06/28] fsnotify: add fsnotify_modify_mark_mask()
>         https://git.kernel.org/vfs/vfs/c/12ffbb117b64
> [07/28] fsnotify: add FSNOTIFY_EVENT_RENAME data type
>         https://git.kernel.org/vfs/vfs/c/010043003c0c

Looks good.

To make the NFSD pieces apply, I need v7.1-rc4 and
vfs-7.2.directory.delegations merged into vfs.all. Given your
regular merge cadence over the past few weeks, I expect that
will happen end of this week? Early next?


-- 
Chuck Lever


* Re: [PATCH v2 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Mark Brown @ 2026-05-18 15:56 UTC (permalink / raw)
  To: Praveen Talari
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-arm-msm, linux-spi, mukesh.savaliya,
	aniket.randive, chandana.chiluveru, jyothi.seerapu
In-Reply-To: <3b415d4c-7d09-4ddf-847b-b5a3d94aa5e3@oss.qualcomm.com>

On Mon, May 18, 2026 at 09:22:46PM +0530, Praveen Talari wrote:

> Can I use the name below, or do you have any suggestions?

> +TRACE_EVENT(geni_spi_setup_params

Fine by me.



* Re: [PATCH v2 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-18 15:52 UTC (permalink / raw)
  To: Mark Brown
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-arm-msm, linux-spi, mukesh.savaliya,
	aniket.randive, chandana.chiluveru, jyothi.seerapu
In-Reply-To: <a713082f-e84e-403a-af1d-c6fa0c5d8613@sirena.org.uk>

Hi Mark,

On 18-05-2026 19:37, Mark Brown wrote:
> On Tue, May 12, 2026 at 11:42:52AM +0530, Praveen Talari wrote:
>
>> +TRACE_EVENT(geni_spi_fifo_params,
>> +	    TP_PROTO(struct device *dev, u8 cs, u32 mode,
>> +		     u32 mode_changed, bool cs_changed),
>> +	    TP_ARGS(dev, cs, mode, mode_changed, cs_changed),
>> +
>> +	    TP_STRUCT__entry(__string(name, dev_name(dev))
>> +			     __field(u8, cs)
>> +			     __field(u32, mode)
>> +			     __field(u32, mode_changed)
>> +			     __field(bool, cs_changed)
> These don't really seem like FIFO parameters?  I see that's the name of
> the function where we log this but they're more just generic bus status
> things.
I agree with you.

Can I use the name below, or do you have any suggestions?

+TRACE_EVENT(geni_spi_setup_params

Thanks,

Praveen Talari



* Re: [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Wen Yang @ 2026-05-18 15:40 UTC (permalink / raw)
  To: Gabriele Monaco
  Cc: Nam Cao, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	linux-trace-kernel
In-Reply-To: <27f7000d27f32ff74f50208779ef26d5566d06f5.camel@redhat.com>



On 5/18/26 14:36, Gabriele Monaco wrote:
> On Sun, 2026-05-17 at 17:52 +0800, Wen Yang wrote:
>>
>> One gap: tools/verification/rvgen/rvgen/templates/dot2k/main.c uses
>> RV_MON_%%MONITOR_TYPE%% but generates no deallocation code, so generated
>> monitors may fail to build with a -Wunused-function warning.
>>
> 
> Thanks for the review!
> 
> That's technically the purpose of this patch: we don't know exactly how the
> per-obj monitor is going to deallocate, so we make sure the build fails if it
> doesn't set up a way.
> 
> This combined with the fact per-obj monitors aren't really documented (yet),
> makes it quite confusing, doesn't it?
> 
> Would you prefer we always generate a dummy hook calling da_destroy_storage()
> and let the user decide what to do with it without forcing (obscure) compiler
> warnings?
> 

Hi Gabriele,

Thanks for the patch and the discussion.

I wonder if generating a dummy hook that calls da_destroy_storage by 
default is the best way to go.
Given that da_destroy_storage internally uses kfree_rcu(), it might 
introduce unnecessary memory allocation/free overhead and could even 
affect RCU grace periods -- especially for monitors that never actually 
need to release objects.

Perhaps a gentler approach for rvgen would be to generate a commented 
example in the template, showing how to use da_skip_deallocation to 
silence the warning.
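
As a rough illustration of what such a commented template example might look like, here is a minimal user-space sketch. Only the da_skip_deallocation() definition is taken from the patch; da_destroy_storage() is a hypothetical stand-in with a simplified signature, and example_monitor_init() and destroyed_count are made-up names. Which symbol a real monitor would pass to the macro is also an assumption.

```c
#include <stdio.h>

/*
 * Same definition as in the patch: naming the hook in an expression counts
 * as a reference, so the compiler no longer reports it as unused.
 */
#define da_skip_deallocation(hook) ((void)hook)

/* Hypothetical counter, only here so the effect can be observed. */
int destroyed_count;

/*
 * Hypothetical stand-in for the real da_destroy_storage(); static and
 * non-inline, so -Wunused-function would fire if nothing referenced it.
 */
static void da_destroy_storage(int id)
{
	(void)id;
	destroyed_count++;
}

/*
 * Hypothetical monitor init: this monitor never frees per-object storage,
 * so it opts out explicitly instead of wiring up a deallocation hook.
 */
int example_monitor_init(void)
{
	da_skip_deallocation(da_destroy_storage);
	return 0;
}
```

The macro only references the function and never calls it, so no kfree_rcu()-style work would be queued for monitors that opt out.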

--
Best wishes,
Wen


> 
>>
>> --
>> Best wishes,
>> Wen
>>
>> On 5/12/26 22:02, Gabriele Monaco wrote:
>>> The per-object monitors use a hash table and dynamic allocation of the
>>> monitor storage. Functions to clean up a monitor that is no longer needed
>>> are provided, but nothing ensures the monitor actually uses them.
>>>
>>> Remove the inline specifier on the deallocation function to let the
>>> compiler warn in case it isn't referenced. If the monitor really doesn't
>>> need one (for instance because instances will never cease to exist
>>> before disabling the monitor), the da_skip_deallocation() helper macro
>>> can be used to silence the warning.
>>>
>>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
>>> ---
>>>    include/rv/da_monitor.h                      | 14 +++++++++++++-
>>>    kernel/trace/rv/monitors/deadline/deadline.h |  5 ++++-
>>>    2 files changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
>>> index 402d3b935c08..378d23ab7dfb 100644
>>> --- a/include/rv/da_monitor.h
>>> +++ b/include/rv/da_monitor.h
>>> @@ -489,8 +489,11 @@ static inline monitor_target
>>> da_get_target_by_id(da_id_type id)
>>>     * locks.
>>>     * This function includes an RCU read-side critical section to synchronise
>>>     * against da_monitor_destroy().
>>> + * NOTE: inline is omitted on purpose to let the compiler warn if this
>>> function
>>> + * is never referenced. For monitors that don't require a deallocation
>>> hook,
>>> + * da_skip_deallocation() can be used.
>>>     */
>>> -static inline void da_destroy_storage(da_id_type id)
>>> +static void da_destroy_storage(da_id_type id)
>>>    {
>>>    	struct da_monitor_storage *mon_storage;
>>>    
>>> @@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
>>>    	kfree_rcu(mon_storage, rcu);
>>>    }
>>>    
>>> +/*
>>> + * da_skip_deallocation - explicitly mark a deallocation function as not
>>> required
>>> + *
>>> + * Only use when you are absolutely sure the monitor doesn't require a
>>> + * deallocation hook (i.e. it's not possible for an object to finish
>>> existing
>>> + * when the monitor is still running).
>>> + */
>>> +#define da_skip_deallocation(hook) ((void)hook)
>>> +
>>>    static void da_monitor_reset_all(void)
>>>    {
>>>    	struct da_monitor_storage *mon_storage;
>>> diff --git a/kernel/trace/rv/monitors/deadline/deadline.h
>>> b/kernel/trace/rv/monitors/deadline/deadline.h
>>> index 78fca873d61e..c39fd79148c2 100644
>>> --- a/kernel/trace/rv/monitors/deadline/deadline.h
>>> +++ b/kernel/trace/rv/monitors/deadline/deadline.h
>>> @@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data,
>>> struct task_struct *task,
>>>    		da_create_storage(EXPAND_ID_TASK(task), NULL);
>>>    }
>>>    
>>> -static void __maybe_unused handle_exit(void *data, struct task_struct *p,
>>> bool group_dead)
>>> +/*
>>> + * Deallocation hook, use da_skip_deallocation() when not necessary
>>> + */
>>> +static void handle_exit(void *data, struct task_struct *p, bool group_dead)
>>>    {
>>>    	if (p->policy == SCHED_DEADLINE)
>>>    		da_destroy_storage(get_entity_id(&p->dl, DL_TASK,
>>> DL_TASK));
> 


* [PATCH bpf-next v2 3/3] selftests/bpf: Add test for tracepoint btf_ids tracefs file
From: Mykyta Yatsenko @ 2026-05-18 15:23 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	rostedt
  Cc: Mykyta Yatsenko, linux-trace-kernel
In-Reply-To: <20260518-generic_tracepoint-v2-0-b755a5cf67bb@meta.com>

From: Mykyta Yatsenko <yatsenko@meta.com>

Read events/bpf_testmod/bpf_testmod_test_read/btf_ids and verify the
exported FUNC_PROTO matches the testmod tracepoint signature
(__data, struct task_struct *task, struct bpf_testmod_test_read_ctx
*ctx) and the record struct trace_event_raw_bpf_testmod_test_read
carries the fields declared by TP_STRUCT__entry.

Use the testmod tracepoint so the test exercises the module/split-BTF
path (btf_relocate_id) rather than vmlinux only, and falls back from
/sys/kernel/tracing to /sys/kernel/debug/tracing when tracefs is not
mounted at the new location.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 .../testing/selftests/bpf/prog_tests/tp_btf_ids.c  | 132 +++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/tp_btf_ids.c b/tools/testing/selftests/bpf/prog_tests/tp_btf_ids.c
new file mode 100644
index 000000000000..c0e7e11e71b8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tp_btf_ids.c
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <bpf/btf.h>
+
+#define TRACEFS		"/sys/kernel/tracing"
+#define DEBUGFS_TRACING	"/sys/kernel/debug/tracing"
+#define EVENT_SUBPATH	"events/bpf_testmod/bpf_testmod_test_read/btf_ids"
+
+struct btf_ids_info {
+	__u32 obj_id;
+	__u32 raw_id;
+	__u32 tp_id;
+};
+
+static const char *btf_ids_path(char *buf, size_t sz)
+{
+	if (access(TRACEFS "/trace", F_OK) == 0)
+		snprintf(buf, sz, "%s/%s", TRACEFS, EVENT_SUBPATH);
+	else
+		snprintf(buf, sz, "%s/%s", DEBUGFS_TRACING, EVENT_SUBPATH);
+	return buf;
+}
+
+static int read_btf_ids(struct btf_ids_info *info)
+{
+	char path[256], buf[256];
+	int fd, n;
+
+	fd = open(btf_ids_path(path, sizeof(path)), O_RDONLY);
+	if (fd < 0)
+		return -errno;
+
+	n = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (n <= 0)
+		return -EIO;
+	buf[n] = '\0';
+
+	if (sscanf(buf,
+		   "btf_obj_id: %u\nraw_btf_id: %u\ntp_btf_id: %u\n",
+		   &info->obj_id, &info->raw_id, &info->tp_id) != 3)
+		return -EINVAL;
+	return 0;
+}
+
+static const char *param_name(struct btf *btf, const struct btf_param *p)
+{
+	return btf__name_by_offset(btf, p->name_off);
+}
+
+static const char *member_name(struct btf *btf, const struct btf_member *m)
+{
+	return btf__name_by_offset(btf, m->name_off);
+}
+
+void test_tp_btf_ids(void)
+{
+	const struct btf_type *proto_t, *rec_t;
+	const struct btf_param *params;
+	const struct btf_member *members;
+	struct btf_ids_info info;
+	struct btf *vmlinux_btf, *btf;
+	const char *name;
+	int err;
+
+	if (!env.has_testmod) {
+		test__skip();
+		return;
+	}
+
+	err = read_btf_ids(&info);
+	if (!ASSERT_OK(err, "read btf_ids"))
+		return;
+
+	ASSERT_GT(info.obj_id, 0, "obj_id non-zero");
+	ASSERT_GT(info.raw_id, 0, "raw_id non-zero");
+	ASSERT_GT(info.tp_id, 0, "tp_id non-zero");
+
+	vmlinux_btf = btf__load_vmlinux_btf();
+	if (!ASSERT_OK_PTR(vmlinux_btf, "load vmlinux BTF"))
+		return;
+
+	/* Module BTF is split BTF; load with vmlinux as base. */
+	btf = btf__load_from_kernel_by_id_split(info.obj_id, vmlinux_btf);
+	if (!ASSERT_OK_PTR(btf, "load module BTF")) {
+		btf__free(vmlinux_btf);
+		return;
+	}
+
+	/*
+	 * raw_btf_id should be the FUNC_PROTO of __bpf_trace_<call>:
+	 *   void *__data, struct task_struct *task,
+	 *   struct bpf_testmod_test_read_ctx *ctx
+	 */
+	proto_t = btf__type_by_id(btf, info.raw_id);
+	if (!ASSERT_OK_PTR(proto_t, "raw type_by_id"))
+		goto out;
+	if (!ASSERT_TRUE(btf_is_func_proto(proto_t), "raw is FUNC_PROTO"))
+		goto out;
+	if (!ASSERT_EQ(btf_vlen(proto_t), 3, "func_proto arg count"))
+		goto out;
+
+	params = btf_params(proto_t);
+	ASSERT_STREQ(param_name(btf, &params[0]), "__data", "arg0 name");
+	ASSERT_STREQ(param_name(btf, &params[1]), "task", "arg1 name");
+	ASSERT_STREQ(param_name(btf, &params[2]), "ctx", "arg2 name");
+
+	/*
+	 * tp_btf_id should be STRUCT trace_event_raw_<call> with the
+	 * fields declared by TP_STRUCT__entry plus the common header.
+	 */
+	rec_t = btf__type_by_id(btf, info.tp_id);
+	if (!ASSERT_OK_PTR(rec_t, "tp type_by_id"))
+		goto out;
+	if (!ASSERT_TRUE(btf_is_struct(rec_t), "tp is STRUCT"))
+		goto out;
+	name = btf__name_by_offset(btf, rec_t->name_off);
+	ASSERT_STREQ(name, "trace_event_raw_bpf_testmod_test_read",
+		     "tp struct name");
+	if (!ASSERT_GE(btf_vlen(rec_t), 5, "tp struct field count"))
+		goto out;
+
+	members = btf_members(rec_t);
+	ASSERT_STREQ(member_name(btf, &members[0]), "ent", "field0 name");
+	ASSERT_STREQ(member_name(btf, &members[1]), "pid", "field1 name");
+	ASSERT_STREQ(member_name(btf, &members[2]), "comm", "field2 name");
+	ASSERT_STREQ(member_name(btf, &members[3]), "off", "field3 name");
+	ASSERT_STREQ(member_name(btf, &members[4]), "len", "field4 name");
+out:
+	btf__free(btf);
+	btf__free(vmlinux_btf);
+}

-- 
2.53.0-Meta



* [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Mykyta Yatsenko @ 2026-05-18 15:23 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	rostedt
  Cc: Mykyta Yatsenko, linux-trace-kernel
In-Reply-To: <20260518-generic_tracepoint-v2-0-b755a5cf67bb@meta.com>

From: Mykyta Yatsenko <yatsenko@meta.com>

Add events/<sys>/<event>/btf_ids, a per-template file that exposes
the BTF ids resolve_btfids fills in for each tracepoint:

  btf_obj_id  BTF object owning the ids below
  raw_btf_id  FUNC_PROTO of __bpf_trace_<call> (named args), consumed
              by raw_tp / tp_btf BPF programs
  tp_btf_id   trace_event_raw_<call> ring-buffer record, consumed by
              classic BPF_PROG_TYPE_TRACEPOINT programs

DECLARE_EVENT_CLASS now emits a 2-entry BTF_ID_LIST (FUNC __bpf_trace_*
and STRUCT trace_event_raw_*) and stores the pointer in
trace_event_class.

Per-syscall events under syscalls/ share the handcrafted classes
event_class_syscall_{enter,exit} instead of going through
DECLARE_EVENT_CLASS. Wire those classes to the BTF id lists
generated for sys_enter / sys_exit so all ~700 per-syscall
events expose the shared dispatcher prototype and record.
The per-syscall events do not own their own tracepoint
(they share sys_enter/sys_exit), so raw_btf_id is reported as 0
on those events; the meaningful raw_btf_id is exposed on
raw_syscalls/sys_{enter,exit}/btf_ids where raw_tp / tp_btf
programs can actually attach.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/trace_events.h  |  9 +++++
 include/trace/trace_events.h  | 24 +++++++++++++
 kernel/trace/trace_events.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_syscalls.c | 17 +++++++++
 4 files changed, 129 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d49338c44014..3d55b3cc014a 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -298,6 +298,15 @@ struct trace_event_class {
 	struct list_head	*(*get_fields)(struct trace_event_call *);
 	struct list_head	fields;
 	int			(*raw_init)(struct trace_event_call *);
+#ifdef CONFIG_BPF_EVENTS
+	/*
+	 * Per-template BTF ids set by DECLARE_EVENT_CLASS via BTF_ID() and
+	 * patched by resolve_btfids at link time. NULL for handcrafted classes.
+	 *   [0] FUNC   __bpf_trace_<template>
+	 *   [1] STRUCT trace_event_raw_<template>
+	 */
+	const u32		*btf_ids;
+#endif
 };
 
 extern int trace_event_reg(struct trace_event_call *event,
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index fbc07d353be6..09ad57ac4b73 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -19,6 +19,7 @@
  */
 
 #include <linux/trace_events.h>
+#include <linux/btf_ids.h>
 
 #ifndef TRACE_SYSTEM_VAR
 #define TRACE_SYSTEM_VAR TRACE_SYSTEM
@@ -397,6 +398,27 @@ static inline notrace int trace_event_get_offsets_##call(		\
 #define _TRACE_PERF_INIT(call)
 #endif /* CONFIG_PERF_EVENTS */
 
+#ifdef CONFIG_BPF_EVENTS
+/*
+ * Per-template BTF id list, populated at link time by resolve_btfids:
+ *   [0] FUNC   __bpf_trace_<call>     (the BPF dispatcher)
+ *   [1] STRUCT trace_event_raw_<call> (the ring-buffer record)
+ * Exposed via the events/<sys>/<name>/btf_ids tracefs file.
+ */
+#define _TRACE_BTF_IDS_DECLARE(call)					\
+	extern u32 __bpf_trace_btf_ids_##call[];			\
+	BTF_ID_LIST_GLOBAL(__bpf_trace_btf_ids_##call, 2)		\
+	BTF_ID(func,   __bpf_trace_##call)				\
+	BTF_ID(struct, trace_event_raw_##call)
+
+#define _TRACE_BTF_IDS_INIT(call)					\
+	.btf_ids		= __bpf_trace_btf_ids_##call,
+
+#else
+#define _TRACE_BTF_IDS_DECLARE(call)
+#define _TRACE_BTF_IDS_INIT(call)
+#endif /* CONFIG_BPF_EVENTS */
+
 #include "stages/stage6_event_callback.h"
 
 
@@ -474,6 +496,7 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 _TRACE_PERF_PROTO(call, PARAMS(proto));					\
+_TRACE_BTF_IDS_DECLARE(call)						\
 static char print_fmt_##call[] = print;					\
 static struct trace_event_class __used __refdata event_class_##call = { \
 	.system			= TRACE_SYSTEM_STRING,			\
@@ -483,6 +506,7 @@ static struct trace_event_class __used __refdata event_class_##call = { \
 	.probe			= trace_event_raw_event_##call,		\
 	.reg			= trace_event_reg,			\
 	_TRACE_PERF_INIT(call)						\
+	_TRACE_BTF_IDS_INIT(call)					\
 };
 
 #undef DECLARE_EVENT_SYSCALL_CLASS
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..b1c07f078f8d 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -22,6 +22,7 @@
 #include <linux/sort.h>
 #include <linux/slab.h>
 #include <linux/delay.h>
+#include <linux/btf.h>
 
 #include <trace/events/sched.h>
 #include <trace/syscall.h>
@@ -2200,6 +2201,61 @@ event_id_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
 }
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+static ssize_t
+event_btf_ids_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+	struct trace_event_file *file;
+	struct trace_event_call *call;
+	const struct btf_type *t;
+	struct module *mod = NULL;
+	u32 raw_id = 0, tp_id = 0, obj_id = 0;
+	const u32 *ids;
+	struct btf *btf;
+	char buf[128];
+	int len;
+
+	/* Module unload could free call->class and ids[] mid-read. */
+	scoped_guard(mutex, &event_mutex) {
+		file = event_file_file(filp);
+		if (!file)
+			return -ENODEV;
+
+		call = file->event_call;
+		ids = call->class->btf_ids;
+		if (!ids)
+			return -ENOENT;
+		if (!(call->flags & TRACE_EVENT_FL_DYNAMIC))
+			mod = (struct module *)call->module;
+
+		btf = btf_get_module_btf(mod);
+		if (IS_ERR_OR_NULL(btf))
+			return -ENOENT;
+
+		/* Module-local ids in ids[] need base+local relocation. */
+		tp_id = btf_relocate_id(btf, ids[1]);
+
+		/*
+		 * Without FL_TRACEPOINT the dispatcher is shared (e.g. all
+		 * per-syscall events fan out from __bpf_trace_sys_enter), so
+		 * raw_btf_id has no per-event attach point — report 0.
+		 */
+		if (call->flags & TRACE_EVENT_FL_TRACEPOINT) {
+			t = btf_type_by_id(btf, btf_relocate_id(btf, ids[0]));
+			raw_id = t ? t->type : 0;
+		}
+		obj_id = btf_obj_id(btf);
+		btf_put(btf);
+	}
+
+	len = scnprintf(buf, sizeof(buf),
+			"btf_obj_id: %u\nraw_btf_id: %u\ntp_btf_id: %u\n",
+			obj_id, raw_id, tp_id);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
+}
+#endif
+
 static ssize_t
 event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
 		  loff_t *ppos)
@@ -2700,6 +2756,13 @@ static const struct file_operations ftrace_event_id_fops = {
 };
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+static const struct file_operations ftrace_event_btf_ids_fops = {
+	.read = event_btf_ids_read,
+	.llseek = default_llseek,
+};
+#endif
+
 static const struct file_operations ftrace_event_filter_fops = {
 	.open = tracing_open_file_tr,
 	.read = event_filter_read,
@@ -3093,6 +3156,14 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	}
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+	if (call->class->btf_ids && strcmp(name, "btf_ids") == 0) {
+		*mode = TRACE_MODE_READ;
+		*fops = &ftrace_event_btf_ids_fops;
+		return 1;
+	}
+#endif
+
 #ifdef CONFIG_HIST_TRIGGERS
 	if (strcmp(name, "hist") == 0) {
 		*mode = TRACE_MODE_READ;
@@ -3147,7 +3218,14 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 			.callback	= event_callback,
 		},
 #endif
-#define NR_RO_EVENT_ENTRIES	(1 + IS_ENABLED(CONFIG_PERF_EVENTS))
+#ifdef CONFIG_BPF_EVENTS
+		{
+			.name		= "btf_ids",
+			.callback	= event_callback,
+		},
+#endif
+#define NR_RO_EVENT_ENTRIES	(1 + IS_ENABLED(CONFIG_PERF_EVENTS) + \
+				 IS_ENABLED(CONFIG_BPF_EVENTS))
 /* Readonly files must be above this line and counted by NR_RO_EVENT_ENTRIES. */
 		{
 			.name		= "enable",
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index e98ee7e1e66f..9134461a8def 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -1303,12 +1303,26 @@ struct trace_event_functions exit_syscall_print_funcs = {
 	.trace		= print_syscall_exit,
 };
 
+#ifdef CONFIG_BPF_EVENTS
+/*
+ * BTF id lists generated by DECLARE_EVENT_CLASS for the sys_enter and
+ * sys_exit tracepoints. The auto-generated event_class_sys_{enter,exit}
+ * is unused (per-syscall events share the handcrafted classes below),
+ * but the id lists themselves are global and reusable.
+ */
+extern u32 __bpf_trace_btf_ids_sys_enter[];
+extern u32 __bpf_trace_btf_ids_sys_exit[];
+#endif
+
 struct trace_event_class __refdata event_class_syscall_enter = {
 	.system		= "syscalls",
 	.reg		= syscall_enter_register,
 	.fields_array	= syscall_enter_fields_array,
 	.get_fields	= syscall_get_enter_fields,
 	.raw_init	= init_syscall_trace,
+#ifdef CONFIG_BPF_EVENTS
+	.btf_ids	= __bpf_trace_btf_ids_sys_enter,
+#endif
 };
 
 struct trace_event_class __refdata event_class_syscall_exit = {
@@ -1321,6 +1335,9 @@ struct trace_event_class __refdata event_class_syscall_exit = {
 	},
 	.fields		= LIST_HEAD_INIT(event_class_syscall_exit.fields),
 	.raw_init	= init_syscall_trace,
+#ifdef CONFIG_BPF_EVENTS
+	.btf_ids	= __bpf_trace_btf_ids_sys_exit,
+#endif
 };
 
 unsigned long __init __weak arch_syscall_addr(int nr)

-- 
2.53.0-Meta



* [PATCH bpf-next v2 1/3] bpf: Make btf_get_module_btf() and btf_relocate_id() non-static
From: Mykyta Yatsenko @ 2026-05-18 15:23 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	rostedt
  Cc: Mykyta Yatsenko, linux-trace-kernel
In-Reply-To: <20260518-generic_tracepoint-v2-0-b755a5cf67bb@meta.com>

From: Mykyta Yatsenko <yatsenko@meta.com>

Drop the static qualifier and add prototypes to <linux/btf.h> so the
tracing core can look up module BTF and translate ids stored by
resolve_btfids (which are local to a module's split BTF) into the
runtime ids used by the kernel.

Used by the upcoming events/<sys>/<event>/btf_ids tracefs interface.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/btf.h | 2 ++
 kernel/bpf/btf.c    | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 240401d9b25b..273a93a3b2bd 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -235,6 +235,8 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 s32 bpf_find_btf_id(const char *name, u32 kind, struct btf **btf_p);
+struct btf *btf_get_module_btf(const struct module *module);
+__u32 btf_relocate_id(const struct btf *btf, __u32 id);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
 					       u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 17d4ab0a8206..4c33dc7b0aef 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6429,7 +6429,7 @@ struct btf *btf_parse_vmlinux(void)
  * split BTF ids will need to be mapped to actual base/split ids for
  * BTF now that it has been relocated.
  */
-static __u32 btf_relocate_id(const struct btf *btf, __u32 id)
+__u32 btf_relocate_id(const struct btf *btf, __u32 id)
 {
 	if (!btf->base_btf || !btf->base_id_map)
 		return id;
@@ -8496,7 +8496,7 @@ struct module *btf_try_get_module(const struct btf *btf)
 /* Returns struct btf corresponding to the struct module.
  * This function can return NULL or ERR_PTR.
  */
-static struct btf *btf_get_module_btf(const struct module *module)
+struct btf *btf_get_module_btf(const struct module *module)
 {
 #ifdef CONFIG_DEBUG_INFO_BTF_MODULES
 	struct btf_module *btf_mod, *tmp;

-- 
2.53.0-Meta



* [PATCH bpf-next v2 0/3] tracing: Expose tracepoint BTF ids via tracefs
From: Mykyta Yatsenko @ 2026-05-18 15:23 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	rostedt
  Cc: Mykyta Yatsenko, linux-trace-kernel

BPF and other consumers that want to attach to or decode a generic
tracepoint need three pieces of BTF information for it:

  - the BTF of the object that owns the tracepoint's types
  - the FUNC_PROTO describing the tracepoint arguments (with names),
    consumed by raw_tp / tp_btf BPF programs
  - the STRUCT id of trace_event_raw_<call>, the ring-buffer record
    consumed by classic BPF_PROG_TYPE_TRACEPOINT programs

Today none of this is easily discoverable from userspace. The kernel
knows the ids - resolve_btfids fills them in at link time - but
consumers have to search for them by naming convention
("__bpf_trace_<name>", "trace_event_raw_<name>"), walking BTF for
every tracepoint.

This series stores those ids in trace_event_class and exposes them
via events/<sys>/<event>/btf_ids, e.g.

  # cat /sys/kernel/tracing/events/sched/sched_switch/btf_ids
    btf_obj_id: 1
    raw_btf_id: 28882
    tp_btf_id: 106335

  # bpftool btf dump id 1 root_id 28882 format raw
  [28882] FUNC_PROTO '(anon)' ret_type_id=0 vlen=5
        '__data' type_id=9
        'preempt' type_id=60674
        'prev' type_id=219
        'next' type_id=219
        'prev_state' type_id=108689

  # bpftool btf dump id 1 root_id 106335 format raw
  [106335] STRUCT 'trace_event_raw_sched_switch' size=64 vlen=9
        'ent' type_id=104654 bits_offset=0
        'prev_comm' type_id=580 bits_offset=64
        'prev_pid' type_id=92875 bits_offset=192
        'prev_prio' type_id=79365 bits_offset=224
        'prev_state' type_id=83958 bits_offset=256
        'next_comm' type_id=580 bits_offset=320
        'next_pid' type_id=92875 bits_offset=448
        'next_prio' type_id=79365 bits_offset=480
        '__data' type_id=407 bits_offset=512

For per-syscall events (all sharing the same dispatcher), raw_btf_id
is 0 — raw_tp / tp_btf programs attach to raw_syscalls/sys_{enter,exit},
not per-syscall events:

  # cat /sys/kernel/tracing/events/syscalls/sys_enter_write/btf_ids
    btf_obj_id: 1
    raw_btf_id: 0
    tp_btf_id: 106540

This unlocks a few use cases for consumers:

  - Resolving tp_btf attach targets and argument types directly,
    instead of constructing "__bpf_trace_*" names and
    re-discovering them in vmlinux BTF.
  - Getting a stable, machine-readable contract for tracepoint payloads,
    with field names preserved.
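
As a minimal sketch of how a userspace consumer might parse the file contents shown above: the three-line layout is taken from the examples in this cover letter, while parse_btf_ids() and has_raw_tp_target() are hypothetical helper names, not part of the series.

```c
#include <stdio.h>

/* The three ids reported by events/<sys>/<event>/btf_ids. */
struct btf_ids_info {
	unsigned int obj_id;
	unsigned int raw_id;
	unsigned int tp_id;
};

/*
 * Hypothetical helper: parse the text of a btf_ids file.
 * Whitespace in the format matches any run of spaces or newlines,
 * so indented or unindented input both work.
 * Returns 0 on success, -1 if not all three fields were found.
 */
int parse_btf_ids(const char *text, struct btf_ids_info *info)
{
	if (sscanf(text, " btf_obj_id: %u raw_btf_id: %u tp_btf_id: %u",
		   &info->obj_id, &info->raw_id, &info->tp_id) != 3)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: raw_btf_id is reported as 0 for events that do
 * not own a tracepoint (e.g. per-syscall events), so a non-zero value
 * means there is a FUNC_PROTO to look up.
 */
int has_raw_tp_target(const struct btf_ids_info *info)
{
	return info->raw_id != 0;
}
```

A caller would read the file into a NUL-terminated buffer, pass it to parse_btf_ids(), and check has_raw_tp_target() before looking up raw_id in the BTF object identified by obj_id.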

Patch 1 exports the two BTF helpers the tracing core needs.
Patch 2 wires DECLARE_EVENT_CLASS to publish the ids, adds the tracefs
        reader, and wires the syscall classes so per-syscall events
        carry tp_btf_id (raw_btf_id is 0 there — see above).
Patch 3 adds a selftest covering the bpf_testmod_test_read tracepoint.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Changes in v2:
- kernel/bpf/btf.c: dropped both EXPORT_SYMBOL_GPL()
- kernel/trace/trace_events.c (event_btf_ids_read):
  replaced guard(mutex)(&event_mutex) with
  scoped_guard(mutex, &event_mutex). scnprintf() and
  simple_read_from_buffer() (which calls copy_to_user()) now run
  outside the lock.
- tools/testing/selftests/bpf/prog_tests/tp_btf_ids.c:
  - Added if (!env.has_testmod) { test__skip(); return; } at the top of
    test_tp_btf_ids() so the test skips gracefully when bpf_testmod.ko
    is absent.
  - Wrapped ASSERT_EQ(btf_vlen(proto_t), 3, ...) with if (!...) goto out;
    to prevent OOB read of params[2].
  - Added if (!ASSERT_GE(btf_vlen(rec_t), 5, ...)) goto out; before reading
    members[0..4].
- Link to v1: https://patch.msgid.link/20260515-generic_tracepoint-v1-0-aa619fa94132@meta.com

---
Mykyta Yatsenko (3):
      bpf: Make btf_get_module_btf() and btf_relocate_id() non-static
      tracing: Expose tracepoint BTF ids via tracefs
      selftests/bpf: Add test for tracepoint btf_ids tracefs file

 include/linux/btf.h                                |   2 +
 include/linux/trace_events.h                       |   9 ++
 include/trace/trace_events.h                       |  24 ++++
 kernel/bpf/btf.c                                   |   4 +-
 kernel/trace/trace_events.c                        |  80 ++++++++++++-
 kernel/trace/trace_syscalls.c                      |  17 +++
 .../testing/selftests/bpf/prog_tests/tp_btf_ids.c  | 132 +++++++++++++++++++++
 7 files changed, 265 insertions(+), 3 deletions(-)
---
base-commit: 8668cd470c38011c44a42f6c7b188f4149f23a7a
change-id: 20260508-generic_tracepoint-d488a5a7ab18

Best regards,
--  
Mykyta Yatsenko <yatsenko@meta.com>



* Re: [PATCH v3 06/11] drm: Use trace_call__##name() at guarded tracepoint call sites
From: Philipp Stanner @ 2026-05-18 15:01 UTC (permalink / raw)
  To: Vineeth Pillai (Google), Alex Deucher, Christian König,
	David Airlie, Simona Vetter, Harry Wentland, Leo Li,
	Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann
  Cc: amd-gfx, dri-devel, Steven Rostedt, linux-trace-kernel,
	Peter Zijlstra
In-Reply-To: <20260515135932.2238842-1-vineeth@bitbyteword.org>

On Fri, 2026-05-15 at 09:59 -0400, Vineeth Pillai (Google) wrote:
> From: Vineeth Pillai <vineeth@bitbyteword.org>
> 
> Replace trace_foo() with the new trace_call__foo() at sites already
> guarded by trace_foo_enabled(), avoiding a redundant
> static_branch_unlikely() re-evaluation inside the tracepoint.
> trace_call__foo() calls the tracepoint callbacks directly without
> utilizing the static branch again.

The "foo" terminology is unusual I think? I always wrote it with regex,
like "trace_*()".



> 
> Original v2 series:
> https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

I'd put this in a Link: tag section below.

> 
> Parts of the original v2 series have already been merged in mainline.
> This patch is being reposted as a follow-up cleanup for the remaining
> unmerged pieces.

So this v3 series as a whole is a followup to that v2?

> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Assisted-by: Claude:claude-sonnet-4-6
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  4 ++--
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 +++++-----
>  drivers/gpu/drm/scheduler/sched_entity.c          |  5 +++--
>  4 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index b24d5d21be5f..cb0b5cb07d57 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1004,7 +1004,7 @@ static void trace_amdgpu_cs_ibs(struct amdgpu_cs_parser *p)
>  		struct amdgpu_job *job = p->jobs[i];
>  
>  		for (j = 0; j < job->num_ibs; ++j)
> -			trace_amdgpu_cs(p, job, &job->ibs[j]);
> +			trace_call__amdgpu_cs(p, job, &job->ibs[j]);
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 9ba9de16a27a..a36ae94c425f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1415,7 +1415,7 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
>  
>  	if (trace_amdgpu_vm_bo_mapping_enabled()) {
>  		list_for_each_entry(mapping, &bo_va->valids, list)
> -			trace_amdgpu_vm_bo_mapping(mapping);
> +			trace_call__amdgpu_vm_bo_mapping(mapping);
>  	}
>  
>  error_free:
> @@ -2183,7 +2183,7 @@ void amdgpu_vm_bo_trace_cs(struct amdgpu_vm *vm, struct ww_acquire_ctx *ticket)
>  				continue;
>  		}
>  
> -		trace_amdgpu_vm_bo_cs(mapping);
> +		trace_call__amdgpu_vm_bo_cs(mapping);
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 5fc5d5608506..fbdc12cdd6bb 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -5263,11 +5263,11 @@ static void amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm,
>  	}
>  
>  	if (trace_amdgpu_dm_brightness_enabled()) {
> -		trace_amdgpu_dm_brightness(__builtin_return_address(0),
> -					   user_brightness,
> -					   brightness,
> -					   caps->aux_support,
> -					   power_supply_is_system_supplied() > 0);
> +		trace_call__amdgpu_dm_brightness(__builtin_return_address(0),
> +						 user_brightness,
> +						 brightness,
> +						 caps->aux_support,
> +						 power_supply_is_system_supplied() > 0);
>  	}
>  
>  	if (caps->aux_support) {
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index fe174a4857be..185a2636b599 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -429,7 +429,8 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
>  
>  	if (trace_drm_sched_job_unschedulable_enabled() &&
>  	    !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &entity->dependency->flags))
> -		trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
> +		trace_call__drm_sched_job_unschedulable(sched_job,
> +							entity->dependency);

I would be more happy if you sacrifice a bit of space here and keep it
a single line since the if condition is already quite convoluted and
challenging to read.


P.

>  
>  	if (!dma_fence_add_callback(entity->dependency, &entity->cb,
>  				    drm_sched_entity_wakeup))
> @@ -586,7 +587,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		unsigned long index;
>  
>  		xa_for_each(&sched_job->dependencies, index, entry)
> -			trace_drm_sched_job_add_dep(sched_job, entry);
> +			trace_call__drm_sched_job_add_dep(sched_job, entry);
>  	}
>  	atomic_inc(entity->rq->sched->score);
>  	WRITE_ONCE(entity->last_user, current->group_leader);



* Re: [PATCH mm-unstable v17 02/14] mm/khugepaged: generalize alloc_charge_folio()
From: Lance Yang @ 2026-05-18 14:49 UTC (permalink / raw)
  To: Usama Arif, Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260518115553.3513034-1-usama.arif@linux.dev>



On 2026/5/18 19:55, Usama Arif wrote:
[...]
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 979885694351..f0e29d5c7b1f 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>>   }
>>   
>>   static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> -		struct collapse_control *cc)
>> +		struct collapse_control *cc, unsigned int order)
>>   {
>>   	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>>   		     GFP_TRANSHUGE);
>>   	int node = collapse_find_target_node(cc);
>>   	struct folio *folio;
>>   
>> -	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
>> +	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
>>   	if (!folio) {
>>   		*foliop = NULL;
>> -		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>> +		if (is_pmd_order(order))
>> +			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>> +		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
>>   		return SCAN_ALLOC_HUGE_PAGE_FAIL;
>>   	}
>>   
>> -	count_vm_event(THP_COLLAPSE_ALLOC);
>> +	if (is_pmd_order(order))
>> +		count_vm_event(THP_COLLAPSE_ALLOC);
>> +	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
>> +
> 
> The vmstat THP_COLLAPSE_ALLOC counter is pmd order only.
> But after this we have
> 
> 	count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
> 
> which is not being guarded with is_pmd_order().

Good catch!

> 
> I think we want this to be pmd order only as well so that
> the meaning of the vmstat and cgroup counter remains the same?

Agreed. THP_COLLAPSE_ALLOC should remain PMD order only for
vmstat and memcg events.

So this should be guarded with is_pmd_order() as well :)

Cheers, Lance


* Re: [PATCH 02/13] verification/rvgen: Introduce a parse tree for automata using Lark
From: Wander Lairson Costa @ 2026-05-18 14:45 UTC (permalink / raw)
  To: Nam Cao; +Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <87pl2t89ue.fsf@yellow.woof>

On Mon, May 18, 2026 at 4:18 AM Nam Cao <namcao@linutronix.de> wrote:
>
> Wander Lairson Costa <wander@redhat.com> writes:
> > On Tue, May 05, 2026 at 08:59:23AM +0200, Nam Cao wrote:
> >> +    ID: /[_a-zA-Z][_a-zA-Z0-9]+/
> >
> > This regex rejects symbol character symbol. Is that intentional?
>
> It wasn't intentional. This is blindly copied from the existing regex.
>
> Let me switch to Lark's CNAME.
>

Note: there is a typo. s/symbol/single/.

> Nam
>



* Re: [PATCH 06/13] verification/rvgen: Convert __fill_verify_guards_func() to Lark
From: Wander Lairson Costa @ 2026-05-18 14:44 UTC (permalink / raw)
  To: Nam Cao; +Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <87jyt189oe.fsf@yellow.woof>

On Mon, May 18, 2026 at 4:21 AM Nam Cao <namcao@linutronix.de> wrote:
>
> Wander Lairson Costa <wander@redhat.com> writes:
> >> +        if not self.has_guard:
> >> +            return
> >
> > The signature of function says this function return a list, instead of
> > None.
>
> Can you share the tools you are using to catch these? Or did you notice
> that yourself?
>

I use pyright [1] with vim integration.

[1] https://github.com/microsoft/pyright

> Nam
>



* Re: [PATCH v2 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Mark Brown @ 2026-05-18 14:07 UTC (permalink / raw)
  To: Praveen Talari
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-arm-msm, linux-spi, mukesh.savaliya,
	aniket.randive, chandana.chiluveru, jyothi.seerapu
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-1-3b184068ecf9@oss.qualcomm.com>


On Tue, May 12, 2026 at 11:42:52AM +0530, Praveen Talari wrote:

> +TRACE_EVENT(geni_spi_fifo_params,
> +	    TP_PROTO(struct device *dev, u8 cs, u32 mode,
> +		     u32 mode_changed, bool cs_changed),
> +	    TP_ARGS(dev, cs, mode, mode_changed, cs_changed),
> +
> +	    TP_STRUCT__entry(__string(name, dev_name(dev))
> +			     __field(u8, cs)
> +			     __field(u32, mode)
> +			     __field(u32, mode_changed)
> +			     __field(bool, cs_changed)

These don't really seem like FIFO parameters?  I see that's the name of
the function where we log this but they're more just generic bus status
things.


* [PATCH v3] tracing/probes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-18 13:58 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the FETCHARGS parsing of probes to allow the use of
structure and member names to get the offsets to dereference pointers.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
reference. For example, to get the size of a kmem_cache that was passed to
the function kmem_cache_alloc_noprof, one would need to do:

 # cd /sys/kernel/tracing
 # echo 'f:cache kmem_cache_alloc_noprof size=+0x18($arg1):u32' >> dynamic_events

This requires knowing that the offset of size is 0x18, which can be found
with gdb:

  (gdb) p &((struct kmem_cache *)0)->size
  $1 = (unsigned int *) 0x18

If BTF is in the kernel, it can be used to find this with names, where the
user doesn't need to find the actual offset:

 # echo 'f:cache kmem_cache_alloc_noprof size=+kmem_cache.size($arg1):u32' >> dynamic_events

Instead of the "+0x18", it would have "+kmem_cache.size" where the format is:

  +STRUCT.MEMBER[.MEMBER[..]]

The delimiter is '.' and the first item is the structure name. It is
followed by the member of the structure whose offset is wanted. If that
member is an embedded structure, another '.MEMBER' may be added to get the
offset of its members with respect to the original value.

  "+kmem_cache.size($arg1)" is equivalent to:

  (*(struct kmem_cache *)$arg1).size

Anonymous structures are also handled:

  # echo 'e:xmit net.net_dev_xmit +net_device.name(+sk_buff.dev($skbaddr)):string' >> dynamic_events

Where "+net_device.name(+sk_buff.dev($skbaddr))" is equivalent to:

  (*(struct net_device *)((*(struct sk_buff *)($skbaddr)).dev)).name

Note that "dev" of struct sk_buff is inside an anonymous structure:

struct sk_buff {
	union {
		struct {
			/* These two members must be first to match sk_buff_head. */
			struct sk_buff		*next;
			struct sk_buff		*prev;

			union {
				struct net_device	*dev;
				[..]
			};
		};
		[..]
	};

This will allow up to three levels of nested anonymous structures before it
fails to find a member.

The above produces:

    sshd-session-1080    [000] b..5.  1526.337161: xmit: (net.net_dev_xmit) arg1="enp7s0"

And nested structures can be found by adding more members to the arg:

  # echo 'f:read filemap_readahead.isra.0 file=+0(+dentry.d_name.name(+file.f_path.dentry($arg2))):string' >> dynamic_events

The above is equivalent to:

  *((*(struct dentry *)((*(struct file *)$arg2).f_path.dentry)).d_name.name)

And produces:

       trace-cmd-1381    [002] ...1.  2082.676268: read: (filemap_readahead.isra.0+0x0/0x150) file="trace.dat"

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v2: https://patch.msgid.link/20260516173310.1dbad146@fedora

- Added skip_modifiers when looking up field (Sashiko)

- Pass -E2BIG error to caller if that was the issue (Sashiko)

- Fix btf_put() error path (Sashiko)

- Update error log on -ENOENT (Sashiko)

 Documentation/trace/kprobetrace.rst |   3 +
 kernel/trace/trace_btf.c            | 115 ++++++++++++++++++++++++++++
 kernel/trace/trace_btf.h            |  10 +++
 kernel/trace/trace_probe.c          |  20 ++++-
 kernel/trace/trace_probe.h          |   4 +-
 5 files changed, 149 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..00273157100c 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -54,6 +54,8 @@ Synopsis of kprobe_events
   $retval	: Fetch return value.(\*2)
   $comm		: Fetch current task comm.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  +STRUCT.MEMBER[.MEMBER[..]](FETCHARG) : If BTF is supported, Fetch memory
+		  at FETCHARG + the offset of MEMBER inside of STRUCT.(\*5)
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
@@ -70,6 +72,7 @@ Synopsis of kprobe_events
         accesses one register.
   (\*3) this is useful for fetching a field of data structures.
   (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
+  (\*5) +STRUCT.MEMBER(FETCHARG) is equivalent to (*(struct STRUCT *)(FETCHARG)).MEMBER
 
 Function arguments at kretprobe
 -------------------------------
diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
index 00172f301f25..ca09982d8dbe 100644
--- a/kernel/trace/trace_btf.c
+++ b/kernel/trace/trace_btf.c
@@ -120,3 +120,118 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
 	return member;
 }
 
+#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
+
+static int find_member(const char *ptr, struct btf *btf,
+		       const struct btf_type **type, int level)
+{
+	const struct btf_member *member;
+	const struct btf_type *t = *type;
+	int i;
+
+	/* Max of 3 depth of anonymous structures */
+	if (level > 3)
+		return -E2BIG;
+
+	for_each_member(i, t, member) {
+		const char *tname = btf_name_by_offset(btf, member->name_off);
+
+		if (strcmp(ptr, tname) == 0) {
+			int offset = __btf_member_bit_offset(t, member);
+			*type = btf_type_skip_modifiers(btf, member->type, NULL);
+			return BITS_ROUNDDOWN_BYTES(offset);
+		}
+
+		/* Handle anonymous structures */
+		if (strlen(tname))
+			continue;
+
+		*type = btf_type_by_id(btf, member->type);
+		if (btf_type_is_struct(*type)) {
+			int offset = find_member(ptr, btf, type, level + 1);
+
+			if (offset < 0) {
+				if (offset == -ENOENT)
+					continue;
+				return offset;
+			}
+
+			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
+		}
+	}
+
+	return -ENOENT;
+}
+
+/**
+ * btf_find_offset - Find an offset of a member for a structure
+ * @arg: A structure name followed by one or more members
+ * @offset_p: A pointer to where to store the offset
+ *
+ * Will parse @arg with the expected format of: struct.member[[.member]..]
+ * It is delimited by '.'. The first item must be a structure type.
+ * The next are its members. If the member is also of a structure type,
+ * another member may follow ".member".
+ *
+ * Note, @arg is modified but will be put back to what it was on return.
+ *
+ * Returns: 0 on success and -EINVAL if no '.' is present
+ *    or -ENXIO if the structure or member is not found.
+ *    Returns -EINVAL if BTF is not defined.
+ *  On success, @offset_p will contain the offset of the member specified
+ *    by @arg.
+ */
+int btf_find_offset(char *arg, long *offset_p)
+{
+	const struct btf_type *t;
+	struct btf *btf;
+	long offset = 0;
+	char *ptr;
+	int ret;
+	s32 id;
+
+	ptr = strchr(arg, '.');
+	if (!ptr)
+		return -EINVAL;
+
+	*ptr = '\0';
+
+	ret = -ENXIO;
+	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		goto error;
+
+	/* Get BTF_KIND_STRUCT type */
+	t = btf_type_by_id(btf, id);
+
+	/* May allow more than one member, as long as they are structures */
+	do {
+		ret = -ENXIO;
+		if (!t || !btf_type_is_struct(t))
+			goto error_put;
+
+		*ptr++ = '.';
+		arg = ptr;
+		ptr = strchr(ptr, '.');
+		if (ptr)
+			*ptr = '\0';
+
+		ret = find_member(arg, btf, &t, 0);
+		if (ret < 0)
+			goto error_put;
+
+		offset += ret;
+
+	} while (ptr);
+
+	btf_put(btf);
+	*offset_p = offset;
+	return 0;
+
+error_put:
+	btf_put(btf);
+error:
+	if (ptr)
+		*ptr = '.';
+	return ret;
+}
diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
index 4bc44bc261e6..7b0797a6050b 100644
--- a/kernel/trace/trace_btf.h
+++ b/kernel/trace/trace_btf.h
@@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
 						const struct btf_type *type,
 						const char *member_name,
 						u32 *anon_offset);
+
+#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
+/* Will modify arg, but will put it back before returning. */
+int btf_find_offset(char *arg, long *offset);
+#else
+static inline int btf_find_offset(char *arg, long *offset)
+{
+	return -EINVAL;
+}
+#endif
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..74c4255da307 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 
 	case '+':	/* deref memory */
 	case '-':
-		if (arg[1] == 'u') {
+		if (arg[1] == 'u' && isdigit(arg[2])) {
 			deref = FETCH_OP_UDEREF;
 			arg[1] = arg[0];
 			arg++;
@@ -1178,7 +1178,23 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 			return -EINVAL;
 		}
 		*tmp = '\0';
-		ret = kstrtol(arg, 0, &offset);
+		if (arg[0] != '-' && !isdigit(*arg)) {
+			int err = 0;
+			ret = btf_find_offset(arg, &offset);
+			switch (ret) {
+			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
+			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
+			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
+			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;
+			case -ENOENT: err = TP_ERR_NO_BTF_FIELD; break;
+			}
+			if (err)
+				__trace_probe_log_err(ctx->offset, err);
+			if (ret < 0)
+				return ret;
+		} else {
+			ret = kstrtol(arg, 0, &offset);
+		}
 		if (ret) {
 			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
 			break;
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..d649bb9f5b7c 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
-	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
+	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
+	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
+	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a
-- 
2.53.0



* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Usama Arif @ 2026-05-18 13:49 UTC (permalink / raw)
  To: Nico Pache
  Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-5-npache@redhat.com>

On Mon, 11 May 2026 12:58:04 -0600 Nico Pache <npache@redhat.com> wrote:

> generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
> 
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_pte_none values
> (even those scaled by order), a collapse of a lower order can introduce
> a feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
> 
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
> 
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
> 
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>   that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>   available mTHP order.
> 
> This removes the possibility of "creep", while not modifying any uAPI
> expectations. A warning will be emitted if any non-supported
> max_ptes_none value is configured with mTHP enabled.
> 
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
> 
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
> 
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> 
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
>  2 files changed, 85 insertions(+), 35 deletions(-)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index bcdc57eea270..443e0bd13fdb 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -39,7 +39,8 @@
>  	EM( SCAN_STORE_FAILED,		"store_failed")			\
>  	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
>  	EM( SCAN_PAGE_FILLED,		"page_filled")			\
> -	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> +	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
> +	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
>  
>  #undef EM
>  #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f68853b3caa7..27465161fa6d 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,7 @@ enum scan_result {
>  	SCAN_COPY_MC,
>  	SCAN_PAGE_FILLED,
>  	SCAN_PAGE_DIRTY_OR_WRITEBACK,
> +	SCAN_INVALID_PTES_NONE,
>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>   * PTEs for the given collapse operation.
>   * @cc: The collapse control struct
>   * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of none-page or zero-page PTEs allowed for the
>   * collapse operation.
>   */
> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> -		struct vm_area_struct *vma)
> +static int collapse_max_ptes_none(struct collapse_control *cc,
> +		struct vm_area_struct *vma, unsigned int order)
>  {
> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
>  	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>  	if (vma && userfaultfd_armed(vma))
>  		return 0;
>  	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> -	// For all other cases repect the user defined maximum.
> -	return khugepaged_max_ptes_none;
> +	// for PMD collapse, respect the user defined maximum.
> +	if (is_pmd_order(order))
> +		return max_ptes_none;
> +	/* Zero/non-present collapse disabled. */
> +	if (!max_ptes_none)
> +		return 0;
> +	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> +	// scale the maximum number of PTEs to the order of the collapse.
> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> +		return (1 << order) - 1;
> +
> +	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> +	// Emit a warning and return -EINVAL.
> +	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> +		      KHUGEPAGED_MAX_PTES_LIMIT);
> +	return -EINVAL;
>  }
>  
>  /**
>   * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>   * anonymous pages for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of PTEs that map shared anonymous pages for the
>   * collapse operation
>   */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	// for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
>  	// anonymous pages.
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	// for mTHP collapse do not allow collapsing anonymous memory pages that
> +	// are shared between processes.
> +	if (!is_pmd_order(order))
> +		return 0;
> +	// for PMD collapse, respect the user defined maximum.
>  	return khugepaged_max_ptes_shared;
>  }
>  
> @@ -391,16 +415,22 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>   * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>   * maximum allowed non-present pagecache entries for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>   * pagecache entries for the collapse operation.
>   */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	// for MADV_COLLAPSE, do not restrict the number PTEs entries or
>  	// pagecache entries that are non-present.
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	// for mTHP collapse do not allow any non-present PTEs or pagecache entries.
> +	if (!is_pmd_order(order))
> +		return 0;
> +	// for PMD collapse, respect the user defined maximum.
>  	return khugepaged_max_ptes_swap;
>  }
>  
> @@ -594,18 +624,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  
>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> -		struct list_head *compound_pagelist)
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
> -	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> +	int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> +
> +	if (max_ptes_none < 0)
> +		return SCAN_INVALID_PTES_NONE;
>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + nr_pages;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> @@ -738,18 +772,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  }
>  
>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> -						struct vm_area_struct *vma,
> -						unsigned long address,
> -						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +		struct vm_area_struct *vma, unsigned long address,
> +		spinlock_t *ptl, unsigned int order,
> +		struct list_head *compound_pagelist)
>  {
> -	unsigned long end = address + HPAGE_PMD_SIZE;
> +	const unsigned long nr_pages = 1UL << order;
> +	unsigned long end = address + (PAGE_SIZE << order);
>  	struct folio *src, *tmp;
>  	pte_t pteval;
>  	pte_t *_pte;
>  	unsigned int nr_ptes;
>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>  	     address += nr_ptes * PAGE_SIZE) {
>  		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
> @@ -802,11 +836,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  }
>  
>  static void __collapse_huge_page_copy_failed(pte_t *pte,
> -					     pmd_t *pmd,
> -					     pmd_t orig_pmd,
> -					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	spinlock_t *pmd_ptl;
>  
>  	/*
> @@ -822,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>  }
>  
>  /*
> @@ -842,16 +875,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   */
>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> -		unsigned long address, spinlock_t *ptl,
> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	unsigned int i;
>  	enum scan_result result = SCAN_SUCCEED;
>  
>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -870,10 +904,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>  
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    order, compound_pagelist);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 order, compound_pagelist);
>  
>  	return result;
>  }
> @@ -1044,12 +1078,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>   */


Can you add a comment above __collapse_huge_page_swapin() saying that swapin is
done for PMD order only? Something like:

For PMD-order collapse this faults in any swap entries it finds. For mTHP
orders the function bails on the first swap entry with SCAN_EXCEED_SWAP_PTE,
because faulting pages back in during a lower-order collapse could re-populate
PTEs that push a later scan over the threshold for a higher-order collapse.
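
To make the suggested comment concrete, here is a minimal userspace sketch of
the bail-out rule. It is a model only, not the kernel code: model_swapin(),
struct model_pte and MODEL_PMD_ORDER are hypothetical names, and it assumes 4K
base pages with a PMD order of 9. The real function also drops locks and calls
do_swap_page(), which the sketch replaces with a flag flip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: PMD order for 4K base pages. */
#define MODEL_PMD_ORDER 9

enum model_scan_result {
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_EXCEED_SWAP_PTE,
};

/* One modelled PTE: either a swap entry or anything else. */
struct model_pte {
	bool is_swap;
};

/*
 * Walk 1 << order modelled PTEs. For PMD order, "fault in" every swap
 * entry and count it. For smaller orders, stop at the first swap entry
 * without touching it.
 */
static enum model_scan_result model_swapin(struct model_pte *ptes,
					   unsigned int order,
					   unsigned int *swapped_in)
{
	size_t nr = (size_t)1 << order;
	size_t i;

	assert(order <= MODEL_PMD_ORDER);
	*swapped_in = 0;

	for (i = 0; i < nr; i++) {
		if (!ptes[i].is_swap)
			continue;
		if (order != MODEL_PMD_ORDER)
			return MODEL_SCAN_EXCEED_SWAP_PTE;
		ptes[i].is_swap = false;	/* stands in for do_swap_page() */
		(*swapped_in)++;
	}
	return MODEL_SCAN_SUCCEED;
}
```

With a single swap entry in the range, an order-4 call returns
MODEL_SCAN_EXCEED_SWAP_PTE and leaves the entry alone, while a PMD-order call
faults it in and succeeds.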


>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> -		int referenced)
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		pmd_t *pmd, int referenced, unsigned int order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>  	enum scan_result result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1081,6 +1115,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		    pte_present(vmf.orig_pte))
>  			continue;
>  
> +		/*
> +		 * TODO: Support swapin without leading to further mTHP
> +		 * collapses. Currently bringing in new pages via swapin may
> +		 * cause a future higher order collapse on a rescan of the same
> +		 * range.
> +		 */
> +		if (!is_pmd_order(order)) {
> +			pte_unmap(pte);
> +			mmap_read_unlock(mm);
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}
> +
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
>  		ret = do_swap_page(&vmf);
> @@ -1200,7 +1247,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
>  		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced);
> +						     referenced, HPAGE_PMD_ORDER);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1248,6 +1295,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> +						      HPAGE_PMD_ORDER,
>  						      &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
> @@ -1278,6 +1326,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>  					   vma, address, pte_ptl,
> +					   HPAGE_PMD_ORDER,
>  					   &compound_pagelist);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
> @@ -1313,9 +1362,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -2369,8 +2418,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		unsigned long addr, struct file *file, pgoff_t start,
>  		struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	XA_STATE(xas, &mapping->i_pages, start);
> -- 
> 2.54.0
> 
> 

^ permalink raw reply

* Re: [PATCH v2 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Konrad Dybcio @ 2026-05-18 13:48 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Greg Kroah-Hartman, Jiri Slaby
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-1-a5726421b3af@oss.qualcomm.com>

On 5/12/26 7:14 PM, Praveen Talari wrote:
> Add tracepoint support to the Qualcomm GENI serial driver to provide
> runtime visibility into driver behavior without requiring invasive debug
> patches.
> 
> The trace events cover UART termios configuration, clock setup, modem
> control state, interrupt status, and TX/RX data, making it easier to
> diagnose communication issues in the field.
> 
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---
> v1->v2:
> - Removed multiple TX/RX trace events, instead used
>   DECLARE_EVENT_CLASS and DEFINE_EVENT.
> ---
>  include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
>  1 file changed, 172 insertions(+)
> 
> diff --git a/include/trace/events/qcom_geni_serial.h b/include/trace/events/qcom_geni_serial.h
> new file mode 100644
> index 000000000000..5e23827881d0
> --- /dev/null
> +++ b/include/trace/events/qcom_geni_serial.h

Oh, I only noticed now that this isn't in a subsystem/driver-local
directory. I suppose it's up to the other maintainers whether they
like that.

Konrad

^ permalink raw reply

* Re: [PATCH v2 2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
From: Konrad Dybcio @ 2026-05-18 13:44 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-2-3b184068ecf9@oss.qualcomm.com>

On 5/12/26 8:12 AM, Praveen Talari wrote:
> Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
> These trace events enable runtime debugging and performance analysis
> of SPI operations.
> 
> The trace events capture SPI clock configuration, FIFO parameters,
> transfer details, interrupt status.
> 
> Usage examples:
> 
> Enable all SPI traces:
>   echo 1 > /sys/kernel/tracing/events/spi/enable
>   echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
>   cat /sys/kernel/debug/tracing/trace_pipe
> 
> Example trace output:
> 
> 1003.956560: spi_message_submit: spi16.0 000000001b20b93c
> 1003.956642: spi_controller_busy: spi16
> 1003.956643: spi_message_start: spi16.0 000000001b20b93c
> 1003.956646: geni_spi_fifo_params: 888000.spi: cs=0 mode=0x00000020
>      mode_changed=0x00000007 cs_changed=0
> 1003.956647: spi_set_cs: spi16.0 activate
> 1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
>      tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
> rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
> 1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
>      sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
> 1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
> 1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
>      dma_tx=0x00000000 dma_rx=0x00000000
> 1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
>      tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
> rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
> 1003.956717: spi_set_cs: spi16.0 deactivate
> 1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16

Same feedback regarding this part of the commit message as on the
UART patch

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad

^ permalink raw reply

* Re: [PATCH v2 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Konrad Dybcio @ 2026-05-18 13:44 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-1-3b184068ecf9@oss.qualcomm.com>

On 5/12/26 8:12 AM, Praveen Talari wrote:
> Add tracepoint support to the Qualcomm GENI SPI driver to provide
> runtime visibility into driver behavior without requiring invasive debug
> patches.
> 
> The trace events cover clock and FIFO parameter configuration,
> transfer metadata, and interrupt status, making it easier to diagnose
> communication issues in the field.
> 
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---
> v1->v2:
> - Removed TX/RX data tracepoints.
> - Updated commit text.
> ---

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad

^ permalink raw reply

* Re: [PATCH v2 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Konrad Dybcio @ 2026-05-18 13:41 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Greg Kroah-Hartman, Jiri Slaby
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-2-a5726421b3af@oss.qualcomm.com>

On 5/12/26 7:14 PM, Praveen Talari wrote:
> Add tracing to the Qualcomm GENI serial driver to improve runtime
> observability.
> 
> Trace hooks are added at key points including termios and clock
> configuration, manual control get/set, interrupt handling, and data
> TX/RX paths.
> 
> Usage examples:
> 
> Enable all serial traces:
>   echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
>   cat /sys/kernel/debug/tracing/trace_pipe
> 
> Example trace output:
> 
> 2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
>      clk_rate=7372800 clk_div=4 clk_idx=0
> 2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
>      s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
> 2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
>      tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
> rx_par=0x00000000 stop=0
> 2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
>      uart_manual_rfr=0x00000000
> 2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
>      geni_ios=0x00000001
> 2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
>      s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
> 2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
>      64 65 66 67 68
> 2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
>      s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
> 2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
>      s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
> 2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
>      64 65 66 67 68
> 2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
>      uart_manual_rfr=0x80000002

I think the example (or at least the data that it produces) could go
under the --- line, there's plenty of docs regarding tracing on
docs.kernel.org

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad

^ permalink raw reply

* Re: [PATCH v2 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Konrad Dybcio @ 2026-05-18 13:40 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Greg Kroah-Hartman, Jiri Slaby
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-1-a5726421b3af@oss.qualcomm.com>

On 5/12/26 7:14 PM, Praveen Talari wrote:
> Add tracepoint support to the Qualcomm GENI serial driver to provide
> runtime visibility into driver behavior without requiring invasive debug
> patches.
> 
> The trace events cover UART termios configuration, clock setup, modem
> control state, interrupt status, and TX/RX data, making it easier to
> diagnose communication issues in the field.
> 
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---

[...]

> +DEFINE_EVENT(geni_serial_data, geni_serial_tx_data,
> +
> +	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
> +
> +	TP_ARGS(dev, buf, len)
> +
> +);
> +
> +DEFINE_EVENT(geni_serial_data, geni_serial_rx_data,
> +
> +	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
> +
> +	TP_ARGS(dev, buf, len)
> +
> +);

stray \ns above

otherwise lgtm

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad

^ permalink raw reply

* Re: [PATCH v6 0/2] blk-mq: introduce tag starvation observability
From: Jens Axboe @ 2026-05-18 13:31 UTC (permalink / raw)
  To: Aaron Tomlin, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260517213614.350367-1-atomlin@atomlin.com>

On 5/17/26 3:36 PM, Aaron Tomlin wrote:
> Hi Jens, Steve, Masami,
> 
> In high-performance storage environments, particularly when utilising RAID
> controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
> spikes can occur when fast devices are starved of available tags.
> Currently, diagnosing this specific queue contention requires deploying
> dynamic kprobes or inferring sleep states, which lacks a simple,
> out-of-the-box diagnostic path.
> 
> This short series introduces dedicated, low-overhead observability for tag
> exhaustion events in the block layer:
> 
>   - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
>     allocation slow-path to capture precise, event-based starvation.
> 
>   - Patch 2 complements this by exposing "wait_on_hw_tag" and
>     "wait_on_sched_tag" per-CPU counters via debugfs for quick,
>     point-in-time cumulative polling.
> 
> Together, these provide storage engineers with zero-configuration
> mechanisms to definitively identify shared-tag bottlenecks.

Why not just issue the trace points? Then there's close to zero
overhead, rather than needing added counters for this and the
kernel to keep track. If you just issue the get/put tag kind of traces,
then userspace can keep track. That's what blktrace has done for decades
for things like inflight/queue depth accounting.

IOW, seems to me, this could be done with basically zero kernel
additions outside of perhaps a trace point or two.
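
A rough sketch of what the userspace side of that could look like, in C. It
assumes some tool has already parsed the trace stream into get/put events; the
event type, struct tag_stats and tag_account() are hypothetical names, not an
existing blktrace or kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parsed trace events; not real tracepoint names. */
enum tag_event {
	TAG_GET,	/* a request obtained a tag */
	TAG_PUT,	/* a request released its tag */
};

struct tag_stats {
	unsigned int depth;		/* size of the tag set */
	unsigned int inflight;		/* tags currently held */
	unsigned int max_inflight;	/* high-water mark */
	unsigned long exhausted;	/* times the set became full */
};

/* Update the running counters for one event from the trace stream. */
static void tag_account(struct tag_stats *s, enum tag_event ev)
{
	switch (ev) {
	case TAG_GET:
		assert(s->inflight < s->depth);
		s->inflight++;
		if (s->inflight > s->max_inflight)
			s->max_inflight = s->inflight;
		if (s->inflight == s->depth)
			s->exhausted++;
		break;
	case TAG_PUT:
		assert(s->inflight > 0);
		s->inflight--;
		break;
	}
}
```

All state lives in the consumer, so the kernel only has to emit the events.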

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: David Hildenbrand (Arm) @ 2026-05-18 13:16 UTC (permalink / raw)
  To: Wei Yang, Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260514031009.f66cgop3ctgiqxz3@master>

On 5/14/26 05:10, Wei Yang wrote:
> On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
>>
>> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>>> functions to support future mTHP collapse.
>>>
>>> The current mechanism for determining collapse with the
>>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>>> raises a key design issue: if we support user defined max_pte_none values
>>> (even those scaled by order), a collapse of a lower order can introduce
>>> a feedback loop, or "creep", when max_ptes_none is set to a value greater
>>> than HPAGE_PMD_NR / 2. [1]
>>>
>>> With this configuration, a successful collapse to order N will populate
>>> enough pages to satisfy the collapse condition on order N+1 on the next
>>> scan. This leads to unnecessary work and memory churn.
>>>
>>> To fix this issue introduce a helper function that will limit mTHP
>>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>>> This effectively supports two modes: [2]
>>>
>>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>>>  that maps the shared zeropage. Consequently, no memory bloat.
>>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>>  available mTHP order.
>>>
>>> This removes the possibility of "creep", while not modifying any uAPI
>>> expectations. A warning will be emitted if any non-supported
>>> max_ptes_none value is configured with mTHP enabled.
>>>
>>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>>> shared or swapped entry.
>>>
>>> No functional changes in this patch; however it defines future behavior
>>> for mTHP collapse.
>>>
>>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>>>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> include/trace/events/huge_memory.h |   3 +-
>>> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
>>> 2 files changed, 85 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>> index bcdc57eea270..443e0bd13fdb 100644
>>> --- a/include/trace/events/huge_memory.h
>>> +++ b/include/trace/events/huge_memory.h
>>> @@ -39,7 +39,8 @@
>>> 	EM( SCAN_STORE_FAILED,		"store_failed")			\
>>> 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
>>> 	EM( SCAN_PAGE_FILLED,		"page_filled")			\
>>> -	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>>> +	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
>>> +	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
>>>
>>> #undef EM
>>> #undef EMe
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index f68853b3caa7..27465161fa6d 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -61,6 +61,7 @@ enum scan_result {
>>> 	SCAN_COPY_MC,
>>> 	SCAN_PAGE_FILLED,
>>> 	SCAN_PAGE_DIRTY_OR_WRITEBACK,
>>> +	SCAN_INVALID_PTES_NONE,
>>> };
>>>
>>> #define CREATE_TRACE_POINTS
>>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>>>  * PTEs for the given collapse operation.
>>>  * @cc: The collapse control struct
>>>  * @vma: The vma to check for userfaultfd
>>> + * @order: The folio order being collapsed to
>>>  *
>>>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
>>>  * collapse operation.
>>>  */
>>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>>> -		struct vm_area_struct *vma)
>>> +static int collapse_max_ptes_none(struct collapse_control *cc,
>>> +		struct vm_area_struct *vma, unsigned int order)
>>> {
>>> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
>>> 	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>>
>> One thing I still want to call out: kernel code usually uses C-style
>> comments :)
>>
>>> 	if (vma && userfaultfd_armed(vma))
>>> 		return 0;
>>> 	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>>> 	if (!cc->is_khugepaged)
>>> 		return HPAGE_PMD_NR;
>>> -	// For all other cases repect the user defined maximum.
>>> -	return khugepaged_max_ptes_none;
>>> +	// for PMD collapse, respect the user defined maximum.
>>> +	if (is_pmd_order(order))
>>> +		return max_ptes_none;
>>> +	/* Zero/non-present collapse disabled. */
>>> +	if (!max_ptes_none)
>>> +		return 0;
>>> +	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>>> +	// scale the maximum number of PTEs to the order of the collapse.
>>> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>>> +		return (1 << order) - 1;
>>> +
>>> +	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>>> +	// Emit a warning and return -EINVAL.
>>> +	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>>> +		      KHUGEPAGED_MAX_PTES_LIMIT);
>>
>> Maybe fallback to 0 instead, as David suggested earlier?
>>
> 
> It looks reasonable to fallback to 0.
> 
> But as the updated Document says in patch 14:
> 
>   For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
>   value will emit a warning and no mTHP collapse will be attempted.
> 
> This is why it does like this now.
> 
>     mthp_collapse()
>         max_ptes_none = collapse_max_ptes_none();
>         if (max_ptes_none < 0)
>             return collapsed;
> 
>> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
>> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
>> disable it :(
>>
> 
> So it depends on what we want to do here :-)
> 
> For me, I would vote for fallback to 0.

At this point I'd prefer not to return errors from collapse_max_ptes_none().
It's just rather awkward to return an error deep down in collapse code for a
configuration problem.

For mTHP collapse, we only support max_ptes_none == 0 and
max_ptes_none == HPAGE_PMD_NR - 1 (the default).

If another value is specified while collapsing mTHP, print a warning and treat
it as 0 (safe value, no creep, no memory waste).

In a sense, this is similar to how we handle max_ptes_shared and max_ptes_swap:
we always treat them as being 0 for mTHP collapse (and don't issue a
warning, because we would otherwise issue a warning with the default settings).
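
As a sketch of the policy described above, a small userspace model in C. It
covers only the khugepaged path (no userfaultfd or MADV_COLLAPSE cases), assumes
4K base pages, and uses hypothetical names: model_max_ptes_none() and the
MODEL_* constants are illustrations, and "sysctl" stands in for
khugepaged_max_ptes_none.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed values for 4K base pages; the kernel derives these itself. */
#define MODEL_PMD_ORDER		9
#define MODEL_PMD_NR		(1U << MODEL_PMD_ORDER)
#define MODEL_MAX_PTES_LIMIT	(MODEL_PMD_NR - 1)

/*
 * Model of the proposed behaviour: never return an error. Unsupported
 * values fall back to 0 for mTHP orders after a one-time warning.
 */
static unsigned int model_max_ptes_none(unsigned int sysctl, unsigned int order)
{
	static bool warned;

	assert(order <= MODEL_PMD_ORDER);

	/* PMD collapse keeps honouring the user-defined value. */
	if (order == MODEL_PMD_ORDER)
		return sysctl;
	/* Default setting: scale to the order being collapsed. */
	if (sysctl == MODEL_MAX_PTES_LIMIT)
		return (1U << order) - 1;
	/* Unsupported intermediate value: warn once, then treat as 0. */
	if (sysctl != 0 && !warned) {
		fprintf(stderr,
			"mTHP collapse only supports max_ptes_none of 0 or %u\n",
			MODEL_MAX_PTES_LIMIT);
		warned = true;
	}
	return 0;
}
```

Callers then need no error path: an intermediate value such as 100 simply
behaves like 0 for every order below PMD order.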

@Lorenzo, fine with you?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Wei Yang @ 2026-05-18 12:50 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>

On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
>The following series provides khugepaged with the capability to collapse
>anonymous memory regions to mTHPs.
>
>To achieve this we generalize the khugepaged functions to no longer depend
>on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>pages that are occupied (!none/zero). After the PMD scan is done, we use
>the bitmap to find the optimal mTHP sizes for the PMD range. The
>restriction on max_ptes_none is removed during the scan, to make sure we
>account for the whole PMD range in the bitmap. When no mTHP size is
>enabled, the legacy behavior of khugepaged is maintained.
>
>We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
>(ie 511). If any other value is specified, the kernel will emit a warning
>and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
>but contains swapped out, or shared pages, we don't perform the collapse.
>It is now also possible to collapse to mTHPs without requiring the PMD THP
>size to be enabled. These limitations are to prevent collapse "creep"
>behavior. This prevents constantly promoting mTHPs to the next available
>size, which would occur because a collapse introduces more non-zero pages
>that would satisfy the promotion condition on subsequent scans.
>
>Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
>	     for arbitrary orders.
>Patch 3:     Rework max_ptes_* handling into helper functions
>Patch 4:     Generalize __collapse_huge_page_* for mTHP support
>Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
>Patch 6:     Generalize collapse_huge_page for mTHP collapse
>Patch 7:     Skip collapsing mTHP to smaller orders
>Patch 8-9:   Add per-order mTHP statistics and tracepoints
>Patch 10:    Introduce collapse_allowable_orders helper function
>Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
>Patch 14:    Documentation
>
>Testing:
>- Built for x86_64, aarch64, ppc64le, and s390x
>- ran all arches on test suites provided by the kernel-tests project
>- internal testing suites: functional testing and performance testing
>- selftests mm
>- I created a test script that I used to push khugepaged to its limits
>   while monitoring a number of stats and tracepoints. The code is
>   available here[1] (Run in legacy mode for these changes and set mthp
>   sizes to inherit)
>   The summary from my testings was that there was no significant
>   regression noticed through this test. In some cases my changes had
>   better collapse latencies, and was able to scan more pages in the same
>   amount of time/work, but for the most part the results were consistent.
>- redis testing. I did some testing with these changes along with my defer
>  changes (see followup [2] post for more details). We've decided to get
>  the mTHP changes merged first before attempting the defer series.
>- some basic testing on 64k page size.
>- lots of general use.
>

Two links are missing. I took them from the previous version.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

The test in [1] is a performance test. I am wondering whether we also want a
functional test in selftests.

I did a quick try with the following change and some hacks.

@@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
 	ksft_test_result_report(exit_status, "%s\n", __func__);
 }
 
+static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
+{
+	struct thp_settings settings = *thp_current_settings();
+	void *p;
+	int i;
+
+	/* Disable mthp on fault */
+	for (i = 0; i < NR_ORDERS; i++) {
+		settings.hugepages[i].enabled = THP_NEVER;
+	}
+	thp_push_settings(&settings);
+
+	p = ops->setup_area(1);
+
+	ops->fault(p, 0, hpage_pmd_size);
+
+	/* Expect all order-0 folio after fault */
+	memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+	expected_orders[0] = hpage_pmd_nr;
+	if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+					   kpageflags_fd, expected_orders,
+					   (pmd_order + 1)))
+		ksft_exit_fail_msg("Unexpected huge page at fault\n");
+
+	/* Enable mthp before collapse */
+	thp_pop_settings();
+	settings.hugepages[2].enabled = THP_ALWAYS;
+	thp_push_settings(&settings);
+
+	c->collapse("Collapse fully populated PTE table with order 2", p, 1,
+		    ops, true);
+
+	/* Expect all order-2 folio after collapse */
+	memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+	expected_orders[2] = 1 << (pmd_order - 2);
+	if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+					   kpageflags_fd, expected_orders,
+					   (pmd_order + 1)))
+		ksft_exit_fail_msg("Unexpected page order\n");
+
+	ops->cleanup_area(p, hpage_pmd_size);
+	thp_pop_settings();
+	ksft_test_result_report(exit_status, "%s\n", __func__);
+}
+
 static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;

This leverages check_after_split_folio_orders() in split_huge_page_test.c to
check the folio orders in the PMD range.
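
The expected counts in the test follow from simple arithmetic: a PMD range fully
covered by folios of one order holds 2^(pmd_order - order) of them. A minimal
sketch of that calculation, where model_expect_uniform_order() is a hypothetical
helper and not part of the selftests:

```c
#include <assert.h>
#include <string.h>

/*
 * Fill expected[0..pmd_order] for a PMD range that is fully covered by
 * folios of a single order: 2^(pmd_order - order) folios of that order,
 * zero of every other order.
 */
static void model_expect_uniform_order(int *expected, int pmd_order, int order)
{
	assert(order >= 0 && order <= pmd_order);
	memset(expected, 0, sizeof(int) * (pmd_order + 1));
	expected[order] = 1 << (pmd_order - order);
}
```

With pmd_order = 9 this gives 512 order-0 folios after the fault and 128
order-2 folios after the collapse, matching the two checks in the test.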

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH 8/9] rv: Add automatic cleanup handlers for per-task HA monitors
From: Gabriele Monaco @ 2026-05-18 12:18 UTC (permalink / raw)
  To: Wen Yang; +Cc: linux-kernel, Steven Rostedt, Nam Cao, linux-trace-kernel
In-Reply-To: <843882d4-3cf1-403f-8d49-172c8efb8201@linux.dev>

On Sun, 2026-05-17 at 17:40 +0800, Wen Yang wrote:
> ha_cancel_timer_sync() is the right choice for task exit; the
> ha_mon_initializing guard correctly handles the init-window race.
> 
> One issue: after ha_monitor_disable_hook(), an in-flight
> ha_handle_sched_process_exit() handler may still be executing.  It
> reads task_mon_slot via da_get_monitor() (&p->rv[task_mon_slot]);
> da_monitor_sync_hook() = synchronize_rcu() cannot drain it because
> tracepoint handlers run outside any RCU read-side section.  If
> rv_put_task_monitor_slot() writes RV_PER_TASK_MONITOR_INIT to
> task_mon_slot first, the handler dereferences an OOB index.
> 
> This is the same race Patch 5 closes for PER_OBJ with
> tracepoint_synchronize_unregister(); the PER_TASK da_monitor_destroy()
> needs the same call (and so does every other PER_TASK monitor, not only
> the new exit handler).
> 
> Could you add tracepoint_synchronize_unregister() to the PER_TASK
> da_monitor_destroy() ?  Alternatively, we can carry the fix on top of
> your series.

Yeah you're right, that's the neatest way to solve it.

Indeed, any in-flight handler would call da_get_monitor(), not only the newly
added exit hook, so we're going to need tracepoint_synchronize_unregister()
before touching task_mon_slot in any per-task monitor (not only HA, also DA and
even LTL).

Feel free to send your patch and I'll apply it to this series, including also
ltl_monitor_destroy().

Thanks,
Gabriele

> 
> --
> Best wishes,
> Wen
> 
> 
> On 5/12/26 22:02, Gabriele Monaco wrote:
> > Hybrid automata monitors may start timers, depending on the model, these
> > may remain active on an exiting task and cause false positives or even
> > access freed memory.
> > 
> > Add an enable/disable hook in the HA code, currently only populated by
> > the per-task handler for registration and deregistration.
> > This hooks to the sched_process_exit event and ensures the timer is
> > stopped for every exiting task. The handler is enabled automatically but
> > may be disabled, for instance if the monitor uses the event for another
> > purpose (but should still manually ensure timers are stopped).
> > 
> > Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > ---
> >   include/rv/ha_monitor.h | 44 +++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 44 insertions(+)
> > 
> > diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> > index 11ae85bad492..1bdf866e9c63 100644
> > --- a/include/rv/ha_monitor.h
> > +++ b/include/rv/ha_monitor.h
> > @@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
> >   static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
> >   static inline void ha_setup_timer(struct ha_monitor *ha_mon);
> >   static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
> > +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
> >   static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
> >   					 enum states curr_state,
> >   					 enum events event,
> > @@ -38,6 +39,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
> >   #define da_monitor_reset_hook ha_monitor_reset_env
> >   #define da_monitor_sync_hook() synchronize_rcu()
> >   
> > +#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> > +/*
> > + * Automatic cleanup handlers for per-task HA monitors, only skip if you know
> > + * what you are doing (e.g. you want to implement cleanup manually in another
> > + * handler doing more things).
> > + */
> > +static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
> > +					 bool group_dead);
> > +
> > +#define ha_monitor_enable_hook()                                             \
> > +	rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
> > +			      ha_handle_sched_process_exit)
> > +#define ha_monitor_disable_hook()                                            \
> > +	rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
> > +			      ha_handle_sched_process_exit)
> > +#else
> > +#define ha_monitor_enable_hook()
> > +#define ha_monitor_disable_hook()
> > +#endif
> > +
> >   #include <rv/da_monitor.h>
> >   #include <linux/seq_buf.h>
> >   
> > @@ -124,12 +145,14 @@ static int ha_monitor_init(void)
> >   
> >   	ha_mon_initializing = true;
> >   	ret = da_monitor_init();
> > +	ha_monitor_enable_hook();
> >   	ha_mon_initializing = false;
> >   	return ret;
> >   }
> >   
> >   static void ha_monitor_destroy(void)
> >   {
> > +	ha_monitor_disable_hook();
> >   	da_monitor_destroy();
> >   }
> >   
> > @@ -230,6 +253,18 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
> >   {
> >   	CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
> >   }
> > +
> > +#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> > +static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
> > +					 bool group_dead)
> > +{
> > +	struct da_monitor *da_mon = da_get_monitor(p);
> > +
> > +	if (likely(!ha_monitor_uninitialized(da_mon)))
> > +		ha_cancel_timer_sync(to_ha_monitor(da_mon));
> > +}
> > +#endif
> > +
> >   #endif /* RV_MON_TYPE */
> >   
> >   /*
> > @@ -455,6 +490,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
> >   {
> >   	return timer_delete(&ha_mon->timer);
> >   }
> > +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
> > +{
> > +	timer_delete_sync(&ha_mon->timer);
> > +}
> >   #elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
> >   /*
> >    * Helper functions to handle the monitor timer.
> > @@ -506,6 +545,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
> >   {
> >   	return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
> >   }
> > +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
> > +{
> > +	hrtimer_cancel(&ha_mon->hrtimer);
> > +}
> >   #else /* HA_TIMER_NONE */
> >   /*
> >    * Start function is intentionally not defined, monitors using timers must
> > @@ -516,6 +559,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
> >   {
> >   	return false;
> >   }
> > +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
> >   #endif
> >   
> >   #endif



* Re: [PATCH mm-unstable v17 02/14] mm/khugepaged: generalize alloc_charge_folio()
From: Usama Arif @ 2026-05-18 11:55 UTC (permalink / raw)
  To: Nico Pache
  Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-3-npache@redhat.com>

On Mon, 11 May 2026 12:58:02 -0600 Nico Pache <npache@redhat.com> wrote:

> From: Dev Jain <dev.jain@arm.com>
> 
> Pass order to alloc_charge_folio() and update mTHP statistics.
> 
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
>  include/linux/huge_mm.h                    |  2 ++
>  mm/huge_memory.c                           |  4 ++++
>  mm/khugepaged.c                            | 17 +++++++++++------
>  4 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 5fbc3d89bb07..c51932e6275d 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -639,6 +639,14 @@ anon_fault_fallback_charge
>  	instead falls back to using huge pages with lower orders or
>  	small pages even though the allocation was successful.
>  
> +collapse_alloc
> +	is incremented every time a huge page is successfully allocated for a
> +	khugepaged collapse.
> +
> +collapse_alloc_failed
> +	is incremented every time a huge page allocation fails during a
> +	khugepaged collapse.
> +
>  zswpout
>  	is incremented every time a huge page is swapped out to zswap in one
>  	piece without splitting.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2949e5acff35..ba7ae6808544 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -128,6 +128,8 @@ enum mthp_stat_item {
>  	MTHP_STAT_ANON_FAULT_ALLOC,
>  	MTHP_STAT_ANON_FAULT_FALLBACK,
>  	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> +	MTHP_STAT_COLLAPSE_ALLOC,
> +	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
>  	MTHP_STAT_ZSWPOUT,
>  	MTHP_STAT_SWPIN,
>  	MTHP_STAT_SWPIN_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9d499da0ac7..05f482a72a89 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -699,6 +699,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>  DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
>  DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
>  DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
>  DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
>  DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
>  DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
> @@ -764,6 +766,8 @@ static struct attribute *any_stats_attrs[] = {
>  #endif
>  	&split_attr.attr,
>  	&split_failed_attr.attr,
> +	&collapse_alloc_attr.attr,
> +	&collapse_alloc_failed_attr.attr,
>  	NULL,
>  };
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 979885694351..f0e29d5c7b1f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  }
>  
>  static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> -		struct collapse_control *cc)
> +		struct collapse_control *cc, unsigned int order)
>  {
>  	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>  		     GFP_TRANSHUGE);
>  	int node = collapse_find_target_node(cc);
>  	struct folio *folio;
>  
> -	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> +	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
>  	if (!folio) {
>  		*foliop = NULL;
> -		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> +		if (is_pmd_order(order))
> +			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> +		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
>  		return SCAN_ALLOC_HUGE_PAGE_FAIL;
>  	}
>  
> -	count_vm_event(THP_COLLAPSE_ALLOC);
> +	if (is_pmd_order(order))
> +		count_vm_event(THP_COLLAPSE_ALLOC);
> +	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
> +

With this change the vmstat THP_COLLAPSE_ALLOC counter is incremented for
PMD order only. But further down in the same function we still have

	count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);

which is not guarded by is_pmd_order().

I think we want this to be PMD order only as well, so that the vmstat and
cgroup counters keep the same meaning?
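
To illustrate the divergence, here is a small userspace model of the success
path, not the kernel code itself. The model_* names, the PMD order of 9 and
the array size are assumptions made for the sketch; I am also assuming
is_pmd_order() simply compares against HPAGE_PMD_ORDER.

```c
#include <stdbool.h>

#define MODEL_PMD_ORDER 9   /* assumed, e.g. 4K pages with 2M PMDs */
#define MODEL_NR_ORDERS 10  /* orders 0..9 in this model */

struct model_counters {
	unsigned long vmstat_collapse_alloc; /* vmstat THP_COLLAPSE_ALLOC */
	unsigned long memcg_collapse_alloc;  /* memcg THP_COLLAPSE_ALLOC */
	unsigned long mthp_collapse_alloc[MODEL_NR_ORDERS]; /* per-order stat */
};

/* Assumed semantics of is_pmd_order(). */
static bool model_is_pmd_order(unsigned int order)
{
	return order == MODEL_PMD_ORDER;
}

/*
 * Account one successful collapse allocation of the given order.
 * guard_memcg selects whether the memcg event is PMD-order only
 * (the suggestion above) or counted for every order (as posted).
 */
static void model_account_collapse_alloc(struct model_counters *c,
					 unsigned int order, bool guard_memcg)
{
	if (order >= MODEL_NR_ORDERS)
		return;

	if (model_is_pmd_order(order))
		c->vmstat_collapse_alloc++;
	c->mthp_collapse_alloc[order]++;

	if (!guard_memcg || model_is_pmd_order(order))
		c->memcg_collapse_alloc++;
}
```

Accounting one order-4 and one order-9 allocation leaves both THP counters at
1 when the memcg event is guarded. Without the guard the memcg counter reaches
2 while vmstat stays at 1, which is the mismatch I am worried about once
smaller orders start going through alloc_charge_folio().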


>  	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
>  		folio_put(folio);
>  		*foliop = NULL;
> @@ -1118,7 +1123,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 */
>  	mmap_read_unlock(mm);
>  
> -	result = alloc_charge_folio(&folio, mm, cc);
> +	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
> @@ -1899,7 +1904,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>  	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>  
> -	result = alloc_charge_folio(&new_folio, mm, cc);
> +	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
>  
> -- 
> 2.54.0
> 
> 

