Linux Trace Kernel


* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-15 12:31 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <2qkbqj7c2bi7li4crheoarasvokrtxbb7ikofdv5zvsvgww5lx@bjd73tm2prfj>

On Thu, May 14, 2026 at 06:54:37PM +0200, Jakub Sitnicki wrote:
> On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote:
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> > 
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call, like:
> > 
> >   lea -0x80(%rsp), %rsp
> >   call tramp
> > 
> > Note the lea instruction is used to adjust the rsp register without
> > changing the flags.
> > 
> > The optimized uprobe performance stays the same:
> > 
> >         uprobe-nop     :    3.129 ± 0.013M/s
> >         uprobe-push    :    3.045 ± 0.006M/s
> >         uprobe-ret     :    1.095 ± 0.004M/s
> >   -->   uprobe-nop10   :    7.170 ± 0.020M/s
> >         uretprobe-nop  :    2.143 ± 0.021M/s
> >         uretprobe-push :    2.090 ± 0.000M/s
> >         uretprobe-ret  :    0.942 ± 0.000M/s
> >   -->   uretprobe-nop10:    3.381 ± 0.003M/s
> >         usdt-nop       :    3.245 ± 0.004M/s
> >   -->   usdt-nop10     :    7.256 ± 0.023M/s
> > 
> > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
> >  1 file changed, 86 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..f7c4101a4039 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -636,9 +636,21 @@ struct uprobe_trampoline {
> >  	unsigned long		vaddr;
> >  };
> >  
> > +#define LEA_INSN_SIZE		5
> > +#define OPT_INSN_SIZE		(LEA_INSN_SIZE + CALL_INSN_SIZE)
> > +#define OPT_JMP8_OFFSET		(OPT_INSN_SIZE - JMP8_INSN_SIZE)
> > +#define REDZONE_SIZE		0x80
> > +
> > +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> > +
> > +static bool is_lea_insn(const uprobe_opcode_t *insn)
> > +{
> > +	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> > +}
> > +
> 
> Just a thought. See if below maybe reads better when plugged in.
> is_call_insn can then be removed, I think.
> 
> static bool is_call_past_redzone_insns(const uprobe_opcode_t *insn)
> {
> 	static const u8 lea_rsp_call[] = {
> 		0x48, 0x8d, 0x64, 0x24, REDZONE_SIZE, /* lea -0x80(%rsp), %rsp */
> 		CALL_INSN_OPCODE
> 	};
> 
> 	return !memcmp(insn, lea_rsp_call, ARRAY_SIZE(lea_rsp_call));
> }

yep, might be easier to unify that, thanks

jirka
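
For readers following along, the unified check suggested above can be tried in userspace. This is only a sketch under assumptions: uprobe_opcode_t is replaced by uint8_t, CALL_INSN_OPCODE is taken to be 0xe8 (the x86 call rel32 opcode), and the caller is assumed to pass at least six readable bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values taken from the quoted patch; 0xe8 is the x86 "call rel32" opcode. */
#define REDZONE_SIZE     0x80
#define CALL_INSN_OPCODE 0xe8

/*
 * Match "lea -0x80(%rsp), %rsp" immediately followed by a call opcode.
 * The caller must supply at least sizeof(lea_rsp_call) readable bytes.
 */
static bool is_call_past_redzone_insns(const uint8_t *insn)
{
	static const uint8_t lea_rsp_call[] = {
		0x48, 0x8d, 0x64, 0x24, REDZONE_SIZE, /* lea -0x80(%rsp), %rsp */
		CALL_INSN_OPCODE                      /* first byte of the call */
	};
	static_assert(sizeof(lea_rsp_call) == 6, "lea (5 bytes) + call opcode (1 byte)");

	return memcmp(insn, lea_rsp_call, sizeof(lea_rsp_call)) == 0;
}
```

The call's 32-bit displacement is deliberately left out of the comparison, since it differs per probe site.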


* Re: [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: Jakub Sitnicki @ 2026-05-15 11:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-3-jolsa@kernel.org>

On Thu, May 14, 2026 at 03:53 PM +02, Jiri Olsa wrote:
> We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> fixing has_nop_combo to reflect that.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/lib/bpf/usdt.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
> index e3710933fd52..7e62e4d5bedd 100644
> --- a/tools/lib/bpf/usdt.c
> +++ b/tools/lib/bpf/usdt.c
> @@ -305,7 +305,7 @@ struct usdt_manager *usdt_manager_new(struct bpf_object *obj)
>  
>  	/*
>  	 * Detect kernel support for uprobe() syscall, it's presence means we can
> -	 * take advantage of faster nop5 uprobe handling.
> +	 * take advantage of faster nop10 uprobe handling.
>  	 * Added in: 56101b69c919 ("uprobes/x86: Add uprobe syscall to speed up uprobe")
>  	 */
>  	man->has_uprobe_syscall = kernel_supports(obj, FEAT_UPROBE_SYSCALL);
> @@ -596,14 +596,14 @@ static int parse_usdt_spec(struct usdt_spec *spec, const struct usdt_note *note,
>  #if defined(__x86_64__)
>  static bool has_nop_combo(int fd, long off)
>  {
> -	unsigned char nop_combo[6] = {
> -		0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 /* nop,nop5 */
> +	unsigned char nop_combo[11] = {
> +		0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
>  	};
> -	unsigned char buf[6];
> +	unsigned char buf[11];
>  
> -	if (pread(fd, buf, 6, off) != 6)
> +	if (pread(fd, buf, 11, off) != 11)
>  		return false;
> -	return memcmp(buf, nop_combo, 6) == 0;
> +	return memcmp(buf, nop_combo, 11) == 0;
>  }
>  #else
>  static bool has_nop_combo(int fd, long off)

Nit: Would use ARRAY_SIZE(buf) instead of repeating the scalar value
in multiple places. Otherwise:

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
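
A possible shape for the nit above, as a standalone userspace sketch. It uses sizeof() on the byte arrays in place of ARRAY_SIZE() (equivalent for unsigned char arrays) and moves the pattern to file scope so the buffer can be sized from it; that restructuring is an assumption for illustration, not what the patch does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* nop followed by the 10-byte nop, bytes copied from the quoted patch. */
static const unsigned char nop_combo[] = {
	0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
};
static_assert(sizeof(nop_combo) == 11, "nop (1 byte) + nop10 (10 bytes)");

static bool has_nop_combo(int fd, long off)
{
	unsigned char buf[sizeof(nop_combo)];

	/* A short read (e.g. near end of file) counts as "no combo". */
	if (pread(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
		return false;
	return memcmp(buf, nop_combo, sizeof(buf)) == 0;
}
```

With this layout the length appears only once, in the initializer, so a future change of the nop sequence cannot leave a stale size behind.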


* [PATCH] [v2] tracing: samples: avoid unexpected symbol warnings (arm, s390)
From: Arnd Bergmann @ 2026-05-15 10:57 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Marc Zyngier, Nathan Chancellor,
	Vincent Donnefort
  Cc: Arnd Bergmann, Mathieu Desnoyers, Paolo Bonzini, linux-kernel,
	linux-trace-kernel

From: Arnd Bergmann <arnd@arndb.de>

The now more verbose check found more architecture-specific symbols
missing from the allowlist during randconfig testing on s390
and 32-bit arm:

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
         U __aeabi_unwind_cpp_pr1

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                 U __s390_indirect_jump_r1
                 U __s390_indirect_jump_r10
                 U __s390_indirect_jump_r14
                 U __s390_indirect_jump_r2
                 U __s390_indirect_jump_r5
                 U __s390_indirect_jump_r7
                 U __s390_indirect_jump_r8
                 U __s390_indirect_jump_r9
make[6]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:160: kernel/trace/simple_ring_buffer.o.checked] Error 1

Add these to the list and keep it roughly sorted into sanitizer
and architecture symbols.

Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
v2: add both s390 and arm symbols
---
 kernel/trace/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..9b0834134cae 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -143,8 +143,8 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
 targets += undefsyms_base.o
 KASAN_SANITIZE_undefsyms_base.o := y
 
-UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
-		      __msan simple_ring_buffer \
+UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
+		      __aeabi_unwind_cpp __s390_indirect_jump __x86_indirect_thunk simple_ring_buffer \
 		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
 
 quiet_cmd_check_undefined = NM      $<
-- 
2.39.5
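
The new entries are shorter than the reported symbols (__s390_indirect_jump vs. __s390_indirect_jump_r10), which only works if allowlist entries are matched as prefixes. The matching code is not part of this hunk, so the following is a sketch of that assumed behaviour in C, with a hypothetical symbol_allowed() helper and a shortened list:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Subset of UNDEFINED_ALLOWLIST from the patch; the list is illustrative. */
static const char *const allowlist[] = {
	"__aeabi_unwind_cpp",
	"__s390_indirect_jump",
	"__x86_indirect_thunk",
	"simple_ring_buffer",
};

/* Assumption: a symbol is allowed if any allowlist entry is a prefix of it. */
static bool symbol_allowed(const char *sym)
{
	for (size_t i = 0; i < sizeof(allowlist) / sizeof(allowlist[0]); i++) {
		if (strncmp(sym, allowlist[i], strlen(allowlist[i])) == 0)
			return true;
	}
	return false;
}
```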



* Re: [PATCH] tracing: samples: avoid warning about __aeabi_unwind_cpp_pr1
From: Arnd Bergmann @ 2026-05-15 10:56 UTC (permalink / raw)
  To: Vincent Donnefort, Steven Rostedt
  Cc: Arnd Bergmann, Masami Hiramatsu, Nathan Chancellor, Marc Zyngier,
	Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <agWb6DvB1MdJ12cB@google.com>

On Thu, May 14, 2026, at 11:54, Vincent Donnefort wrote:
> On Wed, May 13, 2026 at 10:59:39AM -0400, Steven Rostedt wrote:
>> 
>> Vincent,
>> 
>> Is this patch needed? That is, did it fall through the cracks?
>
> Yes, I believe it is! 
>
> Reviewed-by: Vincent Donnefort <vdonnefort@google.com>

Just today, I came across yet another one:

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                 U __s390_indirect_jump_r1
                 U __s390_indirect_jump_r10
                 U __s390_indirect_jump_r14
                 U __s390_indirect_jump_r2
                 U __s390_indirect_jump_r5
                 U __s390_indirect_jump_r7
                 U __s390_indirect_jump_r8
                 U __s390_indirect_jump_r9

I'll send a replacement patch that addresses both, since the
old one hasn't been applied yet.

      Arnd


* Re: [RFC PATCH v2 08/10] rv/tlob: add tlob hybrid automaton monitor
From: Gabriele Monaco @ 2026-05-15  9:53 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <fe5ed6a9a0a911e6ec74dc06c453786a2c4fb6d1.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Introduce tlob (task latency over budget), a per-task hybrid-automaton
> RV monitor that measures elapsed time (CLOCK_MONOTONIC) across
> a user-delimited code section and fires an error_env_tlob tracepoint
> when the elapsed time exceeds a configurable per-invocation budget.
> 
> The monitor is built on RV_MON_PER_OBJ with HA_TIMER_HRTIMER.  Three
> states track the scheduler status of the monitored task:
> 
>   running  --(sleep)-------> sleeping
>   running  --(preempt)-----> waiting
>   sleeping --(wakeup)------> waiting
>   waiting  --(switch_in)--> running
> 
> A single clock invariant clk_elapsed < BUDGET_NS() is active in all
> three states.  The budget hrtimer is rearmed on each DA transition for
> the remaining budget, keeping the absolute deadline fixed at
> start_time + BUDGET_NS.
> 
> Per-task state is stored in the DA framework's hash table keyed by
> task->pid.  Storage is pre-allocated by tlob_start_task() with
> GFP_KERNEL via da_create_or_get() before the scheduler tracepoints
> can fire, using DA_SKIP_AUTO_ALLOC so that no kmalloc occurs on the
> tracepoint hot path.  This avoids both the kmalloc_nolock() restriction
> (requires HAVE_ALIGNED_STRUCT_PAGE) and latency issues under PREEMPT_RT.
> 
> Nested monitoring is handled by nest_depth: tlob_start_task() on an
> already-monitored pid returns -EEXIST and increments nest_depth without
> disturbing the outer window; only the outermost tlob_stop_task()
> performs real cleanup.
> 
> Two userspace interfaces are provided.  The ioctl interface exposes
> in-process self-instrumentation via /dev/rv with TLOB_IOCTL_TRACE_START
> and TLOB_IOCTL_TRACE_STOP.  The uprobe interface enables external
> monitoring of unmodified binaries via tracefs:
> 
>   echo "p PATH:OFFSET_START OFFSET_STOP threshold=NS" \
>       > /sys/kernel/tracing/rv/monitors/tlob/monitor
> 
> Violations are reported via error_env_tlob (HA clock-invariant)
> regardless of which interface triggered them.
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
[...]
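
For reference, the four transitions listed in the commit message can be modelled as a small lookup function. This is a minimal userspace sketch, not the generated tlob.h model: the sk_ enum and function names are made up for illustration, and remaining_budget_ns() only mirrors the stated rule that the absolute deadline stays at start_time + BUDGET_NS.

```c
#include <stdint.h>

/* Hypothetical names; the real model uses generated running_tlob etc. */
enum sk_state { SK_RUNNING, SK_SLEEPING, SK_WAITING, SK_INVALID };
enum sk_event { SK_SLEEP, SK_PREEMPT, SK_WAKEUP, SK_SWITCH_IN };

/* Return the next state, or SK_INVALID for a transition not in the model. */
static enum sk_state sk_next_state(enum sk_state s, enum sk_event e)
{
	switch (s) {
	case SK_RUNNING:
		if (e == SK_SLEEP)
			return SK_SLEEPING;
		if (e == SK_PREEMPT)
			return SK_WAITING;
		break;
	case SK_SLEEPING:
		if (e == SK_WAKEUP)
			return SK_WAITING;
		break;
	case SK_WAITING:
		if (e == SK_SWITCH_IN)
			return SK_RUNNING;
		break;
	default:
		break;
	}
	return SK_INVALID;
}

/* Time left until the fixed deadline; 0 once the budget is used up. */
static uint64_t remaining_budget_ns(uint64_t start_ns, uint64_t budget_ns,
				    uint64_t now_ns)
{
	uint64_t deadline = start_ns + budget_ns;

	return now_ns < deadline ? deadline - now_ns : 0;
}
```

Rearming the timer with remaining_budget_ns() on every transition keeps the expiry at the same absolute time regardless of how many state changes occur.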
> diff --git a/include/linux/rv.h b/include/linux/rv.h
> index 541ba404926a..1ea91bb3f1c2 100644
> --- a/include/linux/rv.h
> +++ b/include/linux/rv.h
> @@ -21,6 +21,13 @@
>  #include <linux/list.h>
>  #include <linux/types.h>
>  
> +/* Forward declaration: poll_table is only needed by rv_chardev_ops::poll.
> + * Avoid pulling in <linux/poll.h> from rv.h — that header is included by
> + * sched.h, and poll.h → fs.h → rcupdate.h creates a header-ordering cycle
> + * with migrate_disable() on UML/non-SMP targets.
> + */
> +struct poll_table_struct;
> +
>  /*
>   * Deterministic automaton per-object variables.
>   */
> @@ -158,6 +165,44 @@ int rv_register_monitor(struct rv_monitor *monitor, struct rv_monitor *parent);
>  int rv_get_task_monitor_slot(void);
>  void rv_put_task_monitor_slot(int slot);

Could you move everything that isn't strictly tlob-related into another
patch? This adds the ioctl functionality; can it stay on its own until
you wire it up with tlob?

[...]

> diff --git a/include/rv/automata.h b/include/rv/automata.h
> index 4a4eb40cf09a..ae819638d85a 100644
> --- a/include/rv/automata.h
> +++ b/include/rv/automata.h
> @@ -41,6 +41,21 @@ static char *model_get_event_name(enum events event)
>  	return RV_AUTOMATON_NAME.event_names[event];
>  }
>  
> +/*
> + * model_get_timer_event_name - label used when the HA timer fires (no event).
> + *
> + * Monitors may define MONITOR_TIMER_EVENT_NAME before including the model
> + * header to give the timer-fired violation a semantically meaningful label
> + * (e.g. "budget_exceeded" for tlob).  Defaults to "none".
> + */
> +#ifndef MONITOR_TIMER_EVENT_NAME
> +#define MONITOR_TIMER_EVENT_NAME "none"
> +#endif

Why don't you just override EVENT_NONE_LBL (and if you prefer call it
MONITOR_TIMER_EVENT_NAME) without the need for another function?
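
Roughly what I have in mind, as a standalone sketch. It assumes the shared header only defines EVENT_NONE_LBL when the monitor has not already done so; whether the current definition is guarded like that is not shown here, and sk_timer_label is a made-up name used only to show the macro expanding in place.

```c
#include <stdio.h>

/* Monitor side: pick the label before pulling in the shared header. */
#define EVENT_NONE_LBL "budget_exceeded"

/* Shared header side (assumed): fall back to the default if not overridden. */
#ifndef EVENT_NONE_LBL
#define EVENT_NONE_LBL "none"
#endif

/* The macro is used directly wherever the label is needed; no accessor. */
static const char *const sk_timer_label = EVENT_NONE_LBL;

static void sk_report_timer_violation(void)
{
	printf("violation: event=%s\n", EVENT_NONE_LBL);
}
```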

> +static inline char *model_get_timer_event_name(void)
> +{
> +	return MONITOR_TIMER_EVENT_NAME;
> +}
> +

[...]

> diff --git a/include/rv/rv_uprobe.h b/include/rv/rv_uprobe.h
> index 084cdb36a2ff..9106c5c9275e 100644
> --- a/include/rv/rv_uprobe.h
> +++ b/include/rv/rv_uprobe.h
> @@ -79,9 +79,41 @@ struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
>   * for any in-progress handler to finish, then releases the path reference
>   * and frees the rv_uprobe struct.  The caller's priv data is NOT freed.
>   *
> + * When removing a single probe, prefer this over the three-phase API.
>   * Safe to call from process context only (uprobe_unregister_sync() may
>   * schedule).
>   */
>  void rv_uprobe_detach(struct rv_uprobe *p);

Why don't you put all this in the patch about uprobes?

>  
> +/**
> + * rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
> + * @p:  probe to dequeue; may be NULL (no-op)
> + *
> + * Removes the uprobe from the uprobe subsystem but does NOT wait for
> + * in-flight handlers to complete.  The caller must call rv_uprobe_sync()
> + * before calling rv_uprobe_free() on the same probe.
> + *
> + * Use this to batch multiple deregistrations before a single rv_uprobe_sync().
> + */
> +void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
> +
> +/**
> + * rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
> + *
> + * Global barrier: waits for every in-flight uprobe handler across the system
> + * to finish.  Call once after a batch of rv_uprobe_unregister_nosync() calls
> + * and before any rv_uprobe_free() call.
> + */
> +void rv_uprobe_sync(void);
> +
> +/**
> + * rv_uprobe_free - release resources of a previously deregistered probe
> + * @p:  probe to free; may be NULL (no-op)
> + *
> + * Releases the path reference and frees the rv_uprobe struct.  Must only
> + * be called after rv_uprobe_sync() has returned.  The caller's priv data
> + * is NOT freed.
> + */
> +void rv_uprobe_free(struct rv_uprobe *p);
> +
>  #endif /* _RV_UPROBE_H */
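
The call order that the kerneldoc above prescribes for the three-phase API (unregister_nosync on every probe, one sync, then free) can be illustrated in userspace. The three rv_uprobe_* functions below are stand-in stubs with the documented names, not the kernel implementations, and detach_batch() and sync_calls are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stub type and functions standing in for the API in rv_uprobe.h. */
struct rv_uprobe { int registered; };

static int sync_calls;

static void rv_uprobe_unregister_nosync(struct rv_uprobe *p)
{
	if (p)			/* NULL is documented as a no-op */
		p->registered = 0;
}

static void rv_uprobe_sync(void)
{
	sync_calls++;		/* the real call waits for in-flight handlers */
}

static void rv_uprobe_free(struct rv_uprobe *p)
{
	free(p);		/* free(NULL) is a no-op as well */
}

/* Tear down n probes while paying for the global barrier only once. */
static void detach_batch(struct rv_uprobe **probes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		rv_uprobe_unregister_nosync(probes[i]);

	rv_uprobe_sync();

	for (i = 0; i < n; i++) {
		rv_uprobe_free(probes[i]);
		probes[i] = NULL;
	}
}
```

Calling rv_uprobe_detach() in a loop would instead wait once per probe, which is the cost the batch form is meant to avoid.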
> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000000..a34e5426393b
> --- /dev/null
> +++ b/include/uapi/linux/rv.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC ('r').
> + *
> + * Usage examples and design rationale are in:
> + *   Documentation/trace/rv/monitor_tlob.rst
> + */

Same as above, this could be in a separate patch.

> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
[...]
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index f139b904bea3..8a5b5c84aff9 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -2,7 +2,7 @@
>  
>  ccflags-y += -I $(src)		# needed for trace events
>  
> -obj-$(CONFIG_RV) += rv.o
> +obj-$(CONFIG_RV) += rv.o rv_chardev.o

Same here.

>  obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>  obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>  obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -0,0 +1,69 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_TLOB
> +	depends on RV
> +	select RV_UPROBE
> +	select HA_MON_EVENTS_ID
> +	bool "tlob monitor"
> +	help
> +	  Enable the tlob (task latency over budget) monitor.  This monitor
> +	  tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
> +	  within a task (including both on-CPU and off-CPU time) and reports
> +	  a violation when the elapsed time exceeds a configurable budget.
> +
> +	  The monitor uses a three-state hybrid automaton (running, waiting,
> +	  sleeping) stored per object using RV_MON_PER_OBJ.  A single HA
> +	  clock invariant (clk_elapsed < BUDGET_NS) is enforced in all three
> +	  states via a per-task hrtimer.
> +
> +	  States: running (initial, on-CPU), waiting (in runqueue, off-CPU),
> +	          sleeping (blocked on resource, off-CPU).
> +	  Key transitions:
> +	    running  --(sleep)------> sleeping
> +	    running  --(preempt)----> waiting
> +	    sleeping --(wakeup)-----> waiting
> +	    waiting  --(switch_in)--> running
> +	  task_start calls da_handle_start_event() to set the initial state,
> +	  then arms the budget timer directly via ha_reset_clk_ns() +
> +	  ha_start_timer_ns().  task_stop cancels the timer synchronously via
> +	  ha_cancel_timer_sync() then calls da_monitor_reset().
> +
> +	  Two userspace interfaces are provided:
> +
> +	  tracefs uprobe binding (external, unmodified binaries):
> +	    echo "p PATH:OFFSET_START OFFSET_STOP threshold=NS" \
> +	        > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +	  The uprobe at offset_start fires tlob_start_task(); the uprobe at
> +	  offset_stop fires tlob_stop_task().  Both are plain entry uprobes
> +	  so a mistyped offset cannot corrupt the call stack.
> +
> +	  /dev/rv ioctl (in-process self-instrumentation):
> +	    ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +	    do_critical_work();
> +	    ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	    /* ret == -EOVERFLOW when budget exceeded */
> +	  Allows conditional monitoring, sub-function granularity, and
> +	  inline reaction to violations without polling the trace buffer.
> +
> +	  Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
> +
> +	  Violations are always reported via the standard error_env_tlob RV
> +	  tracepoint regardless of which interface triggered them.  The
> +	  tracefs interface requires only tracefs write permissions, avoiding
> +	  the CAP_BPF privilege needed for equivalent eBPF-based approaches.
> +
> +	  For further information, see:
> +	    Documentation/trace/rv/monitor_tlob.rst
> +
> +config TLOB_KUNIT_TEST

Do you need to add this here? Since you have a patch adding KUnit tests
to tlob, can't you put everything KUnit-related there?

That's also going to simplify things since RV KUnits aren't stable right
now.

> +	tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS

I couldn't build it as a module; do we need it that way?

  ERROR: modpost: "sched_setscheduler_nocheck" [kernel/trace/rv/monitors/tlob/tlob_kunit.ko] undefined!

> +	depends on RV_MON_TLOB && KUNIT
> +	default KUNIT_ALL_TESTS
> +	help
> +	  Enable KUnit in-kernel unit tests for the tlob RV monitor.
> +
> +	  Tests cover automaton state transitions, the start/stop task
> +	  interface, scheduler context-switch accounting, and the uprobe
> +	  format string parser.
> +
> +	  Say Y or M here to run the tlob KUnit test suite; otherwise say N.
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000000..475e972ae9aa
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,1307 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob: task latency over budget monitor
> + *
> + * Track the elapsed wall-clock time of a marked code path and detect when
> + * a monitored task exceeds its per-task latency budget.  CLOCK_MONOTONIC
> + * is used so both on-CPU and off-CPU time count toward the budget.
> + *
> + * On a budget violation, two tracepoints are emitted from the hrtimer
> + * callback: error_env_tlob signals the violation, and detail_env_tlob
> + * provides a per-state time breakdown (running_ns, waiting_ns, sleeping_ns)
> + * that pinpoints whether the overrun occurred in running, waiting, or sleeping state.
> + *
> + * The monitor uses RV_MON_PER_OBJ: per-task state (struct tlob_task_state)
> + * is stored as monitor_target in the framework's hash table.
> + *
> + * One HA clock invariant is enforced:
> + *   clk_elapsed < BUDGET_NS()   (active in all states)
> + *
> + * task_start uses da_handle_start_event() to set the initial state, then
> + * calls ha_reset_clk_ns() + ha_start_timer_ns() directly to initialise the
> + * clock and arm the budget timer.  No synthetic event is needed.
> + * The HA timer is cancelled synchronously by ha_cancel_timer_sync() in
> + * tlob_stop_task().
> + *
> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
> + */
> +#include <linux/completion.h>
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/refcount.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +#include <rv/rv_uprobe.h>
> +#include <uapi/linux/rv.h>
> +#include "../../rv.h"
> +
> +#define MODULE_NAME "tlob"
> +
> +#include <trace/events/sched.h>
> +#include <rv_trace.h>
> +
> +/*
> + * Per-fd private data; one instance per open /dev/rv fd.
> + * monitoring: set while TRACE_START is active; cleared at TRACE_STOP.
> + * budget_exceeded: set by hrtimer callback; read at TRACE_STOP to report
> + * -EOVERFLOW even when cleanup was claimed by a concurrent stop_all or
> + * a task-exit handler.
> + */
> +struct tlob_fpriv {
> +	struct task_struct	*task;
> +	bool			monitoring;
> +	bool			budget_exceeded;
> +};
> +
> +/*
> + * Per-task latency monitoring state.  One instance per monitoring window.
> + * Stored as monitor_target in da_monitor_storage; freed via call_rcu.
> + */
> +struct tlob_task_state {
> +	struct task_struct	*task;		/* via get_task_struct */
> +	u64			threshold_us;	/* budget in microseconds */
> +
> +	/* 1 = cleanup claimed; ha_setup_invariants won't restart the timer. */
> +	atomic_t		stopping;
> +
> +	/* Serialises the ns accumulators; held briefly (hardirq-safe). */
> +	raw_spinlock_t		entry_lock;
> +	u64			running_ns;	/* time in running state  */
> +	u64			waiting_ns;	/* time in waiting state  */
> +	u64			sleeping_ns;	/* time in sleeping state */
> +	ktime_t			last_ts;
> +
> +	/* store-release in TRACE_START ioctl, load-acquire in reset_notify. */
> +	struct tlob_fpriv	*fpriv;
> +
> +	struct rcu_head		rcu;		/* for call_rcu() teardown */
> +};
> +
> +#define RV_MON_TYPE RV_MON_PER_OBJ
> +#define HA_TIMER_TYPE HA_TIMER_HRTIMER
> +/* Pool mode: da_handle_start_event uses da_fill_empty_storage, not kmalloc. */
> +#define DA_SKIP_AUTO_ALLOC
> +
> +/* Type for da_monitor_storage.target; must be defined before the includes. */
> +typedef struct tlob_task_state *monitor_target;
> +
> +/* Forward-declared so da_monitor_reset_hook works before ha_monitor.h. */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon);
> +#define da_monitor_reset_hook tlob_reset_notify
> +
> +/*
> + * When the hrtimer fires (budget elapsed), the HA framework emits
> + * error_env_tlob with this label instead of the generic "none".
> + */
> +#define MONITOR_TIMER_EVENT_NAME "budget_exceeded"
> +
> +#include "tlob.h"
> +#include <rv/ha_monitor.h>
> +
> +/*
> + * Called from da_monitor_reset() on both normal stop and hrtimer expiry.
> + * On violation (stopping==0), emits detail_env_tlob.
> + */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon)
> +{
> +	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> +	struct tlob_task_state *ws;
> +
> +	ha_monitor_reset_env(da_mon);
> +
> +	ws = ha_get_target(ha_mon);
> +	if (!ws)
> +		return;
> +
> +	/*
> +	 * Emit per-state breakdown on budget violation only.
> +	 * stopping==0: timer callback owns this path (genuine overrun).
> +	 * stopping==1: normal stop claimed ownership first; skip.
> +	 */
> +	if (!atomic_read(&ws->stopping)) {
> +		unsigned int curr_state = READ_ONCE(da_mon->curr_state);
> +		u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
> +		struct tlob_fpriv *fp;
> +		unsigned long flags;
> +
> +		/*
> +		 * Snapshot accumulators; partial_ns covers curr_state time
> +		 * not yet folded in (transition-out pending).
> +		 */
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		partial_ns   = ktime_get_ns() - ktime_to_ns(ws->last_ts);
> +		running_ns   = ws->running_ns  +
> +			       (curr_state == running_tlob  ? partial_ns : 0);
> +		waiting_ns   = ws->waiting_ns  +
> +			       (curr_state == waiting_tlob  ? partial_ns : 0);
> +		sleeping_ns  = ws->sleeping_ns +
> +			       (curr_state == sleeping_tlob ? partial_ns : 0);
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +
> +		trace_detail_env_tlob(da_get_id(da_mon), ws->threshold_us,
> +				      running_ns, waiting_ns, sleeping_ns);
> +
> +		/*
> +		 * Latch violation in the fd so TRACE_STOP can return -EOVERFLOW
> +		 * even if a concurrent stop_all or task-exit handler claims
> +		 * cleanup first.  Pairs with smp_store_release in TRACE_START.
> +		 */
> +		fp = smp_load_acquire(&ws->fpriv);
> +		if (fp)
> +			WRITE_ONCE(fp->budget_exceeded, true);
> +	}
> +}
> +
> +#define BUDGET_US(ha_mon) (ha_get_target(ha_mon)->threshold_us)
> +#define BUDGET_NS(ha_mon) (BUDGET_US(ha_mon) * 1000ULL)
> +
> +/* HA constraint functions (called by ha_monitor_handle_constraint) */
> +
> +static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
> +{
> +	if (env == clk_elapsed_tlob)
> +		return ha_get_clk_ns(ha_mon, env, time_ns);
> +	return ENV_INVALID_VALUE;
> +}
> +
> +static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
> +{
> +	if (env == clk_elapsed_tlob)
> +		ha_reset_clk_ns(ha_mon, env, time_ns);
> +}
> +
> +/*
> + * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all states.
> + */
> +static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
> +					enum states curr_state, enum events event,
> +					enum states next_state, u64 time_ns)
> +{
> +	if (curr_state == running_tlob)
> +		return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
> +	else if (curr_state == sleeping_tlob)
> +		return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
> +	else if (curr_state == waiting_tlob)
> +		return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
> +	return true;
> +}
> +
> +/*
> + * Convert invariant (deadline) to guard (reset anchor) on state transitions.
> + * Skip if uninitialised (ENV_INVALID_VALUE): the race between
> + * da_handle_start_event() and ha_reset_clk_ns() would give U64_MAX - BUDGET_NS.
> + */
> +static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
> +					enum states curr_state, enum events event,
> +					enum states next_state, u64 time_ns)
> +{
> +	if (curr_state == next_state)
> +		return;
> +	if (curr_state == running_tlob &&
> +	    !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +	else if (curr_state == sleeping_tlob &&
> +		 !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +	else if (curr_state == waiting_tlob &&
> +		 !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +}
> +
> +/* No per-event guard conditions for tlob; invariants suffice. */
> +static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> +				    enum states curr_state, enum events event,
> +				    enum states next_state, u64 time_ns)
> +{
> +	return true;
> +}
> +
> +/*
> + * Arm or cancel the HA budget timer on state transitions.
> + * Guard on stopping: sched_switch events can arrive after ha_cancel_timer_sync,
> + * restarting the timer and triggering an ODEBUG "activate active" splat.
> + */
> +static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
> +				       enum states curr_state, enum events event,
> +				       enum states next_state, u64 time_ns)
> +{
> +	if (next_state == curr_state)
> +		return;
> +	if (next_state == running_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +	} else if (next_state == sleeping_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +	} else if (next_state == waiting_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +	} else if (curr_state == running_tlob)
> +		ha_cancel_timer(ha_mon);
> +	else if (curr_state == waiting_tlob)
> +		ha_cancel_timer(ha_mon);
> +	else if (curr_state == sleeping_tlob)
> +		ha_cancel_timer(ha_mon);
> +}
> +
> +static bool ha_verify_constraint(struct ha_monitor *ha_mon,
> +				 enum states curr_state, enum events event,
> +				 enum states next_state, u64 time_ns)
> +{
> +	if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
> +		return false;
> +
> +	ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
> +
> +	if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
> +		return false;
> +
> +	ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
> +
> +	return true;
> +}
> +
> +static struct kmem_cache *tlob_state_cache;
> +
> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
> +
> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
> +static LIST_HEAD(tlob_uprobe_list);
> +static DEFINE_MUTEX(tlob_uprobe_mutex);
> +
> +/*
> + * Serialises duplicate-check + da_create_or_get() to prevent two concurrent
> + * callers for the same pid from both inserting into the hash table.
> + */
> +static DEFINE_MUTEX(tlob_start_mutex);
> +
> +/*
> + * Counts open /dev/rv fds plus one synthetic ref held while enabled.
> + * __tlob_destroy_monitor() drops the synthetic ref and waits for zero
> + * before teardown, preventing kmem_cache_zalloc() on a destroyed cache.
> + */
> +static refcount_t tlob_fd_refcount = REFCOUNT_INIT(0);
> +static DECLARE_COMPLETION(tlob_fd_released);
> +
> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
> +struct tlob_uprobe_binding {
> +	struct list_head	list;
> +	u64			threshold_us;
> +	char			binpath[TLOB_MAX_PATH];
> +	loff_t			offset_start;
> +	loff_t			offset_stop;
> +	struct rv_uprobe	*start_probe;
> +	struct rv_uprobe	*stop_probe;
> +};
> +
> +/* RCU callback: free the slab once no readers remain. */
> +static void tlob_free_rcu(struct rcu_head *head)
> +{
> +	struct tlob_task_state *ws =
> +		container_of(head, struct tlob_task_state, rcu);
> +	kmem_cache_free(tlob_state_cache, ws);
> +}
> +
> +/*
> + * handle_sched_switch - advance the DA on every context switch.
> + *
> + * Generates three DA events:
> + *   prev, prev_state != 0  -> sleep_tlob    (running -> sleeping)
> + *   prev, prev_state == 0  -> preempt_tlob  (running -> waiting)
> + *   next                   -> switch_in_tlob (waiting -> running)
> + */
> +static void handle_sched_switch(void *data, bool preempt_unused,
> +				struct task_struct *prev,
> +				struct task_struct *next,
> +				unsigned int prev_state)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +	bool do_prev = false, do_next = false;
> +	bool prev_preempted;
> +	ktime_t now;
> +

Perhaps keep the handler simpler by moving this accounting to a helper
function and using guard(rcu)() there.

> +	rcu_read_lock();
> +
> +	ws = da_get_target_by_id(prev->pid);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		now = ktime_get();
> +		ws->running_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> +		ws->last_ts = now;
> +		/* prev_state == 0: TASK_RUNNING (preempted); != 0: sleeping. */
> +		prev_preempted = (prev_state == 0);
> +		do_prev = true;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +
> +	ws = da_get_target_by_id(next->pid);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		now = ktime_get();
> +		ws->waiting_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> +		ws->last_ts = now;
> +		do_next = true;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +
> +	rcu_read_unlock();
> +

You probably don't need these. da_handle_event should skip tasks without
a monitor.

> +	if (do_prev)
> +		da_handle_event(prev->pid, NULL,
> +				prev_preempted ? preempt_tlob : sleep_tlob);
> +	if (do_next)
> +		da_handle_event(next->pid, NULL, switch_in_tlob);
> +}
> +
> +/*
> + * handle_sched_wakeup - sleeping -> waiting transition.
> + *
> + * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires for a
> + * task already in running or waiting state.
> + */
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +	bool found = false;
> +

Same as above to keep the handler simple.

> +	rcu_read_lock();
> +	ws = da_get_target_by_id(p->pid);
> +	if (ws) {
> +		ktime_t now = ktime_get();
> +
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		ws->sleeping_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> +		ws->last_ts = now;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +		found = true;
> +	}
> +	rcu_read_unlock();
> +
> +	if (found)

You probably don't need this. da_handle_event should skip tasks without
a monitor.

> +		da_handle_event(p->pid, NULL, wakeup_tlob);
> +}
> +
> +/*
> + * handle_sched_process_exit - clean up if a task exits without TRACE_STOP.
> + *
> + * Called in do_exit() context; the task still has a valid pid here.
> + */
> +static void handle_sched_process_exit(void *data, struct task_struct *p,
> +				       bool group_dead)
> +{
> +	struct tlob_task_state *ws;
> +	bool found = false;
> +

> +	rcu_read_lock();
> +	ws = da_get_target_by_id(p->pid);
> +	found = !!ws;
> +	rcu_read_unlock();
> +
> +	if (found)

You can skip all this here.

> +		tlob_stop_task(p);
> +}
> +
> +
> +
> +/**
> + * tlob_start_task - begin monitoring @task with budget @threshold_us us.
> + * @task:         Task to monitor; may be current or another task.
> + * @threshold_us: Latency budget in microseconds (wall-clock; running + waiting + sleeping). > 0.
> + *
> + * Returns 0, -ENODEV, -EALREADY, -ENOSPC, or -ENOMEM.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us)
> +{
> +	struct tlob_task_state *ws_existing;
> +	struct tlob_task_state *ws;
> +	struct da_monitor *da_mon;
> +	struct ha_monitor *ha_mon;
> +	u64 now_ns;
> +	int ret;
> +
> +	if (!da_monitor_enabled())
> +		return -ENODEV;
> +
> +	if (threshold_us == 0)
> +		return -ERANGE;
> +
> +	/* Serialise duplicate-check + da_create_or_get for the same pid. */
> +	guard(mutex)(&tlob_start_mutex);
> +
> +	rcu_read_lock();

That should be a scoped_guard(rcu). Definitely use guards when you have
early return paths: the compiler cleans up (unlocks) for you.
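
To illustrate the point about early returns, here is a minimal userspace
sketch of the mechanism. The kernel guards in include/linux/cleanup.h are
built on the compiler's cleanup attribute; everything named demo_* or
DEMO_* below, the read_depth counter and the one-slot monitored_pid
"table" are hypothetical stand-ins, not RV or RCU API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for rcu_read_lock()/rcu_read_unlock(): a nesting counter. */
static int read_depth;

static void demo_read_lock(void)   { read_depth++; }
static void demo_read_unlock(void) { read_depth--; }

/* Called by the compiler when the guard variable goes out of scope. */
static void demo_guard_exit(int *unused)
{
	(void)unused;
	demo_read_unlock();
}

/* Take the lock now; release it automatically at the end of the enclosing scope. */
#define DEMO_GUARD() \
	int demo_guard __attribute__((cleanup(demo_guard_exit), unused)) = \
		(demo_read_lock(), 0)

static int monitored_pid = -1;	/* hypothetical one-slot "hash table" */

static int demo_start(int pid)
{
	{
		DEMO_GUARD();
		if (monitored_pid == pid)
			return -EALREADY;	/* no explicit unlock needed */
	}

	/* The read side is already dropped here, so sleeping work is fine. */
	assert(read_depth == 0);
	monitored_pid = pid;
	return 0;
}
```

The inner braces play the role of the scoped_guard() body: the lock is
held only for the duplicate check, and both the early return and the
fall-through path leave read_depth balanced.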

> +	ws_existing = da_get_target_by_id(task->pid);
> +	if (ws_existing) {
> +		rcu_read_unlock();
> +		return -EALREADY;
> +	}
> +	rcu_read_unlock();
> +
> +	ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
> +	if (!ws)
> +		return -ENOMEM;
> +
> +	ws->task = task;
> +	get_task_struct(task);
> +	ws->threshold_us = threshold_us;
> +	ws->last_ts = ktime_get();
> +	raw_spin_lock_init(&ws->entry_lock);
> +
> +	/* Claim a pool slot (no kmalloc; DA_SKIP_AUTO_ALLOC + prealloc). */
> +	ret = da_create_or_get(task->pid, ws);
> +	if (ret) {
> +		put_task_struct(task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return ret;
> +	}
> +
> +	atomic_inc(&tlob_num_monitored);
> +
> +	/* Hold RCU across handle + timer setup to keep da_mon valid. */
> +	rcu_read_lock();

Same here about guards.
Sadly there doesn't seem to be a cleanup helper for kmem_cache_free; it
would be worth adding one. You also have a lot of other things to undo
here, so it isn't a big deal.
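
For reference, a rough userspace sketch of what such a helper buys you,
modelled on the DEFINE_FREE()/__free()/no_free_ptr() pattern from
include/linux/cleanup.h. It uses calloc()/free() instead of a slab cache,
and struct demo_state, DEMO_AUTO_FREE, demo_take() and the demo_frees
counter are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_state {
	int pid;
	unsigned long long threshold_us;
};

static int demo_frees;	/* counts automatic frees, for illustration only */

/* Cleanup handler: frees the object unless ownership was handed off. */
static void demo_state_free(struct demo_state **p)
{
	if (*p) {
		free(*p);
		demo_frees++;
	}
}

#define DEMO_AUTO_FREE __attribute__((cleanup(demo_state_free)))

/* Hand ownership to the caller, in the spirit of no_free_ptr(). */
static struct demo_state *demo_take(struct demo_state **p)
{
	struct demo_state *ret = *p;

	*p = NULL;
	return ret;
}

static struct demo_state *demo_state_create(int pid,
					     unsigned long long threshold_us)
{
	struct demo_state *ws DEMO_AUTO_FREE = calloc(1, sizeof(*ws));

	if (!ws)
		return NULL;
	if (threshold_us == 0)
		return NULL;	/* ws is freed automatically */

	ws->pid = pid;
	ws->threshold_us = threshold_us;
	return demo_take(&ws);	/* success: caller now owns ws */
}
```

Every error path after the allocation becomes a plain return, which is
the part that would shrink in tlob_start_task(); the other undo steps
there (task reference, counter, storage slot) would still need their own
handling.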

> +	da_handle_start_event(task->pid, ws, switch_in_tlob);
> +	da_mon = da_get_monitor(task->pid, NULL);
> +	if (unlikely(!da_mon)) {
> +		/* Slot registered; missing da_mon means concurrent destroy. */
> +		rcu_read_unlock();
> +		da_destroy_storage(task->pid);
> +		atomic_dec(&tlob_num_monitored);
> +		put_task_struct(task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return -ENOMEM;
> +	}
> +	ha_mon = to_ha_monitor(da_mon);
> +	now_ns = ktime_get_ns();
> +	ha_reset_env(ha_mon, clk_elapsed_tlob, now_ns);
> +	ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), now_ns);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
> +
> +/**
> + * tlob_stop_task - stop monitoring @task.
> + * @task: Task to stop.
> + *
> + * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
> + * the winner cancels the timer synchronously and frees all resources.
> + *
> + * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
> + * or -EAGAIN (concurrent caller claimed cleanup).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> +	struct da_monitor *da_mon;
> +	struct ha_monitor *ha_mon;
> +	struct tlob_task_state *ws;
> +	bool budget_exceeded;
> +
> +	rcu_read_lock();
> +	ws = da_get_target_by_id(task->pid);
> +	if (!ws) {
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	da_mon = da_get_monitor(task->pid, NULL);
> +	if (unlikely(!da_mon)) {
> +		/* ws in hash but da_mon gone; internal inconsistency. */
> +		rcu_read_unlock();
> +		WARN_ON_ONCE(1);
> +		return -ESRCH;
> +	}
> +
> +	ha_mon = to_ha_monitor(da_mon);
> +
> +	/*
> +	 * CAS (0->1) claims cleanup ownership under RCU (ws guaranteed valid).
> +	 * _release pairs with atomic_read_acquire in ha_setup_invariants.
> +	 */
> +	if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
> +		rcu_read_unlock();
> +		return -EAGAIN;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	/* Wait for in-flight timer callback before reading da_monitoring. */
> +	ha_cancel_timer_sync(ha_mon);
> +
> +	/* Timer fired first -> budget exceeded; otherwise reset normally. */
> +	rcu_read_lock();
> +	budget_exceeded = !da_monitoring(da_mon);
> +	if (!budget_exceeded)
> +		da_monitor_reset(da_mon);
> +	rcu_read_unlock();
> +	da_destroy_storage(task->pid);
> +	atomic_dec(&tlob_num_monitored);
> +
> +	put_task_struct(ws->task);
> +	call_rcu(&ws->rcu, tlob_free_rcu);
> +	return budget_exceeded ? -EOVERFLOW : 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +static void tlob_stop_all(void)
> +{

All this function does should be done by da_monitor_destroy. It does
have some concurrency issues I'm trying to fix, but there's no reason
not to use it.

We could add a way to hook in the additional deallocation you're doing
on each storage.

Something like a da_extra_cleanup() that you can define as whatever you
need and that gets called in all per-object destruction paths.

In general, let's try to use/extend the RV API as much as possible
rather than re-implementing things.
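
A rough userspace sketch of the hook idea, assuming it takes the shape of
a macro that the monitor defines before the generic code and that
defaults to a no-op. All names here (demo_extra_cleanup, struct
demo_storage, demo_destroy_storage and so on) are hypothetical; none of
this is existing RV API.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_target  { int pid; };
struct demo_storage { int id; struct demo_target *target; };

/* --- monitor side: define the hook before the generic code --- */
static int demo_cleanups;

static void demo_tlob_cleanup(struct demo_storage *ms)
{
	if (ms->target) {
		free(ms->target);	/* monitor-private teardown */
		ms->target = NULL;
		demo_cleanups++;
	}
}
#define demo_extra_cleanup(ms) demo_tlob_cleanup(ms)

/* --- generic side: fall back to a no-op if no hook was defined --- */
#ifndef demo_extra_cleanup
#define demo_extra_cleanup(ms) do { } while (0)
#endif

static struct demo_storage *demo_storage_new(int id)
{
	struct demo_storage *ms = calloc(1, sizeof(*ms));

	if (ms) {
		ms->id = id;
		ms->target = calloc(1, sizeof(*ms->target));
	}
	return ms;
}

/* Single per-object destruction path: the hook runs on every one. */
static void demo_destroy_storage(struct demo_storage *ms)
{
	demo_extra_cleanup(ms);
	free(ms);
}

/* Bulk teardown reuses the same path instead of open-coding the cleanup. */
static void demo_destroy_all(struct demo_storage **tbl, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i]) {
			demo_destroy_storage(tbl[i]);
			tbl[i] = NULL;
		}
	}
}
```

With the hook called from the one destruction routine, a monitor-level
"stop all" loop has nothing left to do that the generic destroy-all does
not already cover.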

> +	struct da_monitor_storage *ms;
> +	pid_t pids[TLOB_MAX_MONITORED];
> +	int bkt, n = 0;
> +
> +	/* Snapshot pids under RCU; re-derive ws under a fresh lock below. */
> +	rcu_read_lock();
> +	hash_for_each_rcu(da_monitor_ht, bkt, ms, node) {
> +		if (ms->target && n < TLOB_MAX_MONITORED)
> +			pids[n++] = ms->id;
> +	}
> +	rcu_read_unlock();
> +
> +	for (int i = 0; i < n; i++) {
> +		pid_t pid = pids[i];
> +		struct da_monitor *da_mon;
> +		struct ha_monitor *ha_mon;
> +		struct tlob_task_state *ws;
> +
> +		rcu_read_lock();
> +		da_mon = da_get_monitor(pid, NULL);
> +		if (!da_mon) {
> +			/* Cleaned up by tlob_stop_task or exit handler. */
> +			rcu_read_unlock();
> +			continue;
> +		}
> +
> +		ws = da_get_target(da_mon);
> +		ha_mon = to_ha_monitor(da_mon);
> +
> +		/* CAS (0->1) claims ownership; skip if another caller won. */
> +		if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
> +			rcu_read_unlock();
> +			continue;
> +		}
> +		rcu_read_unlock();
> +
> +		ha_cancel_timer_sync(ha_mon);
> +
> +		scoped_guard(rcu) {
> +			da_monitor_reset(da_mon);
> +		}
> +		da_destroy_storage(pid);
> +		atomic_dec(&tlob_num_monitored);
> +		put_task_struct(ws->task);
> +		call_rcu(&ws->rcu, tlob_free_rcu);
> +	}
> +}
> +
> +static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs *regs,
> +				     __u64 *data)
> +{
> +	struct tlob_uprobe_binding *b = p->priv;
> +
> +	tlob_start_task(current, b->threshold_us);
> +	return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs *regs,
> +				    __u64 *data)
> +{
> +	tlob_stop_task(current);
> +	return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
> +			   loff_t offset_start, loff_t offset_stop)
> +{
> +	struct tlob_uprobe_binding *b, *tmp_b;
> +	char pathbuf[TLOB_MAX_PATH];
> +	struct path path;
> +	char *canon;
> +	int ret;
> +
> +	if (binpath[0] != '/')
> +		return -EINVAL;
> +
> +	b = kzalloc_obj(*b, GFP_KERNEL);
> +	if (!b)
> +		return -ENOMEM;
> +
> +	b->threshold_us = threshold_us;
> +	b->offset_start = offset_start;
> +	b->offset_stop  = offset_stop;
> +
> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
> +	if (ret)
> +		goto err_free;
> +
> +	if (!d_is_reg(path.dentry)) {
> +		ret = -EINVAL;
> +		goto err_path;
> +	}
> +
> +	/* Reject duplicate start offset for the same binary. */
> +	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> +		if (tmp_b->offset_start == offset_start &&
> +		    tmp_b->start_probe->path.dentry == path.dentry) {
> +			ret = -EEXIST;
> +			goto err_path;
> +		}
> +	}
> +
> +	canon = d_path(&path, pathbuf, sizeof(pathbuf));
> +	if (IS_ERR(canon)) {
> +		ret = PTR_ERR(canon);
> +		goto err_path;
> +	}
> +	strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> +	/* Both probes share b (priv) and path; attach_path refs path itself. */
> +	b->start_probe = rv_uprobe_attach_path(&path, offset_start,
> +					       tlob_uprobe_entry_handler, NULL, b);
> +	if (IS_ERR(b->start_probe)) {
> +		ret = PTR_ERR(b->start_probe);
> +		b->start_probe = NULL;
> +		goto err_path;
> +	}
> +
> +	b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
> +					      tlob_uprobe_stop_handler, NULL, b);
> +	if (IS_ERR(b->stop_probe)) {
> +		ret = PTR_ERR(b->stop_probe);
> +		b->stop_probe = NULL;
> +		goto err_start;
> +	}
> +
> +	path_put(&path);
> +	list_add_tail(&b->list, &tlob_uprobe_list);
> +	return 0;
> +
> +err_start:
> +	rv_uprobe_detach(b->start_probe);
> +err_path:
> +	path_put(&path);
> +err_free:
> +	kfree(b);
> +	return ret;
> +}
> +
> +static int tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +	struct path remove_path;
> +	int ret;
> +
> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
> +	if (ret)
> +		return ret;
> +
> +	ret = -ENOENT;
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		if (b->offset_start != offset_start)
> +			continue;
> +		if (b->start_probe->path.dentry != remove_path.dentry)
> +			continue;
> +		list_del(&b->list);
> +		rv_uprobe_detach(b->start_probe);
> +		rv_uprobe_detach(b->stop_probe);
> +		kfree(b);
> +		ret = 0;
> +		break;
> +	}
> +
> +	path_put(&remove_path);
> +	return ret;
> +}
> +
> +static void tlob_remove_all_uprobes(void)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +	LIST_HEAD(pending);
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		list_move(&b->list, &pending);
> +		rv_uprobe_unregister_nosync(b->start_probe);
> +		rv_uprobe_unregister_nosync(b->stop_probe);
> +	}
> +	mutex_unlock(&tlob_uprobe_mutex);
> +
> +	if (list_empty(&pending))
> +		return;
> +
> +	/*
> +	 * One global barrier for all probes dequeued above; no new handlers
> +	 * for any of them can fire after this returns.
> +	 */
> +	rv_uprobe_sync();
> +
> +	list_for_each_entry_safe(b, tmp, &pending, list) {
> +		rv_uprobe_free(b->start_probe);
> +		rv_uprobe_free(b->stop_probe);
> +		kfree(b);
> +	}
> +}
> +
> +static ssize_t tlob_monitor_read(struct file *file,
> +				 char __user *ubuf,
> +				 size_t count, loff_t *ppos)
> +{
> +	const int line_sz = TLOB_MAX_PATH + 128;
> +	struct tlob_uprobe_binding *b;
> +	char *buf, *p;
> +	int n = 0, buf_sz, pos = 0;
> +	ssize_t ret;
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry(b, &tlob_uprobe_list, list)
> +		n++;
> +
> +	buf_sz = (n ? n : 1) * line_sz + 1;
> +	buf = kmalloc(buf_sz, GFP_KERNEL);
> +	if (!buf) {
> +		mutex_unlock(&tlob_uprobe_mutex);
> +		return -ENOMEM;
> +	}
> +
> +	list_for_each_entry(b, &tlob_uprobe_list, list) {
> +		p = b->binpath;
> +		pos += scnprintf(buf + pos, buf_sz - pos,
> +				 "p %s:0x%llx 0x%llx threshold=%llu\n",
> +				 p,
> +				 (unsigned long long)b->offset_start,
> +				 (unsigned long long)b->offset_stop,
> +				 b->threshold_us);
> +	}
> +	mutex_unlock(&tlob_uprobe_mutex);
> +
> +	ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/*
> + * Parse "p PATH:OFFSET_START OFFSET_STOP threshold=US".
> + * PATH may contain ':'; the last ':' separates path from offset.
> + * Returns 0 or -EINVAL.
> + */
> +static int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> +				  char **path_out,
> +				  loff_t *start_out, loff_t *stop_out)
> +{
> +	unsigned long long thr = 0, stop_val = 0;
> +	long long start_val;
> +	char *p, *path_token, *token, *colon;
> +	bool got_stop = false, got_thr = false;
> +	int n;
> +
> +	/* Must start with "p " */
> +	if (buf[0] != 'p' || buf[1] != ' ')
> +		return -EINVAL;
> +
> +	p = buf + 2;
> +	while (*p == ' ')
> +		p++;
> +
> +	/* First space-delimited token is PATH:OFFSET_START */
> +	path_token = strsep(&p, " \t");
> +	if (!path_token || !*path_token)
> +		return -EINVAL;
> +
> +	/* Split at last ':' to handle paths that contain ':'. */
> +	colon = strrchr(path_token, ':');
> +	if (!colon || colon - path_token < 2)
> +		return -EINVAL;
> +	*colon = '\0';
> +
> +	if (path_token[0] != '/')
> +		return -EINVAL;
> +
> +	n = 0;
> +	if (sscanf(colon + 1, "%lli%n", &start_val, &n) != 1 || n == 0)
> +		return -EINVAL;
> +	if (start_val < 0)
> +		return -EINVAL;
> +
> +	/* Remaining tokens: OFFSET_STOP threshold=US */
> +	while (p && (token = strsep(&p, " \t")) != NULL) {
> +		if (!*token)
> +			continue;
> +		if (strncmp(token, "threshold=", 10) == 0) {
> +			if (kstrtoull(token + 10, 0, &thr))
> +				return -EINVAL;
> +			got_thr = true;
> +		} else if (!got_stop) {
> +			long long sv;
> +
> +			n = 0;
> +			if (sscanf(token, "%lli%n", &sv, &n) != 1 || n == 0)
> +				return -EINVAL;
> +			if (sv < 0)
> +				return -EINVAL;
> +			stop_val = (unsigned long long)sv;
> +			got_stop = true;
> +		} else {
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (!got_stop || !got_thr || thr == 0)
> +		return -EINVAL;
> +	if (start_val == (long long)stop_val)
> +		return -EINVAL;
> +
> +	*thr_out   = thr;
> +	*path_out  = path_token;
> +	*start_out = (loff_t)start_val;
> +	*stop_out  = (loff_t)stop_val;
> +	return 0;
> +}
> +
> +/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal convention). */
> +static int tlob_parse_remove_line(char *buf, char **path_out, loff_t *start_out)
> +{
> +	char *binpath, *colon;
> +	long long off;
> +	int n = 0;
> +
> +	if (buf[0] != '-')
> +		return -EINVAL;
> +	binpath = buf + 1;
> +	if (binpath[0] != '/')
> +		return -EINVAL;
> +	colon = strrchr(binpath, ':');
> +	if (!colon || colon - binpath < 2)
> +		return -EINVAL;
> +	*colon = '\0';
> +	if (sscanf(colon + 1, "%lli%n", &off, &n) != 1 || n == 0)
> +		return -EINVAL;
> +	*path_out  = binpath;
> +	*start_out = (loff_t)off;
> +	return 0;
> +}
> +
> +VISIBLE_IF_KUNIT int tlob_create_or_delete_uprobe(char *buf)
> +{
> +	loff_t offset_start, offset_stop;
> +	u64 threshold_us;
> +	char *binpath;
> +	int ret;
> +
> +	if (buf[0] == '-') {
> +		ret = tlob_parse_remove_line(buf, &binpath, &offset_start);
> +		if (ret)
> +			return ret;
> +		mutex_lock(&tlob_uprobe_mutex);
> +		ret = tlob_remove_uprobe_by_key(offset_start, binpath);
> +		mutex_unlock(&tlob_uprobe_mutex);
> +		return ret;
> +	}
> +	ret = tlob_parse_uprobe_line(buf, &threshold_us, &binpath,
> +				     &offset_start, &offset_stop);
> +	if (ret)
> +		return ret;
> +	mutex_lock(&tlob_uprobe_mutex);
> +	ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
> +	mutex_unlock(&tlob_uprobe_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_create_or_delete_uprobe);
> +
> +static ssize_t tlob_monitor_write(struct file *file,
> +				  const char __user *ubuf,
> +				  size_t count, loff_t *ppos)
> +{
> +	char buf[TLOB_MAX_PATH + 128];
> +
> +	if (count >= sizeof(buf))
> +		return -EINVAL;
> +	if (copy_from_user(buf, ubuf, count))
> +		return -EFAULT;
> +	buf[count] = '\0';
> +	if (count > 0 && buf[count - 1] == '\n')
> +		buf[count - 1] = '\0';
> +	return tlob_create_or_delete_uprobe(buf) ?: (ssize_t)count;
> +}
> +
> +static const struct file_operations tlob_monitor_fops = {
> +	.open	= simple_open,
> +	.read	= tlob_monitor_read,
> +	.write	= tlob_monitor_write,
> +	.llseek	= noop_llseek,
> +};
> +
> +static int __tlob_init_monitor(void)
> +{
> +	int retval;
> +
> +	tlob_state_cache = kmem_cache_create("tlob_task_state",
> +					     sizeof(struct tlob_task_state),
> +					     0, 0, NULL);
> +	if (!tlob_state_cache)
> +		return -ENOMEM;
> +
> +	atomic_set(&tlob_num_monitored, 0);
> +
> +	retval = da_monitor_init_prealloc(TLOB_MAX_MONITORED);
> +	if (retval) {
> +		kmem_cache_destroy(tlob_state_cache);
> +		tlob_state_cache = NULL;
> +		return retval;
> +	}
> +
> +	/* Synthetic reference: held while the monitor is enabled. */
> +	reinit_completion(&tlob_fd_released);
> +	refcount_set(&tlob_fd_refcount, 1);
> +
> +	rv_this.enabled = 1;
> +	return 0;
> +}
> +
> +static void __tlob_destroy_monitor(void)
> +{
> +	rv_this.enabled = 0;
> +	/*
> +	 * Remove uprobes first so stop_task can't race with tlob_stop_all().
> +	 * rv_uprobe_sync() inside ensures all in-flight handlers have finished.
> +	 */
> +	tlob_remove_all_uprobes();
> +	tlob_stop_all();
> +	/* Wait for tlob_free_rcu and da_pool_return_cb before pool teardown. */
> +	synchronize_rcu();
> +
> +	/*
> +	 * Drop the synthetic ref and wait for all open fds to close before
> +	 * teardown; prevents kmem_cache_zalloc() on the destroyed cache.
> +	 */
> +	if (!refcount_dec_and_test(&tlob_fd_refcount))
> +		wait_for_completion(&tlob_fd_released);
> +
> +	da_monitor_destroy();
> +	kmem_cache_destroy(tlob_state_cache);
> +	tlob_state_cache = NULL;
> +}
> +
> +/* KUnit wrappers that acquire rv_interface_lock around monitor init/destroy. */
> +#if IS_ENABLED(CONFIG_KUNIT)
> +int tlob_init_monitor(void)
> +{
> +	int ret;
> +
> +	mutex_lock(&rv_interface_lock);
> +	ret = __tlob_init_monitor();
> +	mutex_unlock(&rv_interface_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tlob_init_monitor);
> +
> +void tlob_destroy_monitor(void)
> +{
> +	mutex_lock(&rv_interface_lock);
> +	__tlob_destroy_monitor();
> +	mutex_unlock(&rv_interface_lock);
> +}
> +EXPORT_SYMBOL_GPL(tlob_destroy_monitor);
> +
> +int tlob_num_monitored_read(void)
> +{
> +	return atomic_read(&tlob_num_monitored);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_num_monitored_read);
> +
> +/* Tracepoint probes for KUnit; rv_trace.h is only included here. */
> +static struct tlob_captured_event     tlob_kunit_last_event;
> +static struct tlob_captured_error_env tlob_kunit_last_error_env;
> +static atomic_t tlob_kunit_event_cnt    = ATOMIC_INIT(0);
> +static atomic_t tlob_kunit_error_env_cnt = ATOMIC_INIT(0);
> +
> +static void tlob_kunit_event_probe(void *data, int id, char *state, char *event,
> +				   char *next_state, bool final_state)
> +{
> +	tlob_kunit_last_event.id = id;
> +	strscpy(tlob_kunit_last_event.state, state,
> +		sizeof(tlob_kunit_last_event.state));
> +	strscpy(tlob_kunit_last_event.event, event,
> +		sizeof(tlob_kunit_last_event.event));
> +	strscpy(tlob_kunit_last_event.next_state, next_state,
> +		sizeof(tlob_kunit_last_event.next_state));
> +	tlob_kunit_last_event.final_state = final_state;
> +	atomic_inc(&tlob_kunit_event_cnt);
> +}
> +
> +static void tlob_kunit_error_env_probe(void *data, int id, char *state,
> +				       char *event, char *env)
> +{
> +	tlob_kunit_last_error_env.id = id;
> +	strscpy(tlob_kunit_last_error_env.state, state,
> +		sizeof(tlob_kunit_last_error_env.state));
> +	strscpy(tlob_kunit_last_error_env.event, event,
> +		sizeof(tlob_kunit_last_error_env.event));
> +	strscpy(tlob_kunit_last_error_env.env, env,
> +		sizeof(tlob_kunit_last_error_env.env));
> +	atomic_inc(&tlob_kunit_error_env_cnt);
> +}
> +
> +int tlob_register_kunit_probes(void)
> +{
> +	int ret;
> +
> +	atomic_set(&tlob_kunit_event_cnt, 0);
> +	atomic_set(&tlob_kunit_error_env_cnt, 0);
> +
> +	ret = register_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +	if (ret)
> +		return ret;
> +	ret = register_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> +	if (ret) {
> +		unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +		return ret;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_register_kunit_probes);
> +
> +void tlob_unregister_kunit_probes(void)
> +{
> +	unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +	unregister_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> +	tracepoint_synchronize_unregister();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_unregister_kunit_probes);
> +
> +int tlob_event_count_read(void)
> +{
> +	return atomic_read(&tlob_kunit_event_cnt);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_read);
> +
> +void tlob_event_count_reset(void)
> +{
> +	atomic_set(&tlob_kunit_event_cnt, 0);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_reset);
> +
> +int tlob_error_env_count_read(void)
> +{
> +	return atomic_read(&tlob_kunit_error_env_cnt);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_read);
> +
> +void tlob_error_env_count_reset(void)
> +{
> +	atomic_set(&tlob_kunit_error_env_cnt, 0);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_reset);
> +
> +const struct tlob_captured_event *tlob_last_event_read(void)
> +{
> +	return &tlob_kunit_last_event;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_event_read);
> +
> +const struct tlob_captured_error_env *tlob_last_error_env_read(void)
> +{
> +	return &tlob_kunit_last_error_env;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_error_env_read);
> +
> +#endif /* CONFIG_KUNIT */
> +
> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
> +{
> +	rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
> +	rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +	rv_attach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
> +	return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
> +
> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
> +{
> +	rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
> +	rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +	rv_detach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
> +
> +static int enable_tlob(void)
> +{
> +	int retval;
> +
> +	retval = __tlob_init_monitor();
> +	if (retval)
> +		return retval;
> +
> +	return tlob_enable_hooks();
> +}
> +
> +static void disable_tlob(void)
> +{
> +	tlob_disable_hooks();
> +	__tlob_destroy_monitor();
> +}
> +
> +static struct rv_monitor rv_this = {
> +	.name		= "tlob",
> +	.description	= "Per-task latency-over-budget monitor.",
> +	.enable		= enable_tlob,
> +	.disable	= disable_tlob,
> +	.reset		= da_monitor_reset_all,
> +	.enabled	= 0,
> +};
> +
> +static void *tlob_chardev_bind(void)
> +{
> +	struct tlob_fpriv *fp;
> +
> +	fp = kzalloc_obj(*fp, GFP_KERNEL);
> +	if (!fp)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Pin cache/pool for fd lifetime; balanced in tlob_chardev_release.
> +	 * If the synthetic ref has already been dropped (__tlob_destroy_monitor
> +	 * ran to completion), reject the bind so the caller gets ENODEV instead
> +	 * of corrupting a zero refcount.
> +	 */
> +	if (!refcount_inc_not_zero(&tlob_fd_refcount)) {
> +		kfree(fp);
> +		return ERR_PTR(-ENODEV);
> +	}
> +	return fp;
> +}
> +
> +static void tlob_chardev_release(void *priv)
> +{
> +	struct tlob_fpriv *fp = priv;
> +
> +	if (fp->monitoring) {
> +		/* All return values are safe on close. */
> +		(void)tlob_stop_task(fp->task);
> +		put_task_struct(fp->task);
> +	}
> +
> +	kfree(fp);
> +
> +	/* Release fd's pin; if last, wake __tlob_destroy_monitor. */
> +	if (refcount_dec_and_test(&tlob_fd_refcount))
> +		complete(&tlob_fd_released);
> +}
> +
> +static long tlob_chardev_ioctl(void *priv, unsigned int cmd, unsigned long arg)
> +{
> +	struct tlob_fpriv *fp = priv;
> +	struct tlob_start_args args;
> +	struct task_struct *task;
> +	int ret;
> +
> +	switch (cmd) {
> +	case TLOB_IOCTL_TRACE_START:
> +		if (fp->monitoring)
> +			return -EALREADY;
> +
> +		if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
> +			return -EFAULT;
> +
> +		ret = tlob_start_task(current, args.threshold_us);
> +		if (ret)
> +			return ret;
> +
> +		fp->task = current;
> +		get_task_struct(current);
> +		fp->budget_exceeded = false;
> +
> +		/* Link fd so hrtimer callback can latch budget_exceeded. */
> +		scoped_guard(rcu) {
> +			struct tlob_task_state *ws = da_get_target_by_id(current->pid);
> +
> +			if (ws)
> +				smp_store_release(&ws->fpriv, fp);
> +		}
> +
> +		fp->monitoring = true;
> +		return 0;
> +
> +	case TLOB_IOCTL_TRACE_STOP:
> +		if (!fp->monitoring)
> +			return -EINVAL;
> +
> +		task = fp->task;
> +		fp->monitoring = false;
> +		fp->task = NULL;
> +
> +		ret = tlob_stop_task(task);
> +		put_task_struct(task);
> +
> +		/*
> +		 * -EOVERFLOW: budget exceeded; propagate to caller.
> +		 * -EAGAIN: concurrent stop_all claimed cleanup; fall through to
> +		 *   budget_exceeded latch set by the hrtimer callback.
> +		 * -ESRCH: task exited before TRACE_STOP (process-exit handler
> +		 *   claimed cleanup); same latch applies.  Not an internal error.
> +		 */
> +		if (ret == -EAGAIN || ret == -ESRCH)
> +			return READ_ONCE(fp->budget_exceeded) ? -EOVERFLOW : 0;
> +		return ret;
> +
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
> +static const struct rv_chardev_ops tlob_chardev_ops = {
> +	.owner   = THIS_MODULE,
> +	.bind    = tlob_chardev_bind,
> +	.ioctl   = tlob_chardev_ioctl,
> +	.release = tlob_chardev_release,
> +};
> +
> +static int __init register_tlob(void)
> +{
> +	int ret;
> +
> +	ret = rv_chardev_register_monitor("tlob", &tlob_chardev_ops);
> +	if (ret)
> +		return ret;
> +
> +	ret = rv_register_monitor(&rv_this, NULL);
> +	if (ret) {
> +		rv_chardev_unregister_monitor("tlob");
> +		return ret;
> +	}
> +
> +	if (rv_this.root_d) {
> +		if (!tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
> +					 &tlob_monitor_fops)) {
> +			rv_unregister_monitor(&rv_this);
> +			rv_chardev_unregister_monitor("tlob");
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit unregister_tlob(void)
> +{
> +	rv_chardev_unregister_monitor("tlob");
> +	rv_unregister_monitor(&rv_this);
> +}
> +
> +module_init(register_tlob);
> +module_exit(unregister_tlob);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000000..71c1735d27d2
> --- /dev/null
[...]
> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
> index ee4e68102f17..a45c4763dbe5 100644
> --- a/kernel/trace/rv/rv.c
> +++ b/kernel/trace/rv/rv.c
> @@ -142,10 +142,17 @@
>  #include <linux/module.h>
>  #include <linux/init.h>
>  #include <linux/slab.h>
> +#include <kunit/visibility.h>
>  
>  #ifdef CONFIG_RV_MON_EVENTS
>  #define CREATE_TRACE_POINTS
>  #include <rv_trace.h>
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +EXPORT_TRACEPOINT_SYMBOL_GPL(error_tlob);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(event_tlob);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(error_env_tlob);
> +#endif

Can't this stay in tlob.c? That would keep the shared file clean and
avoid the ifdeffery.

>  #endif
>  
>  #include "rv.h"
> @@ -696,6 +703,33 @@ static void turn_monitoring_on(void)
>  	WRITE_ONCE(monitoring_on, true);
>  }
>  
> +#if IS_ENABLED(CONFIG_KUNIT)
> +/**
> + * rv_kunit_monitoring_on - enable the global monitoring_on flag for KUnit tests.
> + *
> + * KUnit test suite_init functions must call this before initialising any
> + * monitor, mirroring the turn_monitoring_on() call in rv_init_interface().
> + * The matching rv_kunit_monitoring_off() must be called in suite_exit to
> + * restore the flag so that test suites do not interfere with each other.
> + */
> +void rv_kunit_monitoring_on(void)
> +{
> +	turn_monitoring_on();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_on);
> +
> +/**
> + * rv_kunit_monitoring_off - disable the global monitoring_on flag for KUnit tests.
> + *
> + * Must be called in suite_exit to restore global state after rv_kunit_monitoring_on().
> + */
> +void rv_kunit_monitoring_off(void)
> +{
> +	turn_monitoring_off();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_off);
> +#endif /* CONFIG_KUNIT */
> +
>  static void turn_monitoring_on_with_reset(void)
>  {
>  	lockdep_assert_held(&rv_interface_lock);
> @@ -846,6 +880,10 @@ int __init rv_init_interface(void)
>  	if (retval)
>  		return 1;
>  
> +	retval = rv_chardev_init();
> +	if (retval)
> +		return 1;
> +

Both of those can go in separate patches, as mentioned above.

>  	turn_monitoring_on();
>  
>  	rv_root.root_dir = no_free_ptr(root_dir);
> diff --git a/kernel/trace/rv/rv.h b/kernel/trace/rv/rv.h
> index 2c0f51ff9d5c..82c9a2b57596 100644
> --- a/kernel/trace/rv/rv.h
> +++ b/kernel/trace/rv/rv.h
> @@ -31,6 +31,8 @@ int rv_enable_monitor(struct rv_monitor *mon);
>  bool rv_is_container_monitor(struct rv_monitor *mon);
>  bool rv_is_nested_monitor(struct rv_monitor *mon);
>  
> +int rv_chardev_init(void);
> +

Same here.

>  #ifdef CONFIG_RV_REACTORS
>  int reactor_populate_monitor(struct rv_monitor *mon, struct dentry *root);
>  int init_rv_reactors(struct dentry *root_dir);
> diff --git a/kernel/trace/rv/rv_chardev.c b/kernel/trace/rv/rv_chardev.c
> new file mode 100644
> index 000000000000..1fba1642ebc1
> --- /dev/null
> +++ b/kernel/trace/rv/rv_chardev.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
> +

And here.

> diff --git a/kernel/trace/rv/rv_uprobe.c b/kernel/trace/rv/rv_uprobe.c
> index bc28399cfd4b..1ba7b80c1d87 100644
> --- a/kernel/trace/rv/rv_uprobe.c
> +++ b/kernel/trace/rv/rv_uprobe.c

Also this probably belongs in the uprobes patch.

> @@ -132,13 +132,10 @@ EXPORT_SYMBOL_GPL(rv_uprobe_attach);
>   */
>  void rv_uprobe_detach(struct rv_uprobe *p)
>  {
> -	struct rv_uprobe_impl *impl;
> -
>  	if (!p)
>  		return;
>  
> -	impl = container_of(p, struct rv_uprobe_impl, pub);
> -	uprobe_unregister_nosync(impl->uprobe, &impl->uc);
> +	rv_uprobe_unregister_nosync(p);
>  	/*
>  	 * uprobe_unregister_sync() is a global barrier: it waits for all
>  	 * in-flight uprobe handlers across the entire system to complete,
> @@ -146,8 +143,47 @@ void rv_uprobe_detach(struct rv_uprobe *p)
>  	 * guarantees that no handler touching impl->pub.priv is running by
>  	 * the time we return, even if the caller immediately frees priv.
>  	 */
> +	rv_uprobe_sync();
> +	rv_uprobe_free(p);
> +}
> +EXPORT_SYMBOL_GPL(rv_uprobe_detach);

[...]

> diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000000..a34e5426393b
> --- /dev/null
> +++ b/tools/include/uapi/linux/rv.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC ('r').
> + *
> + * Usage examples and design rationale are in:
> + *   Documentation/trace/rv/monitor_tlob.rst
> + */

And this in a new ioctl patch.

> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H

Thanks,
Gabriele


^ permalink raw reply

* [RFC v8 7/7] ext4: fast commit: export snapshot stats in fc_info
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Snapshot-based fast commit can fall back to a full commit when the
commit-time snapshot cannot be built (e.g. extent status cache misses).
It is useful to quantify the updates-locked window and to see why
snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v8:
- Treat stale snapshot inode sizing as a capacity fallback instead of
  letting log writing later report a missing snapshot.
- Use atomic64_t for the snapshot counters so fc_info cannot observe
  torn 64-bit values on 32-bit systems.

Changes in v7:
- Address Sashiko review by using READ_ONCE() + div64_u64() for the fc_info
  lock_updates average.

Changes in v6:
- Start consuming locked_ns in fc_info, so this patch intentionally moves
  lock_updates_ns_{total,max,samples} accounting here.
- Guard the tracepoint call with trace_ext4_fc_lock_updates_enabled() and
  use trace_call__ext4_fc_lock_updates() to avoid the double static_branch
  at the guarded call site.
- Keep the stats unconditionally while avoiding extra tracepoint
  overhead when ext4_fc_lock_updates is disabled.
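
For readers who want to experiment with the accounting outside the kernel,
here is a minimal userspace sketch of the same idea using C11 atomics. It
is an illustration only, not the patch code: struct lock_stats,
stat_update_max(), stat_record() and stat_avg_us() are hypothetical names,
and atomic_compare_exchange_weak() stands in for atomic64_cmpxchg().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the lock_updates_* counters. */
struct lock_stats {
	_Atomic uint64_t ns_total;
	_Atomic uint64_t ns_max;
	_Atomic uint64_t samples;
};

/* Raise *stat to value unless a larger value is already stored. */
static void stat_update_max(_Atomic uint64_t *stat, uint64_t value)
{
	uint64_t old = atomic_load(stat);

	while (value > old) {
		/* On failure, old is refreshed with the current contents. */
		if (atomic_compare_exchange_weak(stat, &old, value))
			break;
	}
}

/* Account one updates-locked window of locked_ns nanoseconds. */
static void stat_record(struct lock_stats *s, uint64_t locked_ns)
{
	atomic_fetch_add(&s->ns_total, locked_ns);
	atomic_fetch_add(&s->samples, 1);
	stat_update_max(&s->ns_max, locked_ns);
}

/* Average in microseconds; 0 when nothing has been sampled yet. */
static uint64_t stat_avg_us(struct lock_stats *s)
{
	uint64_t samples = atomic_load(&s->samples);
	uint64_t total = atomic_load(&s->ns_total);

	return samples ? (total / samples) / 1000 : 0;
}
```

As in the patch, the total and the sample count are read separately, so the
reported average is best-effort under concurrent updates; each individual
64-bit read is still untorn.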

 fs/ext4/ext4.h        |  31 ++++++++++++++
 fs/ext4/fast_commit.c |  96 ++++++++++++++++++++++++++++++++++++++-----
 fs/ext4/super.c       |   1 +
 3 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index dd09d00a73af..ddc903738c6b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1550,6 +1550,36 @@ struct ext4_orphan_info {
 						 * file blocks */
 };
 
+/*
+ * Ext4 fast commit snapshot statistics.
+ *
+ * These are best-effort counters intended for debugging / performance
+ * introspection; they are not exact under concurrent updates.
+ */
+struct ext4_fc_snap_stats {
+	atomic64_t lock_updates_ns_total;
+	atomic64_t lock_updates_ns_max;
+	atomic64_t lock_updates_samples;
+
+	atomic64_t snap_inodes;
+	atomic64_t snap_ranges;
+
+	atomic64_t snap_fail_es_miss;
+	atomic64_t snap_fail_es_delayed;
+	atomic64_t snap_fail_es_other;
+
+	atomic64_t snap_fail_inodes_cap;
+	atomic64_t snap_fail_ranges_cap;
+	atomic64_t snap_fail_nomem;
+	atomic64_t snap_fail_inode_loc;
+
+	/*
+	 * Missing inode snapshots during log writing should never happen.
+	 * Keep this counter to help catch unexpected regressions.
+	 */
+	atomic64_t snap_fail_no_snap;
+};
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1824,6 +1854,7 @@ struct ext4_sb_info {
 	struct mutex s_fc_lock;
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
+	struct ext4_fc_snap_stats s_fc_snap_stats;
 	tid_t s_fc_ineligible_tid;
 #ifdef CONFIG_EXT4_DEBUG
 	int s_fc_debug_max_replay;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index dc08f8ff43d9..4ef796b9b6cb 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -281,6 +281,19 @@ static inline void ext4_fc_wake_inode_state(struct inode *inode, int bit)
 		    ext4_inode_state_wait_bit(bit));
 }
 
+static void ext4_fc_snap_stats_update_max(atomic64_t *stat, u64 value)
+{
+	u64 old = atomic64_read(stat);
+
+	while (value > old) {
+		u64 prev = atomic64_cmpxchg(stat, old, value);
+
+		if (prev == old)
+			break;
+		old = prev;
+	}
+}
+
 /*
  * Remove inode from fast commit list. If the inode is being committed
  * we wait until inode commit is done.
@@ -868,6 +881,8 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode fc_inode;
 	struct ext4_fc_tl tl;
 	u8 *dst;
@@ -875,13 +890,17 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	int inode_len;
 	int ret;
 
-	if (!snap)
+	if (!snap) {
+		atomic64_inc(&stats->snap_fail_no_snap);
 		return -ECANCELED;
+	}
 
 	src = snap->inode_buf;
 	inode_len = snap->inode_len;
-	if (!src || inode_len == 0)
+	if (!src || inode_len == 0) {
+		atomic64_inc(&stats->snap_fail_no_snap);
 		return -ECANCELED;
+	}
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -911,13 +930,17 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_add_range fc_ext;
 	struct ext4_fc_del_range lrange;
 	struct ext4_extent *ex;
 	struct ext4_fc_range *range;
 
-	if (!snap)
+	if (!snap) {
+		atomic64_inc(&stats->snap_fail_no_snap);
 		return -ECANCELED;
+	}
 
 	list_for_each_entry(range, &snap->data_list, list) {
 		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
@@ -978,6 +1001,8 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
 	unsigned int nr_ranges = 0;
 
@@ -1005,11 +1030,13 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		u64 remaining = (u64)end_lblk - cur_lblk + 1;
 
 		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			atomic64_inc(&stats->snap_fail_es_miss);
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
 		}
 
 		if (ext4_es_is_delayed(&es)) {
+			atomic64_inc(&stats->snap_fail_es_delayed);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
@@ -1024,6 +1051,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		}
 
 		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			atomic64_inc(&stats->snap_fail_ranges_cap);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
@@ -1031,6 +1059,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range) {
+			atomic64_inc(&stats->snap_fail_nomem);
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
 		}
@@ -1058,6 +1087,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			atomic64_inc(&stats->snap_fail_es_other);
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
@@ -1081,6 +1111,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
@@ -1091,6 +1123,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
 	if (ret) {
+		atomic64_inc(&stats->snap_fail_inode_loc);
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
 	}
@@ -1102,6 +1135,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		atomic64_inc(&stats->snap_fail_nomem);
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
@@ -1126,6 +1160,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	atomic64_inc(&stats->snap_inodes);
+	atomic64_add(nr_ranges, &stats->snap_ranges);
 	if (nr_rangesp)
 		*nr_rangesp = nr_ranges;
 	return 0;
@@ -1229,12 +1265,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	int ret = 0;
 	int alloc_ctx;
 
-	if (!inodes_size)
-		return 0;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			atomic64_inc(&sbi->s_fc_snap_stats.snap_fail_inodes_cap);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1260,6 +1294,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			atomic64_inc(&sbi->s_fc_snap_stats.snap_fail_inodes_cap);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1303,6 +1338,7 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
@@ -1362,8 +1398,13 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 		return ret;
 
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
-	if (ret)
+	if (ret) {
+		if (ret == -E2BIG)
+			atomic64_inc(&snap_stats->snap_fail_inodes_cap);
+		else if (ret == -ENOMEM)
+			atomic64_inc(&snap_stats->snap_fail_nomem);
 		return ret;
+	}
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
@@ -1384,12 +1425,15 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
 				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
-	if (trace_ext4_fc_lock_updates_enabled()) {
-		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
-		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
-						 snap_inodes, snap_ranges,
-						 ret, snap_err);
-	}
+	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	atomic64_add(locked_ns, &snap_stats->lock_updates_ns_total);
+	atomic64_inc(&snap_stats->lock_updates_samples);
+	ext4_fc_snap_stats_update_max(&snap_stats->lock_updates_ns_max,
+				      locked_ns);
+	if (trace_ext4_fc_lock_updates_enabled())
+		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+						 snap_inodes, snap_ranges,
+						 ret, snap_err);
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -2657,11 +2701,26 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 {
 	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
 	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
+	u64 lock_avg_ns = 0;
+	u64 lock_updates_samples;
+	u64 lock_updates_ns_total;
+	u64 lock_updates_ns_max;
 	int i;
 
 	if (v != SEQ_START_TOKEN)
 		return 0;
 
+	lock_updates_samples =
+		atomic64_read(&snap_stats->lock_updates_samples);
+	lock_updates_ns_total =
+		atomic64_read(&snap_stats->lock_updates_ns_total);
+	lock_updates_ns_max =
+		atomic64_read(&snap_stats->lock_updates_ns_max);
+	if (lock_updates_samples)
+		lock_avg_ns = div64_u64(lock_updates_ns_total,
+					lock_updates_samples);
+
 	seq_printf(seq,
 		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
 		   stats->fc_num_commits, stats->fc_ineligible_commits,
@@ -2672,6 +2731,23 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
 			stats->fc_ineligible_reason_count[i]);
 
+	seq_printf(seq,
+		   "Snapshot stats:\n%llu inodes\n%llu ranges\n%lluus lock_updates_avg\n%lluus lock_updates_max\n",
+		   atomic64_read(&snap_stats->snap_inodes),
+		   atomic64_read(&snap_stats->snap_ranges),
+		   div_u64(lock_avg_ns, 1000),
+		   div_u64(lock_updates_ns_max, 1000));
+	seq_printf(seq,
+		   "Snapshot failures:\n%llu es_miss\n%llu es_delayed\n%llu es_other\n%llu inodes_cap\n%llu ranges_cap\n%llu nomem\n%llu inode_loc\n%llu no_snap\n",
+		   atomic64_read(&snap_stats->snap_fail_es_miss),
+		   atomic64_read(&snap_stats->snap_fail_es_delayed),
+		   atomic64_read(&snap_stats->snap_fail_es_other),
+		   atomic64_read(&snap_stats->snap_fail_inodes_cap),
+		   atomic64_read(&snap_stats->snap_fail_ranges_cap),
+		   atomic64_read(&snap_stats->snap_fail_nomem),
+		   atomic64_read(&snap_stats->snap_fail_inode_loc),
+		   atomic64_read(&snap_stats->snap_fail_no_snap));
+
 	return 0;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3c869f0001c5..f1f8819a2a23 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4544,6 +4544,7 @@ static void ext4_fast_commit_init(struct super_block *sb)
 	sbi->s_fc_ineligible_tid = 0;
 	mutex_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	memset(&sbi->s_fc_snap_stats, 0, sizeof(sbi->s_fc_snap_stats));
 	sbi->s_fc_replay_state.fc_regions = NULL;
 	sbi->s_fc_replay_state.fc_regions_size = 0;
 	sbi->s_fc_replay_state.fc_regions_used = 0;
-- 
2.53.0

^ permalink raw reply related

* [RFC v8 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-ext4, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes in v8:
- Use trace_call__ext4_fc_lock_updates() at the guarded call site as
  suggested by Steven Rostedt, avoiding a second static_branch check.

Changes in v7:
- Address Sashiko review by reporting successfully snapshotted inode counts
  in ext4_fc_lock_updates when snapshotting stops early.

Changes in v6:
- Drop explicit ext4_fc_snap_err assignments and rely on enum
  auto-increment.
- Treat locked_ns as trace-only in this patch and calculate it only when
  ext4_fc_lock_updates is enabled, as suggested by Steven Rostedt.
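
As a rough userspace illustration of the two mechanisms used here (record
only the first failure reason, and keep the enum and its printable names in
one list), the sketch below may help. It is not the kernel code: the
SNAP_ERR_* names, SNAP_ERR_LIST, set_snap_err() and snap_err_name() are
hypothetical, and a plain string table stands in for TRACE_DEFINE_ENUM()
and __print_symbolic().

```c
/* Single list of reasons; the order defines the auto-incremented values. */
#define SNAP_ERR_LIST(X)	\
	X(NONE)			\
	X(ES_MISS)		\
	X(ES_DELAYED)		\
	X(ES_OTHER)		\
	X(INODES_CAP)		\
	X(RANGES_CAP)		\
	X(NOMEM)		\
	X(INODE_LOC)

/* Expand the list once into enum constants. */
#define AS_ENUM(a)	SNAP_ERR_##a,
enum snap_err { SNAP_ERR_LIST(AS_ENUM) SNAP_ERR_COUNT };
#undef AS_ENUM

/* Expand the same list into a value-indexed name table. */
#define AS_NAME(a)	[SNAP_ERR_##a] = #a,
static const char *const snap_err_names[] = { SNAP_ERR_LIST(AS_NAME) };
#undef AS_NAME

/* First failure wins: later reasons do not overwrite an earlier one. */
static void set_snap_err(int *snap_err, int err)
{
	if (snap_err && *snap_err == SNAP_ERR_NONE)
		*snap_err = err;
}

/* Map a reason to its name, tolerating out-of-range values. */
static const char *snap_err_name(int err)
{
	if (err < 0 || err >= SNAP_ERR_COUNT)
		return "UNKNOWN";
	return snap_err_names[err];
}
```

Because both expansions come from the same list, appending a new reason at
the end keeps the existing numeric values unchanged, which is what tooling
that decodes snap_err would rely on.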

 fs/ext4/ext4.h              | 15 ++++++++
 fs/ext4/fast_commit.c       | 74 +++++++++++++++++++++++++++++--------
 include/trace/events/ext4.h | 61 ++++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 927173bc8381..dd09d00a73af 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1027,6 +1027,21 @@ enum {
 
 struct ext4_fc_inode_snap;
 
+/*
+ * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
+ * Keep these stable for tooling.
+ */
+enum ext4_fc_snap_err {
+	EXT4_FC_SNAP_ERR_NONE = 0,
+	EXT4_FC_SNAP_ERR_ES_MISS,
+	EXT4_FC_SNAP_ERR_ES_DELAYED,
+	EXT4_FC_SNAP_ERR_ES_OTHER,
+	EXT4_FC_SNAP_ERR_INODES_CAP,
+	EXT4_FC_SNAP_ERR_RANGES_CAP,
+	EXT4_FC_SNAP_ERR_NOMEM,
+	EXT4_FC_SNAP_ERR_INODE_LOC,
+};
+
 /*
  * fourth extended file system inode data in memory
  */
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 9e73c83b0e25..dc08f8ff43d9 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -194,6 +194,12 @@ static struct kmem_cache *ext4_fc_range_cachep;
 #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
 #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
+static inline void ext4_fc_set_snap_err(int *snap_err, int err)
+{
+	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
+		*snap_err = err;
+}
+
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
 	BUFFER_TRACE(bh, "");
@@ -968,11 +974,12 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       struct list_head *ranges,
 				       unsigned int nr_ranges_total,
-				       unsigned int *nr_rangesp)
+				       unsigned int *nr_rangesp,
+				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	unsigned int nr_ranges = 0;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
@@ -997,11 +1004,16 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		ext4_lblk_t len;
 		u64 remaining = (u64)end_lblk - cur_lblk + 1;
 
-		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
+		}
 
-		if (ext4_es_is_delayed(&es))
+		if (ext4_es_is_delayed(&es)) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
+		}
 
 		len = es.es_len - (cur_lblk - es.es_lblk);
 		if (len > remaining)
@@ -1011,12 +1023,17 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 			continue;
 		}
 
-		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
+		}
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
-		if (!range)
+		if (!range) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
+		}
 		nr_ranges++;
 
 		range->lblk = cur_lblk;
@@ -1041,6 +1058,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
 
@@ -1060,7 +1078,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int nr_ranges_total,
-				  unsigned int *nr_rangesp)
+				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
@@ -1072,8 +1090,10 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	int alloc_ctx;
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
-	if (ret)
+	if (ret) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
+	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
 		inode_len = EXT4_INODE_SIZE(inode->i_sb);
@@ -1082,6 +1102,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
 	}
@@ -1092,7 +1113,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	brelse(iloc.bh);
 
 	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
-					  &nr_ranges);
+					  &nr_ranges, snap_err);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1193,7 +1214,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
 					 unsigned int *nr_inodesp);
 
 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
-				   unsigned int inodes_size)
+				   unsigned int inodes_size,
+				   unsigned int *nr_inodesp,
+				   unsigned int *nr_rangesp,
+				   int *snap_err)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1211,6 +1235,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1234,6 +1260,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1258,16 +1286,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 		unsigned int inode_ranges = 0;
 
 		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
-					     &inode_ranges);
+					     &inode_ranges, snap_err);
 		if (ret)
 			break;
 		nr_ranges += inode_ranges;
 	}
 
+	if (nr_inodesp)
+		*nr_inodesp = idx;
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return ret;
 }
 
-static int ext4_fc_perform_commit(journal_t *journal)
+static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1276,10 +1308,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct inode *inode;
 	struct inode **inodes;
 	unsigned int inodes_size;
+	unsigned int snap_inodes = 0;
+	unsigned int snap_ranges = 0;
+	int snap_err = EXT4_FC_SNAP_ERR_NONE;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
+	ktime_t lock_start;
+	u64 locked_ns;
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1324,13 +1361,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	if (ret)
 		return ret;
 
-
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
 	if (ret)
 		return ret;
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
+	lock_start = ktime_get();
 	/*
 	 * The journal is now locked. No more handles can start and all the
 	 * previous handles are now drained. Snapshotting happens in this
@@ -1344,8 +1381,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
+				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
+	if (trace_ext4_fc_lock_updates_enabled()) {
+		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+						 snap_inodes, snap_ranges,
+						 ret, snap_err);
+	}
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -1550,7 +1594,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
-	ret = ext4_fc_perform_commit(journal);
+	ret = ext4_fc_perform_commit(journal, commit_tid);
 	if (ret < 0) {
 		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
 			status = EXT4_FC_STATUS_INELIGIBLE;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f493642cf121..7028a28316fa 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
+#undef EM
+#undef EMe
+#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+
+#define TRACE_SNAP_ERR						\
+	EM(NONE)						\
+	EM(ES_MISS)						\
+	EM(ES_DELAYED)						\
+	EM(ES_OTHER)						\
+	EM(INODES_CAP)						\
+	EM(RANGES_CAP)						\
+	EM(NOMEM)						\
+	EMe(INODE_LOC)
+
+TRACE_SNAP_ERR
+
+#undef EM
+#undef EMe
+
 #define show_fc_reason(reason)						\
 	__print_symbolic(reason,					\
 		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
@@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
 		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 );
 
+#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
+#define EMe(a)	{ EXT4_FC_SNAP_ERR_##a, #a }
+
+TRACE_EVENT(ext4_fc_lock_updates,
+	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
+		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
+		     int snap_err),
+
+	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
+
+	TP_STRUCT__entry(/* entry */
+		__field(dev_t, dev)
+		__field(tid_t, tid)
+		__field(u64, locked_ns)
+		__field(unsigned int, nr_inodes)
+		__field(unsigned int, nr_ranges)
+		__field(int, err)
+		__field(int, snap_err)
+	),
+
+	TP_fast_assign(/* assign */
+		__entry->dev = sb->s_dev;
+		__entry->tid = commit_tid;
+		__entry->locked_ns = locked_ns;
+		__entry->nr_inodes = nr_inodes;
+		__entry->nr_ranges = nr_ranges;
+		__entry->err = err;
+		__entry->snap_err = snap_err;
+	),
+
+	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
+		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
+		  __entry->err, __print_symbolic(__entry->snap_err,
+						 TRACE_SNAP_ERR))
+);
+
+#undef EM
+#undef EMe
+#undef TRACE_SNAP_ERR
+
 #define FC_REASON_NAME_STAT(reason)					\
 	show_fc_reason(reason),						\
 	__entry->fc_ineligible_rc[reason]
-- 
2.53.0

^ permalink raw reply related

* [RFC v8 5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v7:
- Address Sashiko review by guarding snapshot range arithmetic near
  EXT_MAX_BLOCKS to avoid cur_lblk / remaining-range wraparound in the
  snapshot walk.
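
To make the wraparound concern concrete, here is a small userspace sketch
of a bounded range walk over 32-bit logical block numbers. It is an
assumption-laden illustration rather than the patch code: walk_range() and
its fixed chunk size are hypothetical and replace the extent status lookup,
while the remaining-length clamp, the range cap, and the advance guard
follow the shape of the loop in this patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Split [start, end] into pieces of at most chunk blocks, counting them.
 * Returns 0 on success, -E2BIG once max_ranges pieces are already in use,
 * or -EINVAL for a zero chunk size.
 */
static int walk_range(uint32_t start, uint32_t end, uint32_t chunk,
		      unsigned int max_ranges, unsigned int *nr_rangesp)
{
	uint32_t cur = start;
	unsigned int nr_ranges = 0;

	if (chunk == 0)
		return -EINVAL;

	while (cur <= end) {
		/* Computed in 64 bits so end - cur + 1 cannot wrap to 0. */
		uint64_t remaining = (uint64_t)end - cur + 1;
		uint32_t len = chunk;

		if (len > remaining)
			len = (uint32_t)remaining;

		/* Bound the work done while updates are locked. */
		if (nr_ranges >= max_ranges)
			return -E2BIG;
		nr_ranges++;

		/* cur + len would pass end (and may wrap); stop here. */
		if ((uint64_t)len > (uint64_t)end - cur)
			break;
		cur += len;
	}

	if (nr_rangesp)
		*nr_rangesp = nr_ranges;
	return 0;
}
```

Without the final guard, a walk ending at 0xFFFFFFFF would wrap cur back
to 0 and the loop condition would stay true; with it, the walk stops after
the piece that reaches end.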

 fs/ext4/fast_commit.c | 253 ++++++++++++++++++++++++++++++------------
 1 file changed, 179 insertions(+), 74 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 8a6981e50ffe..9e73c83b0e25 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -184,6 +184,15 @@
 
 #include <trace/events/ext4.h>
 static struct kmem_cache *ext4_fc_dentry_cachep;
+static struct kmem_cache *ext4_fc_range_cachep;
+
+/*
+ * Avoid spending unbounded time/memory snapshotting highly fragmented files
+ * under jbd2_journal_lock_updates(). If we exceed this limit, fall back to
+ * full commit.
+ */
+#define EXT4_FC_SNAPSHOT_MAX_INODES	1024
+#define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
@@ -939,7 +948,7 @@ static void ext4_fc_free_ranges(struct list_head *head)
 
 	list_for_each_entry_safe(range, range_n, head, list) {
 		list_del(&range->list);
-		kfree(range);
+		kmem_cache_free(ext4_fc_range_cachep, range);
 	}
 }
 
@@ -957,16 +966,19 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 }
 
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
-				       struct list_head *ranges)
+				       struct list_head *ranges,
+				       unsigned int nr_ranges_total,
+				       unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
-	struct ext4_map_blocks map;
-	int ret;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
 		spin_unlock(&ei->i_fc_lock);
+		if (nr_rangesp)
+			*nr_rangesp = 0;
 		return 0;
 	}
 	start_lblk = ei->i_fc_lblk_start;
@@ -980,61 +992,82 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		   (unsigned long long)inode->i_ino);
 
 	while (cur_lblk <= end_lblk) {
+		struct extent_status es;
 		struct ext4_fc_range *range;
+		ext4_lblk_t len;
+		u64 remaining = (u64)end_lblk - cur_lblk + 1;
 
-		map.m_lblk = cur_lblk;
-		map.m_len = end_lblk - cur_lblk + 1;
-		ret = ext4_map_blocks(NULL, inode, &map,
-				      EXT4_GET_BLOCKS_IO_SUBMIT |
-				      EXT4_EX_NOCACHE);
-		if (ret < 0)
-			return -ECANCELED;
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+			return -EAGAIN;
+
+		if (ext4_es_is_delayed(&es))
+			return -EAGAIN;
 
-		if (map.m_len == 0) {
+		len = es.es_len - (cur_lblk - es.es_lblk);
+		if (len > remaining)
+			len = remaining;
+		if (len == 0) {
 			cur_lblk++;
 			continue;
 		}
 
-		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+			return -E2BIG;
+
+		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range)
 			return -ENOMEM;
+		nr_ranges++;
 
-		range->lblk = map.m_lblk;
-		range->len = map.m_len;
+		range->lblk = cur_lblk;
+		range->len = len;
 		range->pblk = 0;
 		range->unwritten = false;
 
-		if (ret == 0) {
+		if (ext4_es_is_hole(&es)) {
 			range->tag = EXT4_FC_TAG_DEL_RANGE;
-		} else {
-			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
-				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
-
-			/* Limit the number of blocks in one extent */
-			map.m_len = min(max, map.m_len);
+		} else if (ext4_es_is_written(&es) ||
+			   ext4_es_is_unwritten(&es)) {
+			unsigned int max;
 
 			range->tag = EXT4_FC_TAG_ADD_RANGE;
-			range->len = map.m_len;
-			range->pblk = map.m_pblk;
-			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
+			range->pblk = ext4_es_pblock(&es) +
+				      (cur_lblk - es.es_lblk);
+			range->unwritten = ext4_es_is_unwritten(&es);
+
+			max = range->unwritten ? EXT_UNWRITTEN_MAX_LEN :
+						 EXT_INIT_MAX_LEN;
+			if (range->len > max)
+				range->len = max;
+		} else {
+			kmem_cache_free(ext4_fc_range_cachep, range);
+			return -EAGAIN;
 		}
 
 		INIT_LIST_HEAD(&range->list);
 		list_add_tail(&range->list, ranges);
 
-		cur_lblk += map.m_len;
+		if ((u64)range->len > (u64)end_lblk - cur_lblk)
+			break;
+
+		cur_lblk += range->len;
 	}
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-static int ext4_fc_snapshot_inode(struct inode *inode)
+static int ext4_fc_snapshot_inode(struct inode *inode,
+				  unsigned int nr_ranges_total,
+				  unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
 	LIST_HEAD(ranges);
+	unsigned int nr_ranges = 0;
 	int ret;
 	int alloc_ctx;
 
@@ -1058,7 +1091,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
 	brelse(iloc.bh);
 
-	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
+					  &nr_ranges);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1071,10 +1105,11 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
 {
@@ -1153,49 +1188,32 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
-static int ext4_fc_snapshot_inodes(journal_t *journal)
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp);
+
+static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
+				   unsigned int inodes_size)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_inode_info *iter;
 	struct ext4_fc_dentry_update *fc_dentry;
-	struct inode **inodes;
-	unsigned int nr_inodes = 0;
 	unsigned int i = 0;
+	unsigned int idx;
+	unsigned int nr_ranges = 0;
 	int ret = 0;
 	int alloc_ctx;
 
-	alloc_ctx = ext4_fc_lock(sb);
-	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
-		nr_inodes++;
-
-	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
-		struct ext4_inode_info *ei;
-
-		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
-			continue;
-		if (list_empty(&fc_dentry->fcd_dilist))
-			continue;
-
-		/* See the comment in ext4_fc_commit_dentry_updates(). */
-		ei = list_first_entry(&fc_dentry->fcd_dilist,
-				      struct ext4_inode_info, i_fc_dilist);
-		if (!list_empty(&ei->i_fc_list))
-			continue;
-
-		nr_inodes++;
-	}
-	ext4_fc_unlock(sb, alloc_ctx);
-
-	if (!nr_inodes)
+	if (!inodes_size)
 		return 0;
 
-	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
-	if (!inodes)
-		return -ENOMEM;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		inodes[i++] = &iter->vfs_inode;
 	}
 
@@ -1215,6 +1233,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		/*
 		 * Create-only inodes may only be referenced via fcd_dilist and
 		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
@@ -1226,15 +1248,22 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 		inodes[i++] = inode;
 	}
+unlock:
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+	if (ret)
+		return ret;
+
+	for (idx = 0; idx < i; idx++) {
+		unsigned int inode_ranges = 0;
+
+		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
+					     &inode_ranges);
 		if (ret)
 			break;
+		nr_ranges += inode_ranges;
 	}
 
-	kvfree(inodes);
 	return ret;
 }
 
@@ -1245,6 +1274,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
+	struct inode **inodes;
+	unsigned int inodes_size;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
@@ -1294,6 +1325,10 @@ static int ext4_fc_perform_commit(journal_t *journal)
 		return ret;
 
 
+	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
+	if (ret)
+		return ret;
+
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
 	/*
@@ -1309,8 +1344,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
 	jbd2_journal_unlock_updates(journal);
+	kvfree(inodes);
 	if (ret)
 		return ret;
 
@@ -1366,6 +1402,64 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	return ret;
 }
 
+static unsigned int ext4_fc_count_snapshot_inodes(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	unsigned int nr_inodes = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	return nr_inodes;
+}
+
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp)
+{
+	unsigned int nr_inodes = ext4_fc_count_snapshot_inodes(sb);
+	struct inode **inodes;
+
+	*inodesp = NULL;
+	*nr_inodesp = 0;
+
+	if (!nr_inodes)
+		return 0;
+
+	if (nr_inodes > EXT4_FC_SNAPSHOT_MAX_INODES)
+		return -E2BIG;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	*inodesp = inodes;
+	*nr_inodesp = nr_inodes;
+	return 0;
+}
+
 static void ext4_fc_update_stats(struct super_block *sb, int status,
 				 u64 commit_time, int nblks, tid_t commit_tid)
 {
@@ -1458,7 +1552,10 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
 	ret = ext4_fc_perform_commit(journal);
 	if (ret < 0) {
-		status = EXT4_FC_STATUS_FAILED;
+		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
+			status = EXT4_FC_STATUS_INELIGIBLE;
+		else
+			status = EXT4_FC_STATUS_FAILED;
 		goto fallback;
 	}
 	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
@@ -1539,26 +1636,27 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
 		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
-					     struct ext4_fc_dentry_update,
-					     fcd_list);
+						 struct ext4_fc_dentry_update,
+						 fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
 		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
-		    !list_empty(&fc_dentry->fcd_dilist)) {
+			!list_empty(&fc_dentry->fcd_dilist)) {
 			/* See the comment in ext4_fc_commit_dentry_updates(). */
 			ei = list_first_entry(&fc_dentry->fcd_dilist,
-					      struct ext4_inode_info,
-					      i_fc_dilist);
+						  struct ext4_inode_info,
+						  i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
 			spin_lock(&ei->i_fc_lock);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_REQUEUE);
+						   EXT4_STATE_FC_REQUEUE);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_COMMITTING);
+						   EXT4_STATE_FC_COMMITTING);
 			spin_unlock(&ei->i_fc_lock);
 			/*
 			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
-			 * visible before we send the wakeup. Pairs with implicit
-			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 * visible before we send the wakeup. Pairs with
+			 * implicit barrier in prepare_to_wait() in
+			 * ext4_fc_del().
 			 */
 			smp_mb();
 			ext4_fc_wake_inode_state(&ei->vfs_inode,
@@ -2538,13 +2636,20 @@ int __init ext4_fc_init_dentry_cache(void)
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
 					   SLAB_RECLAIM_ACCOUNT);
 
-	if (ext4_fc_dentry_cachep == NULL)
+	if (!ext4_fc_dentry_cachep)
 		return -ENOMEM;
 
+	ext4_fc_range_cachep = KMEM_CACHE(ext4_fc_range, SLAB_RECLAIM_ACCOUNT);
+	if (!ext4_fc_range_cachep) {
+		kmem_cache_destroy(ext4_fc_dentry_cachep);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void ext4_fc_destroy_dentry_cache(void)
 {
+	kmem_cache_destroy(ext4_fc_range_cachep);
 	kmem_cache_destroy(ext4_fc_dentry_cachep);
 }
-- 
2.53.0

* [RFC v8 4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.

ext4_fc_del() also has to re-check EXT4_STATE_FC_COMMITTING after
waiting on EXT4_STATE_FC_FLUSHING_DATA. The commit thread clears
FLUSHING_DATA before it sets COMMITTING, so a waiter woken from the
flush wait must not delete the inode based on an old COMMITTING
check.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.
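
The ordering problem described above can be illustrated with a small
single-threaded model. This is only a sketch under stated assumptions, not
kernel code: struct fc_model, commit_step() and wait_bit_clear() are
hypothetical stand-ins, and the "commit thread" is simulated by advancing a
step counter. wait_bit_clear() deliberately lets the commit thread take one
more step after the awaited bit clears, to model a waiter that reacquires
s_fc_lock late.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two inode state bits. */
#define FC_FLUSHING_DATA (1u << 0)
#define FC_COMMITTING    (1u << 1)

/* Toy model of the commit thread's progress on one inode. */
struct fc_model {
	unsigned int state;	/* currently set bits */
	int step;		/* how far the commit thread has got */
};

/* One commit-thread step: clear FLUSHING_DATA, set COMMITTING, clear it. */
static void commit_step(struct fc_model *m)
{
	switch (m->step) {
	case 0: m->state &= ~FC_FLUSHING_DATA; break;
	case 1: m->state |= FC_COMMITTING; break;
	case 2: m->state &= ~FC_COMMITTING; break;
	default: return;	/* commit finished, nothing left to do */
	}
	m->step++;
}

/*
 * Model of waiting for a bit: the commit thread runs until the bit clears,
 * then runs one extra step before the waiter gets the lock back.
 */
static void wait_bit_clear(struct fc_model *m, unsigned int bit)
{
	bool slept = false;

	while (m->state & bit) {
		commit_step(m);
		slept = true;
	}
	if (slept)
		commit_step(m);
}

/* Old order: check COMMITTING once, then wait for FLUSHING_DATA. */
static bool del_old_order_is_safe(void)
{
	struct fc_model m = { .state = FC_FLUSHING_DATA, .step = 0 };

	wait_bit_clear(&m, FC_COMMITTING);
	wait_bit_clear(&m, FC_FLUSHING_DATA);
	return !(m.state & FC_COMMITTING);	/* safe only if bit is clear */
}

/* New order: re-check COMMITTING after every FLUSHING_DATA wait. */
static bool del_recheck_is_safe(void)
{
	struct fc_model m = { .state = FC_FLUSHING_DATA, .step = 0 };

	for (;;) {
		wait_bit_clear(&m, FC_COMMITTING);
		if (!(m.state & FC_FLUSHING_DATA))
			break;
		wait_bit_clear(&m, FC_FLUSHING_DATA);
	}
	return !(m.state & FC_COMMITTING);
}
```

In this model the old order reaches the deletion point with COMMITTING
set, while the re-check loop only exits once both bits are clear.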

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v8:
- Factor out small ext4_fc_wait_inode_state()/ext4_fc_wake_inode_state()
  helpers so the repeated FC state wait/wake mapping is kept in one place.
- Re-check EXT4_STATE_FC_COMMITTING after waking from
  EXT4_STATE_FC_FLUSHING_DATA in ext4_fc_del(), so list deletion only
  happens after both predicates pass under the same s_fc_lock critical
  section.

 fs/ext4/fast_commit.c | 124 +++++++++++++++++++++++++-----------------
 1 file changed, 75 insertions(+), 49 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 673668860e2d..8a6981e50ffe 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -235,6 +235,37 @@ static bool ext4_fc_eligible(struct super_block *sb)
 		!(ext4_test_mount_flag(sb, EXT4_MF_FC_INELIGIBLE));
 }
 
+/*
+ * Wait for an inode fast-commit state bit to clear while dropping the
+ * fast-commit lock around schedule().
+ */
+static void ext4_fc_wait_inode_state(struct inode *inode, int bit,
+				     int *alloc_ctx)
+{
+	wait_queue_head_t *wq;
+	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
+	int wait_bit = ext4_inode_state_wait_bit(bit);
+
+	while (ext4_test_inode_state(inode, bit)) {
+		DEFINE_WAIT_BIT(wait, wait_word, wait_bit);
+
+		wq = bit_waitqueue(wait_word, wait_bit);
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, bit)) {
+			ext4_fc_unlock(inode->i_sb, *alloc_ctx);
+			schedule();
+			*alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+}
+
+static inline void ext4_fc_wake_inode_state(struct inode *inode, int bit)
+{
+	wake_up_bit(ext4_inode_state_wait_word(inode),
+		    ext4_inode_state_wait_bit(bit));
+}
+
 /*
  * Remove inode from fast commit list. If the inode is being committed
  * we wait until inode commit is done.
@@ -243,12 +274,6 @@ void ext4_fc_del(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_dentry_update *fc_dentry;
-	wait_queue_head_t *wq;
-	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
-	int committing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
-	int flushing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
 	int alloc_ctx;
 
 	if (ext4_fc_disabled(inode->i_sb))
@@ -263,32 +288,19 @@ void ext4_fc_del(struct inode *inode)
 
 	/*
 	 * Wait for ongoing fast commit to finish. We cannot remove the inode
-	 * from fast commit lists while it is being committed.
+	 * from fast commit lists while it is being committed. If we wake from
+	 * FC_FLUSHING_DATA, re-check FC_COMMITTING before deleting because the
+	 * commit thread sets FC_COMMITTING only after clearing FLUSHING_DATA.
 	 */
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-		DEFINE_WAIT_BIT(wait, wait_word, committing_wait_bit);
+	for (;;) {
+		ext4_fc_wait_inode_state(inode, EXT4_STATE_FC_COMMITTING,
+					 &alloc_ctx);
 
-		wq = bit_waitqueue(wait_word, committing_wait_bit);
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-			ext4_fc_unlock(inode->i_sb, alloc_ctx);
-			schedule();
-			alloc_ctx = ext4_fc_lock(inode->i_sb);
-		}
-		finish_wait(wq, &wait.wq_entry);
-	}
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
-		DEFINE_WAIT_BIT(wait, wait_word, flushing_wait_bit);
+		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA))
+			break;
 
-		wq = bit_waitqueue(wait_word, flushing_wait_bit);
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
-			ext4_fc_unlock(inode->i_sb, alloc_ctx);
-			schedule();
-			alloc_ctx = ext4_fc_lock(inode->i_sb);
-		}
-		finish_wait(wq, &wait.wq_entry);
+		ext4_fc_wait_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA,
+					 &alloc_ctx);
 	}
 
 	ext4_fc_free_inode_snap(inode);
@@ -1184,13 +1196,12 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
-		inodes[i] = igrab(&iter->vfs_inode);
-		if (inodes[i])
-			i++;
+		inodes[i++] = &iter->vfs_inode;
 	}
 
 	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
 		struct ext4_inode_info *ei;
+		struct inode *inode;
 
 		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
 			continue;
@@ -1200,12 +1211,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		/* See the comment in ext4_fc_commit_dentry_updates(). */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				      struct ext4_inode_info, i_fc_dilist);
+		inode = &ei->vfs_inode;
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
-		inodes[i] = igrab(&ei->vfs_inode);
-		if (inodes[i])
-			i++;
+		/*
+		 * Create-only inodes may only be referenced via fcd_dilist and
+		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
+		 * we are snapshotting, but inode eviction calls ext4_fc_del(),
+		 * which waits for FC_COMMITTING to clear. Mark them FC_COMMITTING
+		 * so the inode stays pinned and the snapshot stays valid until
+		 * ext4_fc_cleanup().
+		 */
+		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+		inodes[i++] = inode;
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
@@ -1215,10 +1234,6 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 			break;
 	}
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		if (inodes[nr_inodes])
-			iput(inodes[nr_inodes]);
-	}
 	kvfree(inodes);
 	return ret;
 }
@@ -1234,8 +1249,6 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
-	int flushing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1261,8 +1274,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		ext4_clear_inode_state(&iter->vfs_inode,
 				       EXT4_STATE_FC_FLUSHING_DATA);
-		wake_up_bit(ext4_inode_state_wait_word(&iter->vfs_inode),
-			    flushing_wait_bit);
+		ext4_fc_wake_inode_state(&iter->vfs_inode,
+					 EXT4_STATE_FC_FLUSHING_DATA);
 	}
 
 	/*
@@ -1285,8 +1298,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	jbd2_journal_lock_updates(journal);
 	/*
 	 * The journal is now locked. No more handles can start and all the
-	 * previous handles are now drained. We now mark the inodes on the
-	 * commit queue as being committed.
+	 * previous handles are now drained. Snapshotting happens in this
+	 * window so log writing can consume only stable snapshots without
+	 * doing logical-to-physical mapping.
 	 */
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
@@ -1482,8 +1496,6 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 	struct ext4_inode_info *ei;
 	struct ext4_fc_dentry_update *fc_dentry;
 	int alloc_ctx;
-	int committing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
 
 	if (full && sbi->s_fc_bh)
 		sbi->s_fc_bh = NULL;
@@ -1521,8 +1533,8 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
-		wake_up_bit(ext4_inode_state_wait_word(&ei->vfs_inode),
-			    committing_wait_bit);
+		ext4_fc_wake_inode_state(&ei->vfs_inode,
+					 EXT4_STATE_FC_COMMITTING);
 	}
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
@@ -1537,6 +1549,20 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					      struct ext4_inode_info,
 					      i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
+			spin_lock(&ei->i_fc_lock);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_REQUEUE);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_COMMITTING);
+			spin_unlock(&ei->i_fc_lock);
+			/*
+			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+			 * visible before we send the wakeup. Pairs with implicit
+			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 */
+			smp_mb();
+			ext4_fc_wake_inode_state(&ei->vfs_inode,
+						 EXT4_STATE_FC_COMMITTING);
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
-- 
2.53.0

* [RFC v8 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.

Testing: observed the ext4:ext4_fc_commit_stop tracepoint across two fsyncs
in the same transaction. nblks is the number of journal blocks written for
that fast commit. Before this change, the second fsync still wrote almost
the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).
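
The requeue decision described above can be sketched in isolation as
follows. This is a userspace model, not the kernel implementation: struct
inode_model and should_requeue() are hypothetical names, and tid_geq() is
re-implemented locally as a wraparound-safe comparison, assuming a 32-bit
tid_t.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;	/* assumption: transaction IDs are 32-bit */

/* Wraparound-safe "x >= y" for transaction IDs (local re-implementation). */
static bool tid_geq(tid_t x, tid_t y)
{
	return (int32_t)(x - y) >= 0;
}

/* Hypothetical model of the per-inode fields the decision looks at. */
struct inode_model {
	tid_t i_sync_tid;	/* tid of the last update to this inode */
	bool fc_requeue;	/* models EXT4_STATE_FC_REQUEUE */
};

/*
 * After a full commit the tid comparison is meaningful. After a fast
 * commit the callback sees tid == 0, so only the explicit flag is used.
 */
static bool should_requeue(const struct inode_model *ei, bool full, tid_t tid)
{
	if (full)
		return !tid_geq(tid, ei->i_sync_tid);
	return ei->fc_requeue;
}
```

With this split, an inode left untouched during a fast commit is dropped
from the queue even though tid == 0, which matches the nblks 10->1 result
reported above.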

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in RFC v8:
- Base the series on "ext4: fix fast commit wait/wake bit mapping on
  64-bit", so the FC_COMMITTING/FC_FLUSHING_DATA wait and wake paths use
  the shared helper mapping.

 fs/ext4/ext4.h        |   1 +
 fs/ext4/fast_commit.c | 106 ++++++++++++++++++++----------------------
 2 files changed, 51 insertions(+), 56 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 115a3c94db16..927173bc8381 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1991,6 +1991,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_FC_REQUEUE,		/* Inode modified during fast commit */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 0c49144e8ca2..673668860e2d 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -62,9 +62,8 @@
  *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
  *     needed for log writing.
  * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
- *     starting of new handles. If new handles try to start an update on
- *     any of the inodes that are being committed, ext4_fc_track_inode()
- *     will block until those inodes have finished the fast commit.
+ *     starting of new handles. Updates to inodes being fast committed are
+ *     tracked for requeue rather than blocking.
  * [6] Commit all the directory entry updates in the fast commit space.
  * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
@@ -218,6 +217,7 @@ void ext4_fc_init_inode(struct inode *inode)
 
 	ext4_fc_reset_inode(inode);
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	ei->i_fc_snap = NULL;
@@ -245,7 +245,10 @@ void ext4_fc_del(struct inode *inode)
 	struct ext4_fc_dentry_update *fc_dentry;
 	wait_queue_head_t *wq;
 	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
-	int wait_bit = ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
+	int committing_wait_bit =
+		ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
+	int flushing_wait_bit =
+		ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
 	int alloc_ctx;
 
 	if (ext4_fc_disabled(inode->i_sb))
@@ -259,26 +262,26 @@ void ext4_fc_del(struct inode *inode)
 	}
 
 	/*
-	 * Since ext4_fc_del is called from ext4_evict_inode while having a
-	 * handle open, there is no need for us to wait here even if a fast
-	 * commit is going on. That is because, if this inode is being
-	 * committed, ext4_mark_inode_dirty would have waited for inode commit
-	 * operation to finish before we come here. So, by the time we come
-	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
-	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
-	 * here.
-	 *
-	 * We may come here without any handles open in the "no_delete" case of
-	 * ext4_evict_inode as well. However, if that happens, we first mark the
-	 * file system as fast commit ineligible anyway. So, even in that case,
-	 * it is okay to remove the inode from the fc list.
+	 * Wait for ongoing fast commit to finish. We cannot remove the inode
+	 * from fast commit lists while it is being committed.
 	 */
-	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
-		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		DEFINE_WAIT_BIT(wait, wait_word, committing_wait_bit);
+
+		wq = bit_waitqueue(wait_word, committing_wait_bit);
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			ext4_fc_unlock(inode->i_sb, alloc_ctx);
+			schedule();
+			alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
-		DEFINE_WAIT_BIT(wait, wait_word, wait_bit);
+		DEFINE_WAIT_BIT(wait, wait_word, flushing_wait_bit);
 
-		wq = bit_waitqueue(wait_word, wait_bit);
+		wq = bit_waitqueue(wait_word, flushing_wait_bit);
 		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 		if (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
 			ext4_fc_unlock(inode->i_sb, alloc_ctx);
@@ -287,19 +290,22 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+
 	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
-	 * Since this inode is getting removed, let's also remove all FC
-	 * dentry create references, since it is not needed to log it anyways.
+	 * Since this inode is getting removed, let's also remove all FC dentry
+	 * create references, since it is not needed to log it anyways.
 	 */
 	if (list_empty(&ei->i_fc_dilist)) {
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
 
-	fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist);
+	fc_dentry = list_first_entry(&ei->i_fc_dilist,
+				     struct ext4_fc_dentry_update,
+				     fcd_dilist);
 	WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT);
 	list_del_init(&fc_dentry->fcd_list);
 	list_del_init(&fc_dentry->fcd_dilist);
@@ -371,6 +377,8 @@ static int ext4_fc_track_template(
 
 	tid = handle->h_transaction->t_tid;
 	spin_lock(&ei->i_fc_lock);
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -541,10 +549,6 @@ static int __track_inode(handle_t *handle, struct inode *inode, void *arg,
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	wait_queue_head_t *wq;
-	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
-	int wait_bit = ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -560,21 +564,11 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		return;
 
 	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
+	 * Fast commit snapshots inode state at commit time, so there's no need
+	 * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already
+	 * on the commit queue, ext4_fc_cleanup() will requeue it for the new
+	 * transaction once the current commit finishes.
 	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-		DEFINE_WAIT_BIT(wait, wait_word, wait_bit);
-
-		wq = bit_waitqueue(wait_word, wait_bit);
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
-			schedule();
-		finish_wait(wq, &wait.wq_entry);
-	}
 
 	/*
 	 * From this point on, this inode will not be committed either
@@ -1499,32 +1493,32 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) {
+		bool requeue;
+
 		ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN],
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
 		ext4_fc_free_inode_snap(&ei->vfs_inode);
+		spin_lock(&ei->i_fc_lock);
+		if (full)
+			requeue = !tid_geq(tid, ei->i_sync_tid);
+		else
+			requeue = ext4_test_inode_state(&ei->vfs_inode,
+							EXT4_STATE_FC_REQUEUE);
+		if (!requeue)
+			ext4_fc_reset_inode(&ei->vfs_inode);
+		ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
-		if (tid_geq(tid, ei->i_sync_tid)) {
-			ext4_fc_reset_inode(&ei->vfs_inode);
-		} else if (full) {
-			/*
-			 * We are called after a full commit, inode has been
-			 * modified while the commit was running. Re-enqueue
-			 * the inode into STAGING, which will then be splice
-			 * back into MAIN. This cannot happen during
-			 * fastcommit because the journal is locked all the
-			 * time in that case (and tid doesn't increase so
-			 * tid check above isn't reliable).
-			 */
+		spin_unlock(&ei->i_fc_lock);
+		if (requeue)
 			list_add_tail(&ei->i_fc_list,
 				      &sbi->s_fc_q[FC_Q_STAGING]);
-		}
 		/*
 		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
 		 * visible before we send the wakeup. Pairs with implicit
-		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
 		wake_up_bit(ext4_inode_state_wait_word(&ei->vfs_inode),
-- 
2.53.0

* [RFC v8 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode's i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode's i_data_sem from a regular inode's
i_data_sem. The journal inode is not tracked by fast commit and no FC
waiters ever depend on it, so this is not a real ABBA deadlock. Assign the
journal inode a dedicated i_data_sem lockdep subclass to avoid the false
positive.

Inode cache objects can be recycled, so also reset the i_data_sem lockdep
subclass to I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new
inode may inherit an old subclass (journal/quota/ea) and trigger lockdep
warnings.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
  without changing the subclassing logic.

 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e337a37bb6fb..115a3c94db16 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1015,12 +1015,14 @@ do {										\
  *			  than the first
  *  I_DATA_SEM_QUOTA  - Used for quota inodes only
  *  I_DATA_SEM_EA     - Used for ea_inodes only
+ *  I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL = 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
 
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3c869f0001c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1431,6 +1431,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
 
@@ -5910,6 +5913,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		  journal_inode, journal_inode->i_size);
 	return journal_inode;
-- 
2.53.0


* [RFC v8 1/7] ext4: fast commit: snapshot inode state before writing log
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.
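
As a rough illustration of the two-phase idea (copy while writers are held
off, then serialize from the copy only), here is a small userspace sketch. It
is not the kernel code: struct live_inode, struct inode_snap and the helper
names are hypothetical, and plain malloc() stands in for the kernel allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the live, mutable inode image. */
struct live_inode {
	unsigned char raw[256];
	size_t len;			/* number of valid bytes in raw[] */
};

/* Private copy taken while updates are blocked. */
struct inode_snap {
	size_t len;
	unsigned char buf[];		/* flexible array, sized at allocation */
};

/* Phase 1: copy live state; the caller must hold off writers. */
static struct inode_snap *snap_take(const struct live_inode *in)
{
	struct inode_snap *s;

	if (in->len > sizeof(in->raw))
		return NULL;
	s = malloc(sizeof(*s) + in->len);
	if (!s)
		return NULL;
	s->len = in->len;
	memcpy(s->buf, in->raw, in->len);
	return s;
}

/* Phase 2: serialize from the snapshot only; live state is never read. */
static size_t snap_write(const struct inode_snap *s, unsigned char *dst,
			 size_t cap)
{
	if (!s || s->len > cap)
		return 0;
	memcpy(dst, s->buf, s->len);
	return s->len;
}

/* Returns 0 if a live update after the snapshot does not leak into the log. */
static int snap_demo(void)
{
	struct live_inode in = { .raw = "old", .len = 3 };
	unsigned char out[8] = { 0 };
	struct inode_snap *s = snap_take(&in);
	size_t n;

	if (!s)
		return -1;
	memcpy(in.raw, "new", 3);	/* concurrent update after the snapshot */
	n = snap_write(s, out, sizeof(out));
	free(s);
	return (n == 3 && memcmp(out, "old", 3) == 0) ? 0 : 1;
}
```

The point of the split is that phase 2 can run after new handles start without
observing their changes.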

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v7:
- Drop the stale i_fc_wait initialization after rebasing onto the new
  linux-next base.

Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Fix the inode debug print format after rebasing.

 fs/ext4/ext4.h        |  22 ++-
 fs/ext4/fast_commit.c | 331 +++++++++++++++++++++++++++++++++++-------
 fs/ext4/inode.c       |  51 +++++++
 3 files changed, 352 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6569d1d575a0..e337a37bb6fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1023,6 +1023,7 @@ enum {
 	I_DATA_SEM_EA
 };
 
+struct ext4_fc_inode_snap;
 
 /*
  * fourth extended file system inode data in memory
@@ -1079,6 +1080,22 @@ struct ext4_inode_info {
 	/* End of lblk range that needs to be committed in this fast commit */
 	ext4_lblk_t i_fc_lblk_len;
 
+	/*
+	 * Commit-time fast commit snapshots.
+	 *
+	 * i_fc_snap is installed and freed under sbi->s_fc_lock. The fast
+	 * commit log writing path reads the snapshot under sbi->s_fc_lock while
+	 * serializing fast commit TLVs.
+	 *
+	 * The snapshot lifetime is bounded by EXT4_STATE_FC_COMMITTING and the
+	 * corresponding cleanup / eviction paths.
+	 *
+	 * i_fc_snap points to per-inode snapshot data for fast commit:
+	 * - a raw inode snapshot for EXT4_FC_TAG_INODE
+	 * - data range records for EXT4_FC_TAG_{ADD,DEL}_RANGE
+	 */
+	struct ext4_fc_inode_snap *i_fc_snap;
+
 	spinlock_t i_raw_lock;	/* protects updates to the raw inode */
 
 	/*
@@ -3100,8 +3117,9 @@ extern int  ext4_file_getattr(struct mnt_idmap *, const struct path *,
 			      struct kstat *, u32, unsigned int);
 extern void ext4_dirty_inode(struct inode *, int);
 extern int ext4_change_inode_journal_flag(struct inode *, int);
-extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
-extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 1775bce9649a..0c49144e8ca2 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -56,21 +56,23 @@
  *     deleted while it is being flushed.
  * [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA"
  *     state.
- * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures that
- *     all the exsiting handles finish and no new handles can start.
- * [4] Mark all the fast commit eligible inodes as undergoing fast commit
- *     by setting "EXT4_STATE_FC_COMMITTING" state.
- * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This allows
+ * [3] Lock the journal by calling jbd2_journal_lock_updates(). This ensures
+ *     that all the existing handles finish and no new handles can start.
+ * [4] Mark all the fast commit eligible inodes as undergoing fast commit by
+ *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
+ *     needed for log writing.
+ * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
  *     starting of new handles. If new handles try to start an update on
  *     any of the inodes that are being committed, ext4_fc_track_inode()
  *     will block until those inodes have finished the fast commit.
  * [6] Commit all the directory entry updates in the fast commit space.
- * [7] Commit all the changed inodes in the fast commit space and clear
- *     "EXT4_STATE_FC_COMMITTING" for these inodes.
+ * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
  *     section for more details).
+ * [9] Clear "EXT4_STATE_FC_COMMITTING" and wake up waiters in
+ *     ext4_fc_cleanup().
  *
- * All the inode updates must be enclosed within jbd2_jounrnal_start()
+ * All the inode updates must be enclosed within jbd2_journal_start()
  * and jbd2_journal_stop() similar to JBD2 journaling.
  *
  * Fast Commit Ineligibility
@@ -200,6 +202,8 @@ static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 	unlock_buffer(bh);
 }
 
+static void ext4_fc_free_inode_snap(struct inode *inode);
+
 static inline void ext4_fc_reset_inode(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -216,6 +220,7 @@ void ext4_fc_init_inode(struct inode *inode)
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
+	ei->i_fc_snap = NULL;
 }
 
 static bool ext4_fc_disabled(struct super_block *sb)
@@ -248,6 +253,7 @@ void ext4_fc_del(struct inode *inode)
 
 	alloc_ctx = ext4_fc_lock(inode->i_sb);
 	if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) {
+		ext4_fc_free_inode_snap(inode);
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
@@ -281,6 +287,7 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
@@ -817,6 +824,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 	return true;
 }
 
+struct ext4_fc_range {
+	struct list_head list;
+	u16 tag;
+	ext4_lblk_t lblk;
+	ext4_lblk_t len;
+	ext4_fsblk_t pblk;
+	bool unwritten;
+};
+
+struct ext4_fc_inode_snap {
+	struct list_head data_list;
+	unsigned int inode_len;
+	u8 inode_buf[];
+};
+
 /*
  * Writes inode in the fast commit space under TLV with tag @tag.
  * Returns 0 on success, error on failure.
@@ -824,21 +846,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
-	int ret;
-	struct ext4_iloc iloc;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_inode fc_inode;
 	struct ext4_fc_tl tl;
 	u8 *dst;
+	u8 *src;
+	int inode_len;
+	int ret;
 
-	ret = ext4_get_inode_loc(inode, &iloc);
-	if (ret)
-		return ret;
+	if (!snap)
+		return -ECANCELED;
 
-	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
-		inode_len = EXT4_INODE_SIZE(inode->i_sb);
-	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
-		inode_len += ei->i_extra_isize;
+	src = snap->inode_buf;
+	inode_len = snap->inode_len;
+	if (!src || inode_len == 0)
+		return -ECANCELED;
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -854,10 +876,9 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	dst += EXT4_FC_TAG_BASE_LEN;
 	memcpy(dst, &fc_inode, sizeof(fc_inode));
 	dst += sizeof(fc_inode);
-	memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	memcpy(dst, src, inode_len);
 	ret = 0;
 err:
-	brelse(iloc.bh);
 	return ret;
 }
 
@@ -867,12 +888,74 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
  */
 static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 {
-	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_map_blocks map;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_add_range fc_ext;
 	struct ext4_fc_del_range lrange;
 	struct ext4_extent *ex;
+	struct ext4_fc_range *range;
+
+	if (!snap)
+		return -ECANCELED;
+
+	list_for_each_entry(range, &snap->data_list, list) {
+		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
+			lrange.fc_ino = cpu_to_le32(inode->i_ino);
+			lrange.fc_lblk = cpu_to_le32(range->lblk);
+			lrange.fc_len = cpu_to_le32(range->len);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+					     sizeof(lrange), (u8 *)&lrange, crc))
+				return -ENOSPC;
+			continue;
+		}
+
+		fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+		ex = (struct ext4_extent *)&fc_ext.fc_ex;
+		ex->ee_block = cpu_to_le32(range->lblk);
+		ex->ee_len = cpu_to_le16(range->len);
+		ext4_ext_store_pblock(ex, range->pblk);
+		if (range->unwritten)
+			ext4_ext_mark_unwritten(ex);
+		else
+			ext4_ext_mark_initialized(ex);
+
+		if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+				     sizeof(fc_ext), (u8 *)&fc_ext, crc))
+			return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static void ext4_fc_free_ranges(struct list_head *head)
+{
+	struct ext4_fc_range *range, *range_n;
+
+	list_for_each_entry_safe(range, range_n, head, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+static void ext4_fc_free_inode_snap(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+
+	if (!snap)
+		return;
+
+	ext4_fc_free_ranges(&snap->data_list);
+	kfree(snap);
+	ei->i_fc_snap = NULL;
+}
+
+static int ext4_fc_snapshot_inode_data(struct inode *inode,
+				       struct list_head *ranges)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	struct ext4_map_blocks map;
 	int ret;
 
 	spin_lock(&ei->i_fc_lock);
@@ -880,18 +963,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 		spin_unlock(&ei->i_fc_lock);
 		return 0;
 	}
-	old_blk_size = ei->i_fc_lblk_start;
-	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+	start_lblk = ei->i_fc_lblk_start;
+	end_lblk = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
 	ei->i_fc_lblk_len = 0;
 	spin_unlock(&ei->i_fc_lock);
 
-	cur_lblk_off = old_blk_size;
-	ext4_debug("will try writing %d to %d for inode %llu\n",
-		   cur_lblk_off, new_blk_size, inode->i_ino);
+	cur_lblk = start_lblk;
+	ext4_debug("snapshot data ranges %u-%u for inode %llu\n",
+		   start_lblk, end_lblk,
+		   (unsigned long long)inode->i_ino);
+
+	while (cur_lblk <= end_lblk) {
+		struct ext4_fc_range *range;
 
-	while (cur_lblk_off <= new_blk_size) {
-		map.m_lblk = cur_lblk_off;
-		map.m_len = new_blk_size - cur_lblk_off + 1;
+		map.m_lblk = cur_lblk;
+		map.m_len = end_lblk - cur_lblk + 1;
 		ret = ext4_map_blocks(NULL, inode, &map,
 				      EXT4_GET_BLOCKS_IO_SUBMIT |
 				      EXT4_EX_NOCACHE);
@@ -899,17 +985,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			return -ECANCELED;
 
 		if (map.m_len == 0) {
-			cur_lblk_off++;
+			cur_lblk++;
 			continue;
 		}
 
+		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (!range)
+			return -ENOMEM;
+
+		range->lblk = map.m_lblk;
+		range->len = map.m_len;
+		range->pblk = 0;
+		range->unwritten = false;
+
 		if (ret == 0) {
-			lrange.fc_ino = cpu_to_le32(inode->i_ino);
-			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
-			lrange.fc_len = cpu_to_le32(map.m_len);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
-					    sizeof(lrange), (u8 *)&lrange, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_DEL_RANGE;
 		} else {
 			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
 				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
@@ -917,26 +1007,67 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			/* Limit the number of blocks in one extent */
 			map.m_len = min(max, map.m_len);
 
-			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
-			ex = (struct ext4_extent *)&fc_ext.fc_ex;
-			ex->ee_block = cpu_to_le32(map.m_lblk);
-			ex->ee_len = cpu_to_le16(map.m_len);
-			ext4_ext_store_pblock(ex, map.m_pblk);
-			if (map.m_flags & EXT4_MAP_UNWRITTEN)
-				ext4_ext_mark_unwritten(ex);
-			else
-				ext4_ext_mark_initialized(ex);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
-					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_ADD_RANGE;
+			range->len = map.m_len;
+			range->pblk = map.m_pblk;
+			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
 		}
 
-		cur_lblk_off += map.m_len;
+		INIT_LIST_HEAD(&range->list);
+		list_add_tail(&range->list, ranges);
+
+		cur_lblk += map.m_len;
 	}
 
 	return 0;
 }
 
+static int ext4_fc_snapshot_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap;
+	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+	struct ext4_iloc iloc;
+	LIST_HEAD(ranges);
+	int ret;
+	int alloc_ctx;
+
+	ret = ext4_get_inode_loc_noio(inode, &iloc);
+	if (ret)
+		return ret;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
+		inode_len = EXT4_INODE_SIZE(inode->i_sb);
+	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+		inode_len += ei->i_extra_isize;
+
+	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
+	if (!snap) {
+		brelse(iloc.bh);
+		return -ENOMEM;
+	}
+	INIT_LIST_HEAD(&snap->data_list);
+	snap->inode_len = inode_len;
+
+	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	brelse(iloc.bh);
+
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	if (ret) {
+		kfree(snap);
+		ext4_fc_free_ranges(&ranges);
+		return ret;
+	}
+
+	alloc_ctx = ext4_fc_lock(inode->i_sb);
+	ext4_fc_free_inode_snap(inode);
+	ei->i_fc_snap = snap;
+	list_splice_tail_init(&ranges, &snap->data_list);
+	ext4_fc_unlock(inode->i_sb, alloc_ctx);
+
+	return 0;
+}
+
 
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
@@ -987,6 +1118,11 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 		 */
 		if (list_empty(&fc_dentry->fcd_dilist))
 			continue;
+		/*
+		 * For EXT4_FC_TAG_CREAT, fcd_dilist is linked on the created
+		 * inode's i_fc_dilist list (kept singular), so we can recover the
+		 * inode through it.
+		 */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				struct ext4_inode_info, i_fc_dilist);
 		inode = &ei->vfs_inode;
@@ -1011,6 +1147,88 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
+static int ext4_fc_snapshot_inodes(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct inode **inodes;
+	unsigned int nr_inodes = 0;
+	unsigned int i = 0;
+	int ret = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	if (!nr_inodes)
+		return 0;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		inodes[i] = igrab(&iter->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		inodes[i] = igrab(&ei->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+		if (ret)
+			break;
+	}
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		if (inodes[nr_inodes])
+			iput(inodes[nr_inodes]);
+	}
+	kvfree(inodes);
+	return ret;
+}
+
 static int ext4_fc_perform_commit(journal_t *journal)
 {
 	struct super_block *sb = journal->j_private;
@@ -1082,7 +1300,11 @@ static int ext4_fc_perform_commit(journal_t *journal)
 				     EXT4_STATE_FC_COMMITTING);
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
+
+	ret = ext4_fc_snapshot_inodes(journal);
 	jbd2_journal_unlock_updates(journal);
+	if (ret)
+		return ret;
 
 	/*
 	 * Step 5: If file system device is different from journal device,
@@ -1281,6 +1503,7 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
+		ext4_fc_free_inode_snap(&ei->vfs_inode);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (tid_geq(tid, ei->i_sync_tid)) {
@@ -1313,6 +1536,14 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					     struct ext4_fc_dentry_update,
 					     fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
+		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
+		    !list_empty(&fc_dentry->fcd_dilist)) {
+			/* See the comment in ext4_fc_commit_dentry_updates(). */
+			ei = list_first_entry(&fc_dentry->fcd_dilist,
+					      struct ext4_inode_info,
+					      i_fc_dilist);
+			ext4_fc_free_inode_snap(&ei->vfs_inode);
+		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
 		release_dentry_name_snapshot(&fc_dentry->fcd_name);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..4678612f82e8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5025,6 +5025,57 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 	return ret;
 }
 
+/*
+ * ext4_get_inode_loc_noio() is a best-effort variant of ext4_get_inode_loc().
+ * It looks up the inode table block in the buffer cache and returns -EAGAIN if
+ * the block is not present or not uptodate, without starting any I/O.
+ */
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *bh;
+	ext4_fsblk_t block;
+	int inodes_per_block, inode_offset;
+	unsigned long ino = inode->i_ino;
+
+	iloc->bh = NULL;
+	if (ino < EXT4_ROOT_INO ||
+	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+		return -EFSCORRUPTED;
+
+	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/* Figure out the offset within the block group inode table. */
+	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	inode_offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb));
+	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+
+	block = ext4_inode_table(sb, gdp);
+	if (block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) ||
+	    block >= ext4_blocks_count(EXT4_SB(sb)->s_es)) {
+		ext4_error(sb,
+			   "Invalid inode table block %llu in block_group %u",
+			   block, iloc->block_group);
+		return -EFSCORRUPTED;
+	}
+	block += inode_offset / inodes_per_block;
+
+	bh = sb_find_get_block(sb, block);
+	if (!bh)
+		return -EAGAIN;
+	if (!ext4_buffer_uptodate(bh)) {
+		brelse(bh);
+		return -EAGAIN;
+	}
+
+	iloc->bh = bh;
+	return 0;
+}
+
 
 int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc)
-- 
2.53.0


* [RFC v8 0/7] ext4: fast commit: snapshot inode state for FC log
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4,
	linux-trace-kernel, linux-kernel

Hi,

(This RFC v8 series is rebased onto linux-next master as of 2026-05-09,
commit e98d21c170b0 ("Add linux-next specific files for 20260508"), and
depends on patch "ext4: fix fast commit wait/wake bit mapping on
64-bit" [0]).

In the RFC v3 review, Zhang Yi pointed out that postponing lockdep assertions
only masks the issue, and that sleeping in ext4_fc_track_inode() while holding
i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
i_data_sem while the inode is in FC_COMMITTING.

Zhang Yi suggested two possible directions to address the root cause:

1. "Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.

  Change ext4_fc_commit() to lookup logical to physical mapping using extent
  status tree. This would get rid of the need to call ext4_fc_track_inode()
  before acquiring i_data_sem. To do that we would need to ensure that
  modified extents from the extent status tree are not evicted from memory."

2. "Alternatively, recording the mapped range of tracking might also be
feasible."

This series takes a hybrid approach: it implements approach 2 by snapshotting
the inode image and mapped ranges at commit time, and consuming only snapshots
during log writing.

Approach 2 still needs a mapping source while building the snapshot
(logical-to-physical and unwritten/hole semantics). Calling ext4_map_blocks()
there would take i_data_sem and can block inside the
jbd2_journal_lock_updates() window, which risks deadlocks or unbounded stalls.
So the snapshot path uses approach 1's extent status lookups as a best-effort
mapping source to avoid ext4_map_blocks().

I did not fully implement approach 1 (making extent status lookups
authoritative by preventing reclaim of needed entries) because that would need
additional pinning/integration under memory pressure and a larger correctness
surface. Instead, the extent status tree is treated as a cache and the
snapshot path falls back to full commit on cache misses or unstable mappings
(e.g. delayed allocation).

Lock inversion / deadlock model (before):

CPU0 (metadata update)               CPU1 (fast commit)
----------------------               ------------------
... hold i_data_sem (A)              mutex_lock(s_fc_lock) (B)
    ext4_fc_track_inode()              ext4_fc_write_inode_data()
      mutex_lock(s_fc_lock) (B)          ext4_map_blocks()
      wait FC_COMMITTING (sleep)           down_read(i_data_sem) (A)

This creates i_data_sem (A) -> s_fc_lock (B) on update paths, and
s_fc_lock (B) -> i_data_sem (A) on commit paths. Once CPU0 sleeps while
holding (A), CPU1 can block on (A) while holding (B), completing the ABBA
cycle.

New model (this series):

CPU0 (metadata update)               CPU1 (fast commit)
----------------------               ------------------
... maybe hold i_data_sem (A)        jbd2_journal_lock_updates()
    ext4_fc_track_*()                  snapshot inode + ranges (no map_blocks)
      mutex_lock(s_fc_lock) (B)      jbd2_journal_unlock_updates()
      if FC_COMMITTING:              s_fc_lock (B)
        set FC_REQUEUE                 write FC log from snapshots only
      no sleep                       cleanup: clear COMMITTING, requeue if set

The commit path no longer takes i_data_sem while holding s_fc_lock, and
tracking no longer sleeps waiting for FC_COMMITTING. If an inode is updated
during a fast commit, EXT4_STATE_FC_REQUEUE records that fact and the inode
is moved to FC_Q_STAGING for the next commit.

The only remaining FC_COMMITTING waiter is ext4_fc_del(). It checks
FC_COMMITTING and FC_FLUSHING_DATA while holding s_fc_lock, drops s_fc_lock
around the sleep, and rechecks FC_COMMITTING after a FC_FLUSHING_DATA wait
before deleting the inode from the fast commit lists. This keeps inode
lifetime/deletion synchronized with the commit thread's transition from
FLUSHING_DATA to COMMITTING.
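
To make the requeue idea concrete, here is a simplified userspace state
machine. It is only a sketch: the flag and queue names are hypothetical
stand-ins for EXT4_STATE_FC_COMMITTING, EXT4_STATE_FC_REQUEUE and the FC
queues, and locking is omitted entirely.

```c
#include <stdbool.h>

/* Hypothetical flag bits and queues modelled on the description above. */
enum { TOY_COMMITTING = 1 << 0, TOY_REQUEUE = 1 << 1 };
enum toy_queue { TOY_Q_NONE, TOY_Q_MAIN, TOY_Q_STAGING };

struct toy_inode {
	unsigned int state;
	enum toy_queue queue;
};

/* Tracking side: never sleeps; a concurrent update only sets a flag. */
static void toy_track(struct toy_inode *ino)
{
	if (ino->state & TOY_COMMITTING) {
		ino->state |= TOY_REQUEUE;
		return;
	}
	if (ino->queue == TOY_Q_NONE)
		ino->queue = TOY_Q_MAIN;
}

/* Commit side: mark queued inodes as being committed. */
static void toy_commit_begin(struct toy_inode *ino)
{
	if (ino->queue == TOY_Q_MAIN)
		ino->state |= TOY_COMMITTING;
}

/* Cleanup: clear COMMITTING and move requeued inodes to the next commit. */
static void toy_cleanup(struct toy_inode *ino)
{
	bool requeue = ino->state & TOY_REQUEUE;

	ino->state &= ~(TOY_COMMITTING | TOY_REQUEUE);
	ino->queue = requeue ? TOY_Q_STAGING : TOY_Q_NONE;
}

/* Run one commit cycle and report where the inode ends up. */
static enum toy_queue toy_scenario(bool update_during_commit)
{
	struct toy_inode ino = { .state = 0, .queue = TOY_Q_NONE };

	toy_track(&ino);		/* first update: queued on MAIN */
	toy_commit_begin(&ino);
	if (update_during_commit)
		toy_track(&ino);	/* sets REQUEUE and returns at once */
	toy_cleanup(&ino);
	return ino.queue;
}
```

An update that races with the commit lands the inode on the staging queue; an
undisturbed commit leaves it on no queue.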

This series snapshots the on-disk inode and tracked data ranges while journal
updates are locked and existing handles are drained. The log writing phase then
serializes only snapshots, so it no longer needs to call ext4_map_blocks() and
take i_data_sem under s_fc_lock. This is done in two steps: patch 1 drops
ext4_map_blocks() from log writing by introducing commit-time snapshots, and
patch 5 drops ext4_map_blocks() from the snapshot path by using the extent
status cache. The snapshot also records whether a mapped extent is unwritten,
so the ADD_RANGE records (and replay) preserve unwritten semantics.

Snapshotting runs under jbd2_journal_lock_updates(). Since a cache miss in
ext4_get_inode_loc() can start synchronous inode table I/O and stall handle
starts for milliseconds, patch 1 uses ext4_get_inode_loc_noio() and falls back
to full commit if the inode table block is not present or not uptodate.
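
The lookup-without-I/O pattern can be sketched in userspace as follows. This
is an illustration under assumptions, not the ext4 helper: the toy cache, its
contents and the function names are hypothetical. A block that is absent, or
cached but not yet valid, yields -EAGAIN and the caller falls back.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached block: present in the cache, maybe not yet read. */
struct toy_buf {
	unsigned long block;
	bool uptodate;
};

static const struct toy_buf toy_cache[] = {
	{ .block = 10, .uptodate = true },
	{ .block = 11, .uptodate = false },	/* cached, contents not valid */
};

/* Look up @block without starting I/O; -EAGAIN means "not usable now". */
static int toy_find_noio(unsigned long block, const struct toy_buf **out)
{
	size_t i;

	*out = NULL;
	for (i = 0; i < sizeof(toy_cache) / sizeof(toy_cache[0]); i++) {
		if (toy_cache[i].block != block)
			continue;
		if (!toy_cache[i].uptodate)
			return -EAGAIN;
		*out = &toy_cache[i];
		return 0;
	}
	return -EAGAIN;
}

/* True if a fast commit can proceed; false means fall back to full commit. */
static bool toy_can_fast_commit(unsigned long block)
{
	const struct toy_buf *bh;

	return toy_find_noio(block, &bh) == 0;
}
```

Treating both failure cases the same keeps the caller simple: any -EAGAIN
means the slow, always-correct path is taken instead.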

ext4_fc_track_inode() also stops waiting for FC_COMMITTING. Updates during an
ongoing fast commit are marked with EXT4_STATE_FC_REQUEUE and are replayed in
the next fast commit, while ext4_fc_del() waits for FC_COMMITTING so an inode
cannot be removed while the commit thread is still using it.

The extent status tree is a cache, not an authoritative source, so the snapshot
path falls back to full commit on cache misses or unstable mappings (e.g.
delayed allocation). This includes cases where extent status entries are not
present (or have been reclaimed) under memory pressure. The snapshot path does
not try to rebuild mappings by calling ext4_map_blocks(); instead it simply
marks the transaction fast commit ineligible.

To keep the updates-locked window bounded, the snapshot path caps the number of
snapshotted inodes and ranges per fast commit (currently 1024 inodes and 2048
ranges) and falls back to full commit when the cap is exceeded. The series also
handles the journal inode i_data_sem lockdep false positive via subclassing;
journal inode mapping may still take i_data_sem even when data inode mapping is
avoided.
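
A minimal sketch of the cap check, for illustration. The limits 1024 and 2048
are the values quoted above; everything else is an assumption, including the
names and the choice to count ranges as a total across the whole commit rather
than per inode.

```c
#include <stdbool.h>

#define TOY_MAX_INODES 1024u	/* cap quoted in the cover letter */
#define TOY_MAX_RANGES 2048u	/* cap quoted in the cover letter */

/* Hypothetical running totals for one fast commit. */
struct toy_budget {
	unsigned int inodes;
	unsigned int ranges;
};

/* Charge one inode with @nr_ranges ranges; false means "use full commit". */
static bool toy_charge(struct toy_budget *b, unsigned int nr_ranges)
{
	if (b->inodes >= TOY_MAX_INODES)
		return false;
	if (nr_ranges > TOY_MAX_RANGES - b->ranges)
		return false;
	b->inodes++;
	b->ranges += nr_ranges;
	return true;
}

/* Count how many inodes fit in the budget before the fallback triggers. */
static unsigned int toy_fill(unsigned int nr_inodes, unsigned int ranges_each)
{
	struct toy_budget b = { 0, 0 };
	unsigned int i;

	for (i = 0; i < nr_inodes; i++)
		if (!toy_charge(&b, ranges_each))
			break;
	return i;
}
```

Comparing against the remaining budget (TOY_MAX_RANGES - b->ranges) rather
than adding first avoids unsigned overflow on large range counts.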

Patch 6 adds the ext4_fc_lock_updates tracepoint to quantify the updates-locked
window and snapshot fallback reasons. Patch 7 extends
/proc/fs/ext4/<sb_id>/fc_info with best-effort snapshot counters. If the /proc
interface is undesirable, I can drop patch 7 and keep the tracepoint only, or
even drop both.

Testing and measurement were done on a QEMU/KVM guest with virtio-pmem + dax
(ext4 -O fast_commit, mounted dax,noatime). The workload does python3 500x
{4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)}.
Over 3 cold boots, ext4_fc_lock_updates reported locked_ns p50 2.88-2.92 us,
p99 <= 6.71 us, and max <= 102.71 us, with snap_err always 0. Under stress-ng
memory pressure (stress-ng --vm 4 --vm-bytes 75% --timeout 60s), locked_ns p50
2.94 us, p99 <= 4.97 us, and max <= 20.07 us. The fc_info snapshot failure
counters stayed at 0.

These hold times are in the low microseconds range, and the caps keep the
worst case bounded.

Comments and guidance are very welcome. Please let me know if there are any
concerns about correctness, corner cases, or better approaches.

RFC v7 -> RFC v8:
- Base the series on "ext4: fix fast commit wait/wake bit
  mapping on 64-bit", which the fast commit wait/wake paths now depend on.
- Factor out small ext4_fc_wait_inode_state()/ext4_fc_wake_inode_state()
  helpers so the repeated FC state wait/wake mapping stays in one place.
- Use trace_call__ext4_fc_lock_updates() at the guarded tracepoint call site
  so the static branch is not checked twice.
- Address the Sashiko feedback around the commit-time snapshot lifecycle and
  snapshot stats accounting, including the FC_COMMITTING /
  FC_FLUSHING_DATA transition and stale snapshot sizing fallback. [3][4]

RFC v6 -> RFC v7:
- Rebase onto linux-next master as of 2026-05-09, commit e98d21c170b0
  ("Add linux-next specific files for 20260508").
- Address Sashiko review feedback for RFC v6. [2]
- Fix the reported snapshot range arithmetic issue near EXT_MAX_BLOCKS to
  avoid cur_lblk / range wraparound in the snapshot walk.
- Report successfully snapshotted inode counts in ext4_fc_lock_updates when
  snapshotting stops early, as reported by Sashiko.
- Use READ_ONCE() + div64_u64() for the fc_info lock_updates average, as
  reported by Sashiko.

RFC v5 -> RFC v6:
- Rebase onto linux-next master as of 2026-04-08.
- Address tracepoint review feedback by relying on enum auto-increment for
  snap_err values and by switching the guarded ext4_fc_lock_updates call site
  to trace_call__ext4_fc_lock_updates() to avoid the double static_branch. [1]
- Keep lock window accounting unconditional for fc_info while using the guarded
  direct tracepoint call.
- Fix the inode debug print format exposed by the rebase.

RFC v4 -> RFC v5:
- Patch 6: Make ext4_fc_lock_updates snap_err human readable via
  TRACE_DEFINE_ENUM() + __print_symbolic(), using a single TRACE_SNAP_ERR
  mapping while keeping the enum values stable for tooling.

RFC v3 -> RFC v4:
- Replace lockdep_assert movement with removing the wait in
  ext4_fc_track_inode() and using EXT4_STATE_FC_REQUEUE to capture updates
  during an ongoing fast commit.
- Replace dropping s_fc_lock around log writing with commit-time snapshots of
  inode image and mapped ranges (recording the mapped range of tracking as
  suggested by Zhang Yi) so log writing consumes only snapshots.
- Avoid inode table I/O under jbd2_journal_lock_updates() via
  ext4_get_inode_loc_noio() and fallback to full commit on cache misses.
- Use the extent status cache for snapshot mappings and fall back to full
  commit on cache misses or unstable mappings (e.g. delayed allocation).
- Add tracepoint and /proc snapshot stats to quantify the updates-locked window
  and snapshot fallback reasons.

RFC v2 -> RFC v3:
- rebase on top of
  https://lore.kernel.org/linux-ext4/20251223131342.287864-1-me@linux.beauty/T/#u

RFC v1 -> RFC v2:
- patch 1: move comments to correct place
- patch 2: add it to patchset.
- add missing RFC prefix

RFC v1: https://lore.kernel.org/linux-ext4/20251222032655.87056-1-me@linux.beauty/T/#u
RFC v2: https://lore.kernel.org/linux-ext4/20251222151906.24607-1-me@linux.beauty/T/#t
RFC v3: https://lore.kernel.org/linux-ext4/20251224032943.134063-1-me@linux.beauty/
RFC v4: https://lore.kernel.org/all/20260120112538.132774-1-me@linux.beauty/
RFC v5: https://lore.kernel.org/all/20260317084624.457185-1-me@linux.beauty/t/#u
RFC v6: https://lore.kernel.org/all/20260408112020.716706-1-me@linux.beauty/
RFC v7: https://lore.kernel.org/all/20260511084304.1559557-1-me@linux.beauty/

[0]: https://lore.kernel.org/all/20260513085818.552432-1-me@linux.beauty/
[1]: https://lore.kernel.org/all/acZJl8QUYEq8voqQ@BLRRASHENOY1.amd.com/T/#u
[2]: https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty
[3]: https://sashiko.dev/#/patchset/20260511084304.1559557-1-me%40linux.beauty?part=4
[4]: https://sashiko.dev/#/patchset/20260511084304.1559557-1-me%40linux.beauty?part=7

Thanks,

Li Chen (7):
  ext4: fast commit: snapshot inode state before writing log
  ext4: lockdep: handle i_data_sem subclassing for special inodes
  ext4: fast commit: avoid waiting for FC_COMMITTING
  ext4: fast commit: avoid self-deadlock in inode snapshotting
  ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in
    snapshots
  ext4: fast commit: add lock_updates tracepoint
  ext4: fast commit: export snapshot stats in fc_info

 fs/ext4/ext4.h              |  73 ++++-
 fs/ext4/fast_commit.c       | 768 +++++++++++++++++++++++++++++++++++---------
 fs/ext4/inode.c             |  51 +++
 fs/ext4/super.c             |   9 +
 include/trace/events/ext4.h |  53 ++++
 5 files changed, 805 insertions(+), 149 deletions(-)

-- 
2.53.0

^ permalink raw reply

* [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-15  8:30 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-kernel, linux-trace-kernel, rostedt, Gabriele Monaco
In-Reply-To: <668f83581c58644a84cab5e6736864a439bb8e28.camel@redhat.com>

So this is what I meant. It's quick and dirty but seems to work as far
as I could test it.

I didn't move things around much, to avoid adding confusion, but it
probably needs a refactor of the function positions and names. Some AI can
do that later after we agree on how it should look.

The main idea is (using current function names):

da_handle_start_[run_]event() calls da_prepare_storage(), which makes
sure the storage is there and usable based on the strategy:
1. da_create_storage(): plain allocation with kmalloc_nolock
2. da_create_or_get_pool(): get a slot from the pool
3. da_fill_empty_storage(): only set the target in a storage manually
   allocated before

The reason why you'd need 3. is that, since da_handle_start_event() is
called from a tracepoint, you may not be able to allocate from there at
all. In that case you allocate manually somewhere else, with
da_create_empty_storage() if you don't have the target and
da_create_or_get() if you do (this one is misleading, we should probably
simplify further).

The newly created 2. might be useful if you aren't on preempt-rt and
cannot sleep but also don't want a manual allocation (beware of lock
dependencies, it doesn't always work).

Now, I left your da_create_or_get_kmalloc() unwired because I don't
really see the use case (you use kmalloc_nolock because you cannot lock,
so if it fails you don't try a kmalloc). But if we really want to offer
a possibility to allocate with GFP_KERNEL, we can make 1. more
configurable.
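
To make the three strategies concrete, here is a minimal userspace model of
the compile-time selection. It is only a sketch: the names (ALLOC_*,
struct storage, POOL_SIZE and the three helpers) are simplified stand-ins,
calloc() stands in for kmalloc_nolock, a static array stands in for the
mempool, and there is no hash table or RCU.

```c
#include <stdlib.h>

#define ALLOC_AUTO   0
#define ALLOC_POOL   1
#define ALLOC_MANUAL 2

#ifndef ALLOCATION_STRATEGY
#define ALLOCATION_STRATEGY ALLOC_AUTO
#endif

#define POOL_SIZE 2	/* hypothetical pool size for the model */

struct storage {
	int id;
	int target;	/* 0 means "no target set yet" */
	int in_use;	/* only meaningful for pool slots */
};

static struct storage pool[POOL_SIZE];

/* Strategy 1: allocate on demand if nothing exists yet. */
static struct storage *create_storage(int id, int target, struct storage *cur)
{
	if (cur)
		return cur;
	cur = calloc(1, sizeof(*cur));
	if (!cur)
		return NULL;
	cur->id = id;
	cur->target = target;
	return cur;
}

/* Strategy 2: take a free slot from a fixed pool; NULL when exhausted. */
static struct storage *create_or_get_pool(int id, int target,
					  struct storage *cur)
{
	int i;

	if (cur)
		return cur;
	for (i = 0; i < POOL_SIZE; i++) {
		if (!pool[i].in_use) {
			pool[i].id = id;
			pool[i].target = target;
			pool[i].in_use = 1;
			return &pool[i];
		}
	}
	return NULL;
}

/* Strategy 3: never allocate; only fill the target of existing storage. */
static struct storage *fill_empty_storage(int id, int target,
					  struct storage *cur)
{
	(void)id;
	if (cur && !cur->target)
		cur->target = target;
	return cur;
}

/* The start-event handler only ever calls prepare_storage(). */
#if ALLOCATION_STRATEGY == ALLOC_MANUAL
#define prepare_storage fill_empty_storage
#elif ALLOCATION_STRATEGY == ALLOC_POOL
#define prepare_storage create_or_get_pool
#else
#define prepare_storage create_storage
#endif
```

Because all three helpers share one signature, the caller does not need to
know which strategy was selected; a monitor picks one by defining
ALLOCATION_STRATEGY before including the header.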

Does this make sense to you?

Thanks,
Gabriele

---
 include/rv/da_monitor.h                  | 160 ++++++++++-------------
 kernel/trace/rv/monitors/nomiss/nomiss.c |   2 +-
 kernel/trace/rv/monitors/tlob/tlob.c     |  15 +--
 3 files changed, 74 insertions(+), 103 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 74aa95d9a284..3b4a36245531 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -21,6 +21,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/hashtable.h>
+#include <linux/mempool.h>
 
 /*
  * Per-cpu variables require a unique name although static in some
@@ -67,6 +68,35 @@ static struct rv_monitor rv_this;
 #define da_id_type int
 #endif
 
+#define DA_ALLOC_AUTO 0
+#define DA_ALLOC_POOL 1
+#define DA_ALLOC_MANUAL 2
+
+/*
+ * Allow the per-object monitors to run allocation manually, necessary if the
+ * start condition is in a context problematic for allocation (e.g. scheduling).
+ * In such case, if the storage was pre-allocated without a target, set it now.
+ */
+#ifndef DA_MON_ALLOCATION_STRATEGY
+#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_AUTO
+#endif
+#ifndef DA_MON_POOL_SIZE
+#define DA_MON_POOL_SIZE 0
+#endif
+#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_MANUAL
+#define da_prepare_storage da_fill_empty_storage
+
+#elif DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL
+#define da_prepare_storage da_create_or_get_pool
+#if DA_MON_POOL_SIZE == 0
+#error "DA_ALLOC_POOL requires DA_MON_POOL_SIZE to be non-zero"
+#endif
+
+#else
+#define da_prepare_storage da_create_storage
+#endif /* DA_MON_ALLOCATION_STRATEGY */
+
+
 static void react(enum states curr_state, enum events event)
 {
 	rv_react(&rv_this,
@@ -448,62 +478,38 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
 }
 
 /*
- * Per-object pool state.
- *
- * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
- * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
- * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
- *
- * Because every field is wrapped in this struct and the struct itself is a
- * per-TU static, each monitor that includes this header gets a completely
- * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
- * (e.g. tlob) therefore coexist without any interference.
- *
- * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
- * required to prevent deadlock with task-context callers.  On PREEMPT_RT
- * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
- */
-struct da_per_obj_pool {
-	struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
-	struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
-	unsigned int                free_top;
-	spinlock_t                  lock;
-};
-
-static struct da_per_obj_pool da_pool = {
-	.lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
-};
+ * Per-object pool state using kmem_cache and mempool.
+ */
+static struct kmem_cache *da_mon_cache;
+static mempool_t *da_mon_pool;
 
 static void da_pool_return_cb(struct rcu_head *head)
 {
 	struct da_monitor_storage *ms =
 		container_of(head, struct da_monitor_storage, rcu);
-	unsigned long flags;
-
-	spin_lock_irqsave(&da_pool.lock, flags);
-	da_pool.free[da_pool.free_top++] = ms;
-	spin_unlock_irqrestore(&da_pool.lock, flags);
+	mempool_free(ms, da_mon_pool);
 }
 
-/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
-static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
+/* Pops a slot from the pre-allocated pool; returns NULL if exhausted. */
+static inline struct da_monitor *da_create_or_get_pool(da_id_type id,
+						       monitor_target target,
+						       struct da_monitor *da_mon)
 {
 	struct da_monitor_storage *mon_storage;
-	unsigned long flags;
 
-	spin_lock_irqsave(&da_pool.lock, flags);
-	if (!da_pool.free_top) {
-		spin_unlock_irqrestore(&da_pool.lock, flags);
-		return -ENOSPC;
-	}
-	mon_storage = da_pool.free[--da_pool.free_top];
-	spin_unlock_irqrestore(&da_pool.lock, flags);
+	if (da_mon)
+		return da_mon;
 
+	mon_storage = mempool_alloc_preallocated(da_mon_pool);
+	if (!mon_storage)
+		return NULL;
+
+	memset(mon_storage, 0, sizeof(*mon_storage));
 	mon_storage->id = id;
 	mon_storage->target = target;
 	guard(rcu)();
 	hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
-	return 0;
+	return &mon_storage->rv.da_mon;
 }
 
 /*
@@ -547,11 +553,12 @@ static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target target)
 }
 
 /* Create the per-object storage if not already there. */
-static inline int da_create_or_get(da_id_type id, monitor_target target)
+// NOTE: this is only needed for manual allocation!
+// we can refactor to have it only defined there, leaving it for now
+static inline void da_create_or_get(da_id_type id, monitor_target target)
 {
-	if (da_pool.storage)
-		return da_create_or_get_pool(id, target);
-	return da_create_or_get_kmalloc(id, target);
+	guard(rcu)();
+	da_create_storage(id, target, da_get_monitor(id, target));
 }
 
 /*
@@ -573,7 +580,7 @@ static inline void da_destroy_storage(da_id_type id)
 		return;
 	da_monitor_reset_hook(&mon_storage->rv.da_mon);
 	hash_del_rcu(&mon_storage->node);
-	if (da_pool.storage)
+	if (DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL)
 		call_rcu(&mon_storage->rcu, da_pool_return_cb);
 	else
 		kfree_rcu(mon_storage, rcu);
@@ -591,41 +598,26 @@ static __maybe_unused void da_monitor_reset_all(void)
 }
 
 /*
- * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
- *
- * Allocates @prealloc_count storage slots up-front so that da_create_or_get()
- * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
- * of da_monitor_init() for monitors that require pool mode.
+ * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
  */
-static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
+static inline int da_monitor_init(void)
 {
 	hash_init(da_monitor_ht);
+	if (DA_MON_ALLOCATION_STRATEGY != DA_ALLOC_POOL)
+		return 0;
 
-	da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
-				  GFP_KERNEL);
-	if (!da_pool.storage)
+	da_mon_cache = kmem_cache_create(__stringify(DA_MON_NAME) "_cache",
+					 sizeof(struct da_monitor_storage),
+					 0, 0, NULL);
+	if (!da_mon_cache)
 		return -ENOMEM;
 
-	da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
-				     GFP_KERNEL);
-	if (!da_pool.free) {
-		kfree(da_pool.storage);
-		da_pool.storage = NULL;
+	da_mon_pool = mempool_create_slab_pool(DA_MON_POOL_SIZE, da_mon_cache);
+	if (!da_mon_pool) {
+		kmem_cache_destroy(da_mon_cache);
+		da_mon_cache = NULL;
 		return -ENOMEM;
 	}
-
-	da_pool.free_top = 0;
-	for (unsigned int i = 0; i < prealloc_count; i++)
-		da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
-	return 0;
-}
-
-/*
- * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
- */
-static inline int da_monitor_init(void)
-{
-	hash_init(da_monitor_ht);
 	return 0;
 }
 
@@ -641,11 +633,10 @@ static inline void da_monitor_destroy_pool(void)
 	 * pending callback.
 	 */
 	rcu_barrier();
-	kfree(da_pool.storage);
-	da_pool.storage = NULL;
-	kfree(da_pool.free);
-	da_pool.free = NULL;
-	da_pool.free_top = 0;
+	mempool_destroy(da_mon_pool);
+	da_mon_pool = NULL;
+	kmem_cache_destroy(da_mon_cache);
+	da_mon_cache = NULL;
 }
 
 static inline void da_monitor_destroy_kmalloc(void)
@@ -676,23 +667,12 @@ static inline void da_monitor_destroy_kmalloc(void)
  */
 static inline void da_monitor_destroy(void)
 {
-	if (da_pool.storage)
+	if (DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL)
 		da_monitor_destroy_pool();
 	else
 		da_monitor_destroy_kmalloc();
 }
 
-/*
- * Allow the per-object monitors to run allocation manually, necessary if the
- * start condition is in a context problematic for allocation (e.g. scheduling).
- * In such case, if the storage was pre-allocated without a target, set it now.
- */
-#ifdef DA_SKIP_AUTO_ALLOC
-#define da_prepare_storage da_fill_empty_storage
-#else
-#define da_prepare_storage da_create_storage
-#endif /* DA_SKIP_AUTO_ALLOC */
-
 #endif /* RV_MON_TYPE */
 
 #if RV_MON_TYPE == RV_MON_GLOBAL || RV_MON_TYPE == RV_MON_PER_CPU
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
index 31f90f3638d8..f089cc0e2f10 100644
--- a/kernel/trace/rv/monitors/nomiss/nomiss.c
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -18,7 +18,7 @@
 #define RV_MON_TYPE RV_MON_PER_OBJ
 #define HA_TIMER_TYPE HA_TIMER_WHEEL
 /* The start condition is on sched_switch, it's dangerous to allocate there */
-#define DA_SKIP_AUTO_ALLOC
+#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL
 typedef struct sched_dl_entity *monitor_target;
 #include "nomiss.h"
 #include <rv/ha_monitor.h>
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
index 90e7035a0b55..486a6bac5cf9 100644
--- a/kernel/trace/rv/monitors/tlob/tlob.c
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -88,8 +88,8 @@ struct tlob_task_state {
 
 #define RV_MON_TYPE RV_MON_PER_OBJ
 #define HA_TIMER_TYPE HA_TIMER_HRTIMER
-/* Pool mode: da_handle_start_event uses da_fill_empty_storage, not kmalloc. */
-#define DA_SKIP_AUTO_ALLOC
+#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
+#define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
 
 /* Type for da_monitor_storage.target; must be defined before the includes. */
 typedef struct tlob_task_state *monitor_target;
@@ -428,7 +428,6 @@ int tlob_start_task(struct task_struct *task, u64 threshold_us)
 	struct da_monitor *da_mon;
 	struct ha_monitor *ha_mon;
 	u64 now_ns;
-	int ret;
 
 	if (!da_monitor_enabled())
 		return -ENODEV;
@@ -457,14 +456,6 @@ int tlob_start_task(struct task_struct *task, u64 threshold_us)
 	ws->last_ts = ktime_get();
 	raw_spin_lock_init(&ws->entry_lock);
 
-	/* Claim a pool slot (no kmalloc; DA_SKIP_AUTO_ALLOC + prealloc). */
-	ret = da_create_or_get(task->pid, ws);
-	if (ret) {
-		put_task_struct(task);
-		kmem_cache_free(tlob_state_cache, ws);
-		return ret;
-	}
-
 	atomic_inc(&tlob_num_monitored);
 
 	/* Hold RCU across handle + timer setup to keep da_mon valid. */
@@ -955,7 +946,7 @@ static int __tlob_init_monitor(void)
 
 	atomic_set(&tlob_num_monitored, 0);
 
-	retval = da_monitor_init_prealloc(TLOB_MAX_MONITORED);
+	retval = da_monitor_init();
 	if (retval) {
 		kmem_cache_destroy(tlob_state_cache);
 		tlob_state_cache = NULL;

-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-05-15  7:03 UTC (permalink / raw)
  To: leitao
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <agXcPleVC9LGVCmj@gmail.com>


On Thu, May 14, 2026 at 07:37:14AM -0700, Breno Leitao wrote:
>On Thu, May 14, 2026 at 09:28:30PM +0800, Lance Yang wrote:
>> 
>> On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
>> >get_any_page() collapses three different failure modes into a single
>> >-EIO return:
>> >
>> >  * the put_page race in the !count_increased path;
>> >  * the HWPoisonHandlable() rejection that bounces out of
>> >    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
>> >  * the HWPoisonHandlable() rejection that goes through the
>> >    count_increased / put_page / shake_page retry loop.
>> >
>> >The first is transient (the page is racing with the allocator).  The
>> >second can be either transient (a userspace folio briefly off LRU
>> >during migration/compaction) or stable (slab/vmalloc/page-table/
>> >kernel-stack pages).  The third describes a stable kernel-owned page
>> >that the count_increased=true caller already held a reference on.
>> >
>> >Distinguish them on the return path: keep -EIO for both the put_page
>> >race and the -EBUSY-after-retries branch (shake_page() cannot drag a
>> >folio back from active migration, so we cannot prove the page is
>> >permanently kernel-owned from there), keep -EBUSY for the allocation
>> >race (unchanged), and return -ENOTRECOVERABLE only from the
>> >count_increased-true HWPoisonHandlable() rejection that exhausts its
>> >retries -- the caller's reference is structural evidence that the
>> >page is owned by the kernel.
>> >
>> >Extend the unhandlable-page pr_err() to fire for either errno and
>> >update the get_hwpoison_page() kerneldoc.
>> >
>> >memory_failure() still folds every negative return into
>> >MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>> >this patch is a no-op for users of memory_failure() and only changes
>> >the errno that soft_offline_page() can propagate to its callers.  A
>> >follow-up wires the new return code through memory_failure() and
>> >reports MF_MSG_KERNEL for the unrecoverable cases.
>> >
>> >Suggested-by: David Hildenbrand <david@kernel.org>
>> >Signed-off-by: Breno Leitao <leitao@debian.org>
>> >---
>> > mm/memory-failure.c | 18 +++++++++++++++---
>> > 1 file changed, 15 insertions(+), 3 deletions(-)
>> >
>> >diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> >index 49bcfbd04d213..bae883df3ccb2 100644
>> >--- a/mm/memory-failure.c
>> >+++ b/mm/memory-failure.c
>> >@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
>> > 				shake_page(p);
>> > 				goto try_again;
>> > 			}
>> >+			/*
>> >+			 * Return -EIO rather than -ENOTRECOVERABLE: this
>> >+			 * branch is also reached for pages that are merely
>> >+			 * off-LRU transiently (e.g. a folio in the middle
>> >+			 * of migration or compaction), which shake_page()
>> >+			 * cannot drag back.  The caller cannot prove the
>> >+			 * page is permanently kernel-owned from here, so
>> >+			 * keep it on the recoverable errno.
>> >+			 */
>> > 			ret = -EIO;
>> > 			goto out;
>> > 		}
>> >@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
>> > 			goto try_again;
>> > 		}
>> > 		put_page(p);
>> >-		ret = -EIO;
>> >+		ret = -ENOTRECOVERABLE;
>> > 	}
>> > out:
>> >-	if (ret == -EIO)
>> >+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
>> > 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
>> > 
>> > 	return ret;
>> >@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
>> >  *         -EIO for pages on which we can not handle memory errors,
>> >  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
>> >  *         operations like allocation and free,
>> >- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
>> >+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
>> >+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
>> >+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
>> >+ *         kernel stacks, and similar non-LRU/non-buddy pages).
>> 
>> Did you test this patch series? I don't see how we ever get to
>> -ENOTRECOVERABLE there ...
>
>Yes, I did. I am using the following test case:

Okay.

>https://github.com/leitao/linux/commit/cfebe84ddeab5ac34ed456331db980d57e7025dc
>
>	# RUN_DESTRUCTIVE=1 tools/testing/selftests/mm/hwpoison-panic.sh
>	# enabling /proc/sys/vm/panic_on_unrecoverable_memory_failure
>	# injecting hwpoison at phys 0x2a00000 (Kernel rodata)
>	# expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'
>	[  501.113256] Memory failure: 0x2a00: recovery action for reserved kernel page: Ignored
>	[  501.113956] Kernel panic - not syncing: Memory failure: 0x2a00: unrecoverable page
>
>
>> Even with MF_COUNT_INCREASED, the first pass does:
>> 
>> 	if (flags & MF_COUNT_INCREASED)
>> 		count_increased = true;
>> 
>> 	[...]
>> 
>> 	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
>> 		ret = 1;
>> 	} else {
>> 		if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
>> 			put_page(p);
>> 			shake_page(p);
>> 			count_increased = false;
>> 			goto try_again; <-
>> 		}
>> 		put_page(p);
>> 		ret = -ENOTRECOVERABLE;
>> 	}
>> 
>> Then we come back with count_increased=false:
>> 
>> try_again:
>> 	if (!count_increased) {
>> 		ret = __get_hwpoison_page(p, flags); <-
>> 		if (!ret) {
>> 		[...]
>> 		} else if (ret == -EBUSY) { <-
>> 		[...]
>> 			ret = -EIO;
>> 			goto out; <-
>> 		}
>> 	}
>> 
>> For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:
>> 
>> 	if (!HWPoisonHandlable(&folio->page, flags))
>> 		return -EBUSY;
>> 
>> so they still seem to end up as -EIO ... Am I missing something?
>
>You are not, and thanks for catching this. I traced it again and the
>-ENOTRECOVERABLE branch is unreachable for slab/vmalloc/page-table pages
>exactly as you described. The __get_hwpoison_page() → -EBUSY → shake → retry
>loop catches them first and they exit as -EIO.

Wonder if it would be simpler to just do a positive check near the top
of get_any_page() instead. Something like:

static bool hwpoison_unrecoverable_kernel_page(struct page *page,
						unsigned long flags)
{
	if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
		return false;

	return PageReserved(page) || PageSlab(page) || 
		PageTable(page) || PageLargeKmalloc(page);
}

static int get_any_page(struct page *p, unsigned long flags)
{
	int ret = 0, pass = 0;
	bool count_increased = false;

	if (flags & MF_COUNT_INCREASED)
		count_increased = true;

	if (hwpoison_unrecoverable_kernel_page(p, flags)) {
		if (count_increased)
			put_page(p);
		ret = -ENOTRECOVERABLE;
		goto out;
	}
[...]
}

Then get_any_page() could return -ENOTRECOVERABLE only for page types we
can positively identify as kernel-owned.

These types always fail HWPoisonHandlable(), so retrying does not really
buy us anything for them.

Won't cover everything (vmalloc, kernel stacks, etc. have no page_type
to key off), but that's fine - best effort, right?
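
For reference, the retry flow traced above can be reproduced with a toy
userspace model. This is a simplified sketch, not kernel code: toy_page and
the toy_* helpers are made up for illustration, the page state is assumed
not to change between passes, and the refcount handling plus the !ret
(free page) path are left out.

```c
#include <errno.h>
#include <stdbool.h>

#define GET_PAGE_MAX_RETRY_NUM 3

/* Toy page: "handlable" stands in for the HWPoisonHandlable() result. */
struct toy_page {
	bool handlable;
};

/* Stand-in for __get_hwpoison_page(): -EBUSY for unhandlable pages. */
static int toy_get_hwpoison_page(const struct toy_page *p)
{
	return p->handlable ? 1 : -EBUSY;
}

/* Simplified control flow of get_any_page() with this patch applied. */
static int toy_get_any_page(const struct toy_page *p, bool count_increased)
{
	int pass = 0;

try_again:
	if (!count_increased) {
		if (toy_get_hwpoison_page(p) == -EBUSY) {
			if (pass++ < GET_PAGE_MAX_RETRY_NUM)
				goto try_again;
			return -EIO;	/* unhandlable pages leave here */
		}
	}

	if (p->handlable)
		return 1;

	if (pass++ < GET_PAGE_MAX_RETRY_NUM) {
		count_increased = false;	/* reference dropped, retry */
		goto try_again;
	}
	return -ENOTRECOVERABLE;
}
```

In this model an unhandlable page ends up with -EIO whether or not the
caller passed MF_COUNT_INCREASED, since the first pass through the second
check always clears count_increased and retries.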

Cheers, Lance

>
>The selftest I am using (link above) only validated the PageReserved
>short-circuit added in patch 3, which lives in memory_failure() and never
>reaches get_any_page().
>
>I even thought about this code path, and I was not convinced we should return
>-ENOTRECOVERABLE, thus I documented the following (as in this current patch)
>
>	@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
>			shake_page(p);
>			goto try_again;
>		}
>	+            /*
>	+             * Return -EIO rather than -ENOTRECOVERABLE: this
>	+             * branch is also reached for pages that are merely
>	+             * off-LRU transiently (e.g. a folio in the middle
>	+             * of migration or compaction), which shake_page()
>	+             * cannot drag back.  The caller cannot prove the
>	+             * page is permanently kernel-owned from here, so
>	+             * keep it on the recoverable errno.
>	+             */
>		ret = -EIO;
>

^ permalink raw reply

* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-05-15  3:04 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260513-ecc_panic-v7-2-be2e578e61da@debian.org>

On 2026/5/13 23:39, Breno Leitao wrote:
> get_any_page() collapses three different failure modes into a single
> -EIO return:
> 
>   * the put_page race in the !count_increased path;
>   * the HWPoisonHandlable() rejection that bounces out of
>     __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
>   * the HWPoisonHandlable() rejection that goes through the
>     count_increased / put_page / shake_page retry loop.
> 
> The first is transient (the page is racing with the allocator).  The
> second can be either transient (a userspace folio briefly off LRU
> during migration/compaction) or stable (slab/vmalloc/page-table/
> kernel-stack pages).  The third describes a stable kernel-owned page
> that the count_increased=true caller already held a reference on.
> 
> Distinguish them on the return path: keep -EIO for both the put_page
> race and the -EBUSY-after-retries branch (shake_page() cannot drag a
> folio back from active migration, so we cannot prove the page is
> permanently kernel-owned from there), keep -EBUSY for the allocation
> race (unchanged), and return -ENOTRECOVERABLE only from the
> count_increased-true HWPoisonHandlable() rejection that exhausts its
> retries -- the caller's reference is structural evidence that the
> page is owned by the kernel.
> 
> Extend the unhandlable-page pr_err() to fire for either errno and
> update the get_hwpoison_page() kerneldoc.
> 
> memory_failure() still folds every negative return into
> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> this patch is a no-op for users of memory_failure() and only changes
> the errno that soft_offline_page() can propagate to its callers.  A
> follow-up wires the new return code through memory_failure() and
> reports MF_MSG_KERNEL for the unrecoverable cases.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 49bcfbd04d213..bae883df3ccb2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
>  				shake_page(p);
>  				goto try_again;
>  			}
> +			/*
> +			 * Return -EIO rather than -ENOTRECOVERABLE: this
> +			 * branch is also reached for pages that are merely
> +			 * off-LRU transiently (e.g. a folio in the middle
> +			 * of migration or compaction), which shake_page()
> +			 * cannot drag back.  The caller cannot prove the
> +			 * page is permanently kernel-owned from here, so
> +			 * keep it on the recoverable errno.
> +			 */
>  			ret = -EIO;
>  			goto out;
>  		}
> @@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
>  			goto try_again;
>  		}
>  		put_page(p);
> -		ret = -EIO;
> +		ret = -ENOTRECOVERABLE;

Theoretically, pages that are merely off-LRU transiently, as you commented
above, could reach here too? Or am I missing something?

Thanks.
.

>  	}
>  out:
> -	if (ret == -EIO)
> +	if (ret == -EIO || ret == -ENOTRECOVERABLE)
>  		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
>  
>  	return ret;
> @@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
>   *         -EIO for pages on which we can not handle memory errors,
>   *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
>   *         operations like allocation and free,
> - *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
> + *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
> + *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
> + *         cannot recover (PG_reserved, slab, vmalloc, page tables,
> + *         kernel stacks, and similar non-LRU/non-buddy pages).
>   */
>  static int get_hwpoison_page(struct page *p, unsigned long flags)
>  {
> 


^ permalink raw reply

* Re: [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Miaohe Lin @ 2026-05-15  2:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260513-ecc_panic-v7-1-be2e578e61da@debian.org>

On 2026/5/13 23:39, Breno Leitao wrote:
> The first entry of error_states[],
> 
> 	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
> 
> is unreachable.  identify_page_state() has two callers, and neither
> one can dispatch a PG_reserved page to me_kernel():
> 
>   * memory_failure() reaches identify_page_state() only after
>     get_hwpoison_page() returned 1.  get_any_page() reaches that
>     return only via __get_hwpoison_page(), which gates the refcount
>     on HWPoisonHandlable().  HWPoisonHandlable() rejects PG_reserved
>     pages, so they fail with -EBUSY/-EIO long before
>     identify_page_state() runs.
> 
>   * try_memory_failure_hugetlb() reaches identify_page_state() on
>     the MF_HUGETLB_IN_USED branch, but the page is necessarily a
>     hugetlb folio there.  The first table entry that matches a
>     hugetlb folio is { head, head, MF_MSG_HUGE, me_huge_page }, so
>     they dispatch to me_huge_page() before the (now-removed)
>     reserved entry would have matched, regardless of whether
>     PG_reserved happens to be set on the head page.
> 
> me_kernel() never executes and the entry exists only to be matched
> against by code that cannot see it.
> 
> Drop the entry, the me_kernel() helper, and the now-unused
> "reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
> remains part of the tracepoint and pr_err() string tables, and
> follow-on work to classify unrecoverable kernel pages can reuse it
> without churning the user-visible enum.
> 
> No functional change.

As the code has evolved, this entry is no longer needed. Thanks for the cleanup.

> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>

With David's comments addressed, this patch looks good to me:

Acked-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.
.


^ permalink raw reply

* [RFC PATCH v2.2 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-15  0:44 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260515004433.128933-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  9 +++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 24fc402ab3c85..ec1e317923fd3 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT_CONDITION(damon_aggregated_v2,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_CONDITION(nr_probes > 0),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 1c9d2fb69f98d..ab8ac9ec8450d 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1881,6 +1881,13 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	if (trace_damon_aggregated_v2_enabled()) {
+		damon_for_each_probe(probe, c)
+			nr_probes++;
+	}
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1889,6 +1896,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_aggregated_v2(ti, r, damon_nr_regions(t),
+					nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2.2 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-15  0:44 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extend DAMON
to multiple access event capture primitives (e.g., page faults and PMU)
and eventually pivot DAMON to a "Data Attributes Monitoring and
Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But the monitoring part is still only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  In particular, users want to see the access pattern
information together with the types of the memory.  For example, users
who work on using huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.

To meet this demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can specify the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be always enabled on all machines running real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines in their fleet.  If the runtime is long enough
and the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes, in addition to data
accesses, they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  A 'data probe'
should be applicable to any memory, and determine if the given memory
has the corresponding data attribute, e.g., whether the memory at
physical address 42 belongs to cgroup A.  Each 'data probe' is
configured with filters that are very similar to the DAMOS filters.

When DAMON checks if the memory at the sampling address of each region
has been accessed since the last check, it also applies data probes if
any are registered.  Similar to how it counts access check-positive
samples (nr_accesses), it counts the probe-positive samples of each
data probe in another per-region counter array, namely 'probe_hits'.
When DAMON resets nr_accesses every aggregation interval, it resets
'probe_hits' together.
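
The accounting described above can be sketched in plain user-space C as
below.  This is only an illustration under assumptions, not the code of
this series: region_sketch, probe_fn, account_sample(),
reset_aggregated() and the two example probes are hypothetical names,
and the real damon_region and damon_probe layouts may differ.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_PROBES 4	/* this version limits data probes to four */

/* Hypothetical, simplified stand-in for a DAMON region. */
struct region_sketch {
	unsigned long start, end;
	unsigned int nr_accesses;		/* access-positive samples */
	unsigned char probe_hits[MAX_PROBES];	/* probe-positive samples */
};

/* Hypothetical probe: returns true if the memory at addr has the attribute. */
typedef bool (*probe_fn)(unsigned long addr);

/* Example probes, purely for illustration. */
static bool probe_below_4k(unsigned long addr) { return addr < 4096; }
static bool probe_never(unsigned long addr) { (void)addr; return false; }

/* Account one sample of the region, taken at sampling address addr. */
static void account_sample(struct region_sketch *r, unsigned long addr,
		bool accessed, const probe_fn *probes, unsigned int nr_probes)
{
	if (accessed)
		r->nr_accesses++;
	/* Probes are applied on every sample, accessed or not. */
	for (unsigned int i = 0; i < nr_probes && i < MAX_PROBES; i++) {
		if (probes[i](addr))
			r->probe_hits[i]++;
	}
}

/* End of an aggregation interval: reset both kinds of counters together. */
static void reset_aggregated(struct region_sketch *r)
{
	r->nr_accesses = 0;
	memset(r->probe_hits, 0, sizeof(r->probe_hits));
}
```

The point of the sketch is that the probes reuse the sample that the
access check already takes, so no extra sampling pass is needed.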

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how much of the hot/cold memory regions has the data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A, and 80 percent of the hot memory belonging
to cgroup A is backed by huge pages.
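
As a rough sketch of how user space might turn such readings into a
percentage, the function below weights each hot region's size by the
fraction of its samples that hit a probe.  Everything here is an
assumption for illustration: region_snapshot, hot_attr_percent(), the
hotness threshold and the samples-per-aggregation parameter are
hypothetical, and the estimate is only as good as the sampling.

```c
#include <stddef.h>

/* Hypothetical snapshot of one region, read before the counters reset. */
struct region_snapshot {
	unsigned long start, end;
	unsigned int nr_accesses;	/* access-positive samples */
	unsigned int probe_hits;	/* probe-positive samples for one probe */
};

/*
 * Estimate what percentage of hot memory has the probed attribute.
 * A region is treated as hot if nr_accesses >= hot_threshold.  Each hot
 * region contributes size * probe_hits / samples_per_aggr estimated bytes.
 */
static double hot_attr_percent(const struct region_snapshot *regions,
		size_t nr_regions, unsigned int hot_threshold,
		unsigned int samples_per_aggr)
{
	double hot_bytes = 0, attr_bytes = 0;

	if (!samples_per_aggr)
		return 0.0;
	for (size_t i = 0; i < nr_regions; i++) {
		double sz = (double)(regions[i].end - regions[i].start);

		if (regions[i].nr_accesses < hot_threshold)
			continue;
		hot_bytes += sz;
		attr_bytes += sz * regions[i].probe_hits / samples_per_aggr;
	}
	return hot_bytes > 0 ? 100.0 * attr_bytes / hot_bytes : 0.0;
}
```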

Patches Sequence
================

The first eight patches implement the core feature, interface and the
working support.  Patch 1 introduces the data probe data structure,
namely damon_probe.  Patch 2 extends damon_ctx for installing data
probes.  Patch 3 introduces another data structure for filters of each
data probe, namely damon_filter.  Patch 4 updates the damon_ctx commit
function to handle the probes.  Patch 5 extends damon_region for the
per-region per-probe positive samples counter, namely probe_hits.
Patch 6 extends damon_operations for applying probes on the underlying
DAMON operations implementation.  Patch 7 updates kdamond_fn() to invoke
the probes applying callback.  Patch 8 finally implements the probes
support on paddr ops.

Ten changes for the user interface (patches 9-18) come next.  Patches
9-13 implement sysfs directories and files for setting data probes,
namely the probes directory, probe directory, filters directory, filter
directory and filter directory internal files, respectively.  Patch 14
connects the user inputs that are made via the sysfs files to DAMON
core.  The following three patches (patches 15-17) implement sysfs
directories and files for showing the probe_hits to users, namely the
probes directory, probe directory and hits files, respectively.  Patch
18 introduces a new tracepoint for showing the probe_hits via tracefs.
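
For a user-space consumer of the new tracepoint, a parser could look
roughly like the sketch below.  It follows the TP_printk() format of the
damon_aggregated_v2 tracepoint in this series ("target_id=%lu
nr_regions=%u %lu-%lu: %u %u probe_hits=%s").  It assumes that
__print_hex() renders the array as space-separated two-digit hex bytes,
and that the tracepoint name and format stay as they are, which may not
hold since a rename is under consideration.  aggr_v2_sample and
parse_aggr_v2() are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PROBES 4	/* this version limits data probes to four */

/* Hypothetical holder for one parsed damon_aggregated_v2 event. */
struct aggr_v2_sample {
	unsigned long target_id, start, end;
	unsigned int nr_regions, nr_accesses, age;
	unsigned int nr_probes;
	unsigned char probe_hits[MAX_PROBES];
};

/* Parse one trace line.  Returns 0 on success, -1 on a format mismatch. */
static int parse_aggr_v2(const char *line, struct aggr_v2_sample *out)
{
	/* Skip the task/CPU/timestamp prefix that tracefs puts in front. */
	const char *body = strstr(line, "target_id=");
	int consumed = -1;
	const char *p;

	if (!body)
		return -1;
	if (sscanf(body,
		   "target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%n",
		   &out->target_id, &out->nr_regions, &out->start, &out->end,
		   &out->nr_accesses, &out->age, &consumed) != 6 ||
	    consumed < 0)
		return -1;

	/* Assumed layout: hex bytes separated by spaces, e.g. "02 00 1f". */
	out->nr_probes = 0;
	p = body + consumed;
	while (out->nr_probes < MAX_PROBES) {
		char *endp;
		unsigned long v = strtoul(p, &endp, 16);

		if (endp == p)
			break;
		out->probe_hits[out->nr_probes++] = (unsigned char)v;
		p = endp;
	}
	return 0;
}
```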

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be split
into another series in the future.  Patch 22 defines the DAMON filter
type for the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23
adds the support on paddr ops.  Patch 24 updates the sysfs interface for
setup of the target memcg.  Patch 25 moves code for easy reuse of the
filter target memcg setup.  Patch 26 connects the user input to the core
layer.  Finally, patches 27 and 28 update the design and usage documents
for the memcg attribute monitoring support.

Discussions
===========

This allows page properties monitoring with overhead that is low
enough to be always enabled on real world workloads.  Because the
sampling time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimal.

DAMOS-based page level properties monitoring should still be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful for debugging and
tuning.

Plan for Dropping RFC tag
=========================

I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in the
future.  I personally think it should be enough for common use cases,
though, and am therefore not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPU/thread/read/write monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of that work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of the data attributes, there is no
reason to treat data access as that special.  There might be some users
who are not interested in access at all but want to know the location of
memory of a specific type.  The data probes interface will allow doing
that.  Further, we could extend the interface to let users set any data
attribute as the 'primary' attribute.  Then, DAMON will split and merge
regions in a way that can best-present the 'primary' attribute.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v2.1
- rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
- Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.
Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  48 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  67 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 197 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 221 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1284 insertions(+), 50 deletions(-)

base-commit: 5d6919055dec134de3c40167a490f33c74c12581
-- 
2.47.3


* Re: [RFC PATCH v2.1 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-15  0:41 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>

On Thu, 14 May 2026 07:08:33 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In long term, this will help extending DAMON
> for multiple access events capture primitives (e.g., page faults and
> PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
> Operations eNgine" in long term.

Sashiko failed [1] to review this due to a problem finding a fresh baseline
commit.  I will shortly post the next version (RFC v2.2) after rebasing to
mm-stable (7.1-rc3) to avoid the issue.

[1] https://lore.kernel.org/20260514205555.51653-1-sj@kernel.org


Thanks,
SJ

[...]


* Re: [PATCH] sched/clock: Provide !HAVE_UNSTABLE_SCHED_CLOCK stub for sched_clock_stable()
From: Steven Rostedt @ 2026-05-14 19:36 UTC (permalink / raw)
  To: Yiyang Chen
  Cc: peterz, mingo, vincent.guittot, mhiramat, mathieu.desnoyers,
	linux-kernel, linux-trace-kernel
In-Reply-To: <56e45338858946cd9581b75c8bd45dd37dba52c5.1778773587.git.cyyzero16@gmail.com>

On Fri, 15 May 2026 00:05:05 +0800
Yiyang Chen <cyyzero16@gmail.com> wrote:

> When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is disabled, sched_clock() is
> already assumed to provide stable semantics, but the public header
> doesn't provide a sched_clock_stable() stub for that case.
> 
> Add a header stub that always returns true and clean up the duplicate
> local stub in ring_buffer.c, so callers can use sched_clock_stable()
> unconditionally.
> 
> Signed-off-by: Yiyang Chen <cyyzero16@gmail.com>
> ---
>  include/linux/sched/clock.h | 5 +++++
>  kernel/trace/ring_buffer.c  | 7 -------
>  2 files changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
> index 196f0ca351a2..39f0a7f94bfc 100644
> --- a/include/linux/sched/clock.h
> +++ b/include/linux/sched/clock.h
> @@ -33,6 +33,11 @@ extern u64 sched_clock_cpu(int cpu);
>  extern void sched_clock_init(void);
>  
>  #ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> +static inline int sched_clock_stable(void)
> +{
> +	return 1;
> +}
> +
>  static inline void sched_clock_tick(void)
>  {
>  }
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5326924615a4..02691c3c6dd6 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3769,13 +3769,6 @@ rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
>  	return skip_time_extend(event);
>  }
>  
> -#ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> -static inline bool sched_clock_stable(void)
> -{
> -	return true;
> -}
> -#endif
> -
>  static void
>  rb_check_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
>  		   struct rb_event_info *info)


Thanks for moving this out of the ring buffer code.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


* Re: [PATCH v2] tracing: Allow perf to read synthetic events
From: Steven Rostedt @ 2026-05-14 18:47 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Arnaldo Carvalho de Melo,
	Jiri Olsa, Namhyung Kim, Peter Zijlstra, Ian Rogers, sashiko
In-Reply-To: <20260513150007.3b280e87@gandalf.local.home>


Replying to Sashiko's comments:

  https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260513150007.3b280e87%40gandalf.local.home

> commit b3956023bf5f1ccf5838187c1c14f3dcf5e37c3d
> Author: Steven Rostedt <rostedt@goodmis.org>
> tracing: Allow perf to read synthetic events
> This commit adds support for perf to read synthetic events. It extracts the
> field sizing and entry writing logic into separate helper functions, and
> implements a new probe function to attach perf events.
> > diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
> > index 39ac4eba07024..e6871230bde96 100644
> > --- a/kernel/trace/trace_events_synth.c
> > +++ b/kernel/trace/trace_events_synth.c
> > @@ -499,28 +499,19 @@ static unsigned int trace_stack(struct synth_trace_event *entry,
> [ ... ]
> > +static __always_inline void write_synth_entry(struct synth_event *event,
> > +					      struct synth_trace_event *entry,
> > +					      u64 *var_ref_vals,
> > +					      unsigned int *var_ref_idx)
> > +{
> > +	int data_size = 0;
> Since get_field_size() and write_synth_entry() are being refactored here, is
> there a risk of a time-of-check to time-of-use buffer overflow with
> dynamic string fields?
> get_field_size() computes the string length for allocation by calling
> fetch_store_strlen(). The ring buffer event slot is then allocated based on
> this length.
> write_synth_entry() populates the buffer by calling trace_string(), which
> calls fetch_store_strlen() a second time to recalculate the string's length,
> assigning it to data->as_dynamic.len.
> If a string from user space grows between the two fetch_store_strlen() calls,
> fetch_store_string() will blindly copy more bytes than the allocated buffer
> space. Can this cause out-of-bounds writes and corrupt kernel memory?
> This regression wasn't introduced by this patch, but pre-existed in the
> ftrace implementation.

The string cannot grow between the two calls. When a dynamic event is
enabled (as it is required to be here) deleting or modifying it will be
denied via an -EBUSY return. That's the only way I'm aware of that can have
user space change the size of the string.

> [ ... ]
> > +#ifdef CONFIG_PERF_EVENTS
> > +static void perf_event_raw_event_synth(void *__data,
> > +				       u64 *var_ref_vals,
> > +				       unsigned int *var_ref_idx)
> > +{
> [ ... ]
> > +	size = ALIGN(sizeof(*entry) + fields_size, 8);
> > +
> > +	entry = perf_trace_buf_alloc(size, &regs, &context);
> > +
> > +	if (unlikely(!entry))
> > +		return;
> > +
> > +	write_synth_entry(event, entry, var_ref_vals, var_ref_idx);
> > +
> > +	perf_fetch_caller_regs(regs);
> > +
> > +	perf_trace_buf_submit(entry, size, context,
> > +			      call->event.type, 1, regs,
> > +			      perf_head, NULL);
> > +}
> > +#endif
> Could this leak uninitialized per-CPU kernel memory to userspace?
> perf_trace_buf_alloc() and trace_event_buffer_reserve() allocate memory from
> per-CPU buffers without zero-initializing it (except for trailing alignment
> padding).

Both of these per-CPU memory locations are zeroed when created. They are
only populated with trace data. Yeah, it may leak previous trace data, but
that data should also be visible for whoever created it in the first place.

-- Steve


> write_synth_entry() populates an array of 8-byte union trace_synth_field.
> When a field is smaller than 8 bytes (e.g., as_u8), only those specific bytes
> are written, leaving the remaining padding bytes uninitialized. Similarly, for
> empty stacktraces, get_field_size() reserves 8 bytes, but trace_stack()
> terminates immediately and writes 0 bytes, leaving the 8-byte gap completely
> uninitialized.
> perf_trace_buf_submit() copies the entire requested buffer size to the perf
> ring buffer. Can userspace read this buffer, leaking uninitialized kernel
> memory from previous events or kernel operations?
> This regression also wasn't introduced by this patch, but pre-existed in the
> ftrace implementation.



* Re: [PATCH v2] tracing: Allow perf to read synthetic events
From: Namhyung Kim @ 2026-05-14 18:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ian Rogers
In-Reply-To: <20260513150007.3b280e87@gandalf.local.home>

On Wed, May 13, 2026 at 03:00:07PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Currently, perf can not enable synthetic events. When it does, it either
> causes a warning in the kernel or errors with "no such device".
> 
> Add the necessary code to allow perf to also attach to synthetic events.
> 
> Reported-by: Ian Rogers <irogers@google.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
> Changes since v1: https://patch.msgid.link/20251217113920.50b56246@gandalf.local.home
> 
> - Forward ported to v7.1-rc2
> 
>  kernel/trace/trace_events_synth.c | 121 +++++++++++++++++++++++-------
>  1 file changed, 94 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
> index 39ac4eba0702..e6871230bde9 100644
> --- a/kernel/trace/trace_events_synth.c
> +++ b/kernel/trace/trace_events_synth.c
> @@ -499,28 +499,19 @@ static unsigned int trace_stack(struct synth_trace_event *entry,
>  	return len;
>  }
>  
> -static void trace_event_raw_event_synth(void *__data,
> -					u64 *var_ref_vals,
> -					unsigned int *var_ref_idx)
> +static __always_inline int get_field_size(struct synth_event *event,
> +					  u64 *var_ref_vals,
> +					  unsigned int *var_ref_idx)
>  {
> -	unsigned int i, n_u64, val_idx, len, data_size = 0;
> -	struct trace_event_file *trace_file = __data;
> -	struct synth_trace_event *entry;
> -	struct trace_event_buffer fbuffer;
> -	struct trace_buffer *buffer;
> -	struct synth_event *event;
> -	int fields_size = 0;
> -
> -	event = trace_file->event_call->data;
> -
> -	if (trace_trigger_soft_disabled(trace_file))
> -		return;
> +	int fields_size;
>  
>  	fields_size = event->n_u64 * sizeof(u64);
>  
> -	for (i = 0; i < event->n_dynamic_fields; i++) {
> +	for (int i = 0; i < event->n_dynamic_fields; i++) {
>  		unsigned int field_pos = event->dynamic_fields[i]->field_pos;
>  		char *str_val;
> +		int val_idx;
> +		int len;
>  
>  		val_idx = var_ref_idx[field_pos];
>  		str_val = (char *)(long)var_ref_vals[val_idx];
> @@ -535,18 +526,18 @@ static void trace_event_raw_event_synth(void *__data,
>  
>  		fields_size += len;
>  	}
> +	return fields_size;
> +}
>  
> -	/*
> -	 * Avoid ring buffer recursion detection, as this event
> -	 * is being performed within another event.
> -	 */
> -	buffer = trace_file->tr->array_buffer.buffer;
> -	guard(ring_buffer_nest)(buffer);
> -
> -	entry = trace_event_buffer_reserve(&fbuffer, trace_file,
> -					   sizeof(*entry) + fields_size);
> -	if (!entry)
> -		return;
> +static __always_inline void write_synth_entry(struct synth_event *event,
> +					      struct synth_trace_event *entry,
> +					      u64 *var_ref_vals,
> +					      unsigned int *var_ref_idx)
> +{
> +	int data_size = 0;
> +	int i, n_u64;
> +	int val_idx;
> +	int len;
>  
>  	for (i = 0, n_u64 = 0; i < event->n_fields; i++) {
>  		val_idx = var_ref_idx[i];
> @@ -587,10 +578,83 @@ static void trace_event_raw_event_synth(void *__data,
>  			n_u64++;
>  		}
>  	}
> +}
> +
> +static void trace_event_raw_event_synth(void *__data,
> +					u64 *var_ref_vals,
> +					unsigned int *var_ref_idx)
> +{
> +	struct trace_event_file *trace_file = __data;
> +	struct synth_trace_event *entry;
> +	struct trace_event_buffer fbuffer;
> +	struct trace_buffer *buffer;
> +	struct synth_event *event;
> +	int fields_size;
> +
> +	event = trace_file->event_call->data;
> +
> +	if (trace_trigger_soft_disabled(trace_file))
> +		return;
> +
> +	fields_size = get_field_size(event, var_ref_vals, var_ref_idx);
> +
> +	/*
> +	 * Avoid ring buffer recursion detection, as this event
> +	 * is being performed within another event.
> +	 */
> +	buffer = trace_file->tr->array_buffer.buffer;
> +	guard(ring_buffer_nest)(buffer);
> +
> +	entry = trace_event_buffer_reserve(&fbuffer, trace_file,
> +					   sizeof(*entry) + fields_size);
> +	if (!entry)
> +		return;
> +
> +	write_synth_entry(event, entry, var_ref_vals, var_ref_idx);
>  
>  	trace_event_buffer_commit(&fbuffer);
>  }
>  
> +#ifdef CONFIG_PERF_EVENTS
> +static void perf_event_raw_event_synth(void *__data,
> +				       u64 *var_ref_vals,
> +				       unsigned int *var_ref_idx)
> +{
> +	struct trace_event_call *call = __data;
> +	struct synth_trace_event *entry;
> +	struct hlist_head *perf_head;
> +	struct synth_event *event;
> +	struct pt_regs *regs;
> +	int fields_size;
> +	size_t size;
> +	int context;
> +
> +	event = call->data;
> +
> +	perf_head = this_cpu_ptr(call->perf_events);
> +
> +	if (!perf_head || hlist_empty(perf_head))
> +		return;
> +
> +	fields_size = get_field_size(event, var_ref_vals, var_ref_idx);
> +
> +	size = ALIGN(sizeof(*entry) + fields_size, 8);
> +
> +	entry = perf_trace_buf_alloc(size, &regs, &context);
> +
> +	if (unlikely(!entry))
> +		return;
> +
> +	write_synth_entry(event, entry, var_ref_vals, var_ref_idx);
> +
> +	perf_fetch_caller_regs(regs);
> +
> +	perf_trace_buf_submit(entry, size, context,
> +			      call->event.type, 1, regs,
> +			      perf_head, NULL);
> +}
> +#endif
> +
>  static void free_synth_event_print_fmt(struct trace_event_call *call)
>  {
>  	if (call) {
> @@ -917,6 +981,9 @@ static int register_synth_event(struct synth_event *event)
>  	call->flags = TRACE_EVENT_FL_TRACEPOINT;
>  	call->class->reg = synth_event_reg;
>  	call->class->probe = trace_event_raw_event_synth;
> +#ifdef CONFIG_PERF_EVENTS
> +	call->class->perf_probe = perf_event_raw_event_synth;
> +#endif
>  	call->data = event;
>  	call->tp = event->tp;
>  
> -- 
> 2.53.0
> 


* Re: [PATCH v2] fprobe: Fix unregister_fprobe() to wait for RCU grace period
From: patchwork-bot+netdevbpf @ 2026-05-14 17:12 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: rostedt, ast, daniel, andrii, jolsa, mathieu.desnoyers,
	linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com>

Hello:

This patch was applied to netdev/net.git (main)
by Masami Hiramatsu (Google) <mhiramat@kernel.org>:

On Thu,  7 May 2026 16:46:29 +0900 you wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer")
> changed fprobe to register struct fprobe to an rcu-hlist, but it forgot
> to wait for RCU GP. Thus there can be use-after-free if the fprobe is
> released right after unregistering. This can be happened on fprobe
> event and sample module code.
> 
> [...]

Here is the summary with links:
  - [v2] fprobe: Fix unregister_fprobe() to wait for RCU grace period
    https://git.kernel.org/netdev/net/c/657b594b2084

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jakub Sitnicki @ 2026-05-14 16:54 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-2-jolsa@kernel.org>

On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote:
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
> 
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
> 
>   lea -0x80(%rsp), %rsp
>   call tramp
> 
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.
> 
> The optimized uprobe performance stays the same:
> 
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>   -->   uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>   -->   uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>   -->   usdt-nop10     :    7.256 ± 0.023M/s
> 
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
>  1 file changed, 86 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..f7c4101a4039 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -636,9 +636,21 @@ struct uprobe_trampoline {
>  	unsigned long		vaddr;
>  };
>  
> +#define LEA_INSN_SIZE		5
> +#define OPT_INSN_SIZE		(LEA_INSN_SIZE + CALL_INSN_SIZE)
> +#define OPT_JMP8_OFFSET		(OPT_INSN_SIZE - JMP8_INSN_SIZE)
> +#define REDZONE_SIZE		0x80
> +
> +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> +
> +static bool is_lea_insn(const uprobe_opcode_t *insn)
> +{
> +	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> +}
> +

Just a thought. See if below maybe reads better when plugged in.
is_call_insn can then be removed, I think.

static bool is_call_past_redzone_insns(const uprobe_opcode_t *insn)
{
	static const u8 lea_rsp_call[] = {
		0x48, 0x8d, 0x64, 0x24, REDZONE_SIZE, /* lea -0x80(%rsp), %rsp */
		CALL_INSN_OPCODE
	};

	return !memcmp(insn, lea_rsp_call, ARRAY_SIZE(lea_rsp_call));
}
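
For anyone who wants to check the byte pattern outside the kernel, a
user-space sketch of the same helper could look like below.  It is only
an illustration under assumptions: uint8_t stands in for
uprobe_opcode_t, ARRAY_SIZE() is defined locally, and CALL_INSN_OPCODE
is assumed to be 0xE8 (the x86 near call opcode).  The fifth byte is the
disp8 encoding of -0x80, which happens to have the same value as
REDZONE_SIZE.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define REDZONE_SIZE		0x80
#define CALL_INSN_OPCODE	0xE8	/* assumed: x86 near call, rel32 */
#define ARRAY_SIZE(a)		(sizeof(a) / sizeof((a)[0]))

/* Match "lea -0x80(%rsp), %rsp" immediately followed by a call opcode. */
static bool is_call_past_redzone_insns(const uint8_t *insn)
{
	static const uint8_t lea_rsp_call[] = {
		0x48, 0x8d, 0x64, 0x24, REDZONE_SIZE, /* lea -0x80(%rsp), %rsp */
		CALL_INSN_OPCODE
	};

	/* Only the first six bytes are compared; the call target is ignored. */
	return !memcmp(insn, lea_rsp_call, ARRAY_SIZE(lea_rsp_call));
}
```

The caller has to make sure at least six bytes are readable at insn,
since the comparison does not know the length of the buffer.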

[...]

