Linux Trace Kernel


* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Tom Zanussi @ 2026-04-08 17:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Pengpeng Hou, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260408122514.60bbfd61@gandalf.local.home>

On Wed, 2026-04-08 at 12:25 -0400, Steven Rostedt wrote:
> On Wed, 08 Apr 2026 10:58:06 -0500
> Tom Zanussi <zanussi@kernel.org> wrote:
> 
> Hi Tom,
> 
> > ->system is set when using fully-qualified variable names. For  
> > instance:
> > 
> > echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
> > echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
> > echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> > echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> > 
> > Here, the sched_switch trigger would error out if the unqualified $ts0
> > variables were used instead of the fully-qualified ones because there's
> > no way to distinguish which $ts0 was meant.
> > 
> 
> Yep I see that now. I never had a need to use it before, but I probably
> should implement this in libtracefs to be safe.
> 
> We should definitely add a selftest that tests this. There's one case that
> does use it but it doesn't use multiple ones. We should add a test that
> does so.
> 
> trigger-multi-actions-accept.tc has the system, but it's not needed here.
> 
> We should also have a test to test the output of these lines.

Yeah, definitely. I can try adding this as a test.

Tom


> 
> -- Steve



* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-04-08 16:54 UTC (permalink / raw)
  To: Sean Christopherson, Michael Roth
  Cc: Vishal Annapurve, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <adWidf8UgZeYctr1@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Apr 07, 2026, Michael Roth wrote:
>> On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
>> > On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
>> > >
>> > > > TLDR:
>> > > >
>> > > > + Think of populate ioctls not as KVM touching memory, but platform
>> > > >   handling population.
>> > > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
>> > > > + post_populate is platform-specific code that handles loading into
>> > > >   private destination memory just to support legacy non-in-place
>> > > >   conversion.
>> > > > + Don't complicate populate ioctls by doing conversion just to support
>> > > >   legacy use-cases where platform-specific code has to do copying on
>> > > >   the host.
>> > >
>> > > That's a good point: these are only considerations in the context of
>> > > actually copying from src->dst, but with in-place conversion the
>> > > primary/more-performant approach will be for userspace to initialize
>> > > directly. I.e. if we enforced that, then gmem could rightly ascertain that
>> > > it isn't even writing to private pages via these hooks and any
>> > > manipulation of that memory is purely on the part of the trusted entity
>> > > handling initial encryption/etc.
>> > >
>> > > I understand that we decided to keep the option of allowing separate
>> > > src/dst even with in-place conversion, but it doesn't seem worthwhile if
>> > > that necessarily means we need to glue population+conversion together in
>> > > 1 clumsy interface that needs to handle partial return/error responses to
>> > > userspace (or potentially get stuck forever in the conversion path).
>> >
>> > I think ARM needs userspace to specify separate source and destination
>> > memory ranges for initial population as ARM doesn't support in-place
>> > memory encryption. [1]
>> >
>> > [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
>> >
>> > >
>> > > So I agree with Ackerley's proposal (which I guess is the same as what's
>> > > in this series).
>> > >
>> > > However, 1 other alternative would be to do what was suggested on the
>> > > call, but require userspace to subsequently handle the shared->private
>> > > conversion. I think that would be workable too.
>> >
>> > IIUC, converting memory ranges to private after they are essentially
>> > treated as private by the KVM CC backend will expose the
>> > implementation to the same risk of userspace being able to access
>> > private memory and compromise host safety which guest_memfd was
>> > invented to address.
>>
>> Doh, fair point. Doing conversion as part of the populate call would allow
>> us to use the filemap write-lock to avoid userspace being able to fault
>> in private (as tracked by trusted entity) pages before they are
>> transitioned to private (as tracked by KVM), so it's safer than having
>> userspace drive it.
>>
>> But obviously I still think Ackerley's original proposal has more
>> upsides than the alternatives mentioned so far.
>
> I'm a bit lost.  What exactly is/was Ackerley's original proposal?  If the answer
> is "convert pages from shared=>private when populating via in-place conversion",
> then I agree, because AFAICT, that's the only sane option.

Discussed this at PUCK today 2026-04-08.

The update is that the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl will
now support the PRESERVE flag for TDX and SNP only if the setup for the
VM in question hasn't yet been completed (KVM_TDX_FINALIZE_VM or
KVM_SEV_SNP_LAUNCH_FINISH hasn't completed yet).

The populate flow will be

1a. Get contents to be loaded in guest_memfd (src_addr: NULL) as shared
OR
1b. Provide contents from some other userspace address (src_addr:
    userspace address)

2.  KVM_SET_MEMORY_ATTRIBUTES2(attribute: PRIVATE and flags: PRESERVE)
3.  KVM_SEV_SNP_LAUNCH_UPDATE() or KVM_TDX_INIT_MEM_REGION()
...
4.  KVM_SEV_SNP_LAUNCH_FINISH() or KVM_TDX_FINALIZE_VM()

This applies whether src_addr is some userspace address that is shared
or NULL, so the non-in-place loading flow is not considered legacy. ARM
CCA can still use that flow :)

Other than supporting PRESERVE only if the setup for the VM in question
hasn't yet been completed, KVM's fault path will also not permit faults
if the setup hasn't been completed. (Some exception setup will be used
for TDX to be able to perform the required fault.)
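
As a rough illustration of the two gating rules above (PRESERVE only
before setup completes, guest faults only after), here is a toy model in
plain C. Everything in it is hypothetical: struct vm_model and the helper
names are made up for this sketch and are not KVM structures or APIs, and
the TDX exception mentioned above is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a confidential VM's setup state; not a KVM type. */
struct vm_model {
	bool finalized;	/* true once the finalize/launch-finish step is done */
};

/* PRESERVE on a shared->private conversion is accepted only before finalization. */
static bool preserve_allowed(const struct vm_model *vm)
{
	return !vm->finalized;
}

/* Guest faults are permitted only once setup has completed. */
static bool fault_allowed(const struct vm_model *vm)
{
	return vm->finalized;
}

/* Stand-in for step 4 of the flow (finalize / launch finish). */
static void finalize_vm(struct vm_model *vm)
{
	vm->finalized = true;
}
```

In this model steps 1 to 3 of the flow all run while preserve_allowed()
is true, and finalize_vm() flips both predicates at once.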


* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Steven Rostedt @ 2026-04-08 16:25 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Pengpeng Hou, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <f59d594ff21658db45c58a094edeab0f92ae8345.camel@kernel.org>

On Wed, 08 Apr 2026 10:58:06 -0500
Tom Zanussi <zanussi@kernel.org> wrote:

Hi Tom,

> ->system is set when using fully-qualified variable names. For  
> instance:
> 
> echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
> echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
> echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> 
> Here, the sched_switch trigger would error out if the unqualified $ts0
> variables were used instead of the fully-qualified ones because there's
> no way to distinguish which $ts0 was meant.
> 

Yep I see that now. I never had a need to use it before, but I probably
should implement this in libtracefs to be safe.

We should definitely add a selftest that tests this. There's one case that
does use it but it doesn't use multiple ones. We should add a test that
does so.

trigger-multi-actions-accept.tc has the system, but it's not needed here.

We should also have a test to test the output of these lines.

-- Steve


* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Tom Zanussi @ 2026-04-08 15:58 UTC (permalink / raw)
  To: Steven Rostedt, Pengpeng Hou
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260407210502.102e5d37@gandalf.local.home>

Hi Steve,

On Tue, 2026-04-07 at 21:05 -0400, Steven Rostedt wrote:
> 
> Tom,
> 
> On Wed,  1 Apr 2026 19:22:23 +0800
> Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> 
> > hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
> > qualified variable-reference names, but it currently appends into that
> > buffer with strcat() without rebuilding it first. As a result, repeated
> > calls append a new "system.event.field" name onto the previous one,
> > which can eventually run past the end of full_name.
> > 
> > Build the name with snprintf() on each call and return NULL if the fully
> > qualified name does not fit in MAX_FILTER_STR_VAL.
> > 
> > Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
> > Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> > ---
> > Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/
> > 
> > - rebuild full_name on each call instead of falling back to field->name
> > - return NULL on overflow as suggested
> > - split out the snprintf() length check instead of using an inline if
> > 
> >  kernel/trace/trace_events_hist.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> > index 73ea180cad55..f9c8a4f078ea 100644
> > --- a/kernel/trace/trace_events_hist.c
> > +++ b/kernel/trace/trace_events_hist.c
> > @@ -1361,12 +1361,14 @@ static const char *hist_field_name(struct hist_field *field,
> >  		 field->flags & HIST_FIELD_FL_VAR_REF) {
> >  		if (field->system) {
> >  			static char full_name[MAX_FILTER_STR_VAL];
> > +			int len;
> > +
> > +			len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
> > +				       field->system, field->event_name,
> > +				       field->name);
> > +			if (len >= sizeof(full_name))
> > +				return NULL;
> >  
> > -			strcat(full_name, field->system);
> > -			strcat(full_name, ".");
> > -			strcat(full_name, field->event_name);
> > -			strcat(full_name, ".");
> > -			strcat(full_name, field->name);
> >  			field_name = full_name;
> 
> I wanted to test this but I can't find anything that triggers this path.
> How does a field here get its ->system set?
> 

->system is set when using fully-qualified variable names. For
instance:

echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

Here, the sched_switch trigger would error out if the unqualified $ts0
variables were used instead of the fully-qualified ones because there's
no way to distinguish which $ts0 was meant.

Tom



> If there's no way to hit this path, I'd much rather remove it than "fix" it.
> 
> -- Steve
> 
> 
> >  		} else
> >  			field_name = field->name;
> 
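
For readers following along, the difference between the old strcat()
pattern and the snprintf() rebuild in the quoted patch can be sketched in
userspace C. This is only an approximation: the helper names and the
caller-supplied buffer are assumptions made for the demo (the kernel code
uses a static MAX_FILTER_STR_VAL buffer), and the "old" variant is
bounded here so the sketch itself stays memory-safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Old pattern: append to whatever is already in buf. Because the buffer
 * is never reset, a second call glues a new name onto the previous one.
 * The append is bounded so this demo cannot overrun buf.
 */
static void append_name(char *buf, size_t size, const char *system,
			const char *event, const char *field)
{
	size_t used = strlen(buf);

	snprintf(buf + used, size - used, "%s.%s.%s", system, event, field);
}

/*
 * Patched pattern: rebuild the name from scratch on every call and
 * return NULL when "system.event.field" does not fit.
 */
static const char *build_name(char *buf, size_t size, const char *system,
			      const char *event, const char *field)
{
	int len = snprintf(buf, size, "%s.%s.%s", system, event, field);

	if (len < 0 || (size_t)len >= size)
		return NULL;
	return buf;
}
```

Calling append_name() twice on the same buffer yields the doubled string
the commit message describes, while build_name() returns the same clean
name no matter how often it is called.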



* [PATCH 2/2] selftests/ftrace: Check exact trace_marker_raw payload lengths
From: Cao Ruichuang @ 2026-04-08 15:32 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, shuah
  Cc: linux-kernel, linux-trace-kernel, linux-kselftest
In-Reply-To: <20260408153241.15391-1-create0818@163.com>

trace_marker_raw.tc currently depends on awk strtonum() and assumes
that the printed raw-data byte count is rounded up to four bytes.

Now that TRACE_RAW_DATA records keep the true payload length in the
event itself, update the testcase to validate the exact number of bytes
printed for a short sequence of writes. While doing that, make the test
portable to /bin/sh environments that use mawk by replacing strtonum()
and the lscpu endian probe with od-based checks.

Signed-off-by: Cao Ruichuang <create0818@163.com>
---
 .../ftrace/test.d/00basic/trace_marker_raw.tc | 93 ++++++++++++-------
 1 file changed, 59 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
index a2c42e13f..3b37890f8 100644
--- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
+++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
@@ -1,11 +1,11 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0
 # description: Basic tests on writing to trace_marker_raw
-# requires: trace_marker_raw
+# requires: trace_marker_raw od:program
 # flags: instance
 
 is_little_endian() {
-	if lscpu | grep -q 'Little Endian'; then
+	if [ "$(printf '\001\000\000\000' | od -An -tu4 | tr -d '[:space:]')" = "1" ]; then
 		echo 1;
 	else
 		echo 0;
@@ -34,7 +34,7 @@ make_str() {
 
 	data=`printf -- 'X%.0s' $(seq $cnt)`
 
-	printf "${val}${data}"
+	printf "%b%s" "${val}" "${data}"
 }
 
 write_buffer() {
@@ -47,36 +47,68 @@ write_buffer() {
 
 
 test_multiple_writes() {
+	out_file=$TMPDIR/trace_marker_raw.out
+	match_file=$TMPDIR/trace_marker_raw.lines
+	wait_iter=0
+	pause_on_trace=
+
+	if [ -f options/pause-on-trace ]; then
+		pause_on_trace=`cat options/pause-on-trace`
+		echo 0 > options/pause-on-trace
+	fi
+
+	: > trace
+	cat trace_pipe > $out_file &
+	reader_pid=$!
+	sleep 1
+
+	# Write sizes that cover both the short and long raw-data encodings
+	# without overflowing the trace buffer before we can verify them.
+	for i in `seq 1 12`; do
+		write_buffer 0x12345678 $i
+	done
 
-	# Write a bunch of data where the id is the count of
-	# data to write
-	for i in `seq 1 10` `seq 101 110` `seq 1001 1010`; do
-		write_buffer $i $i
+	while [ "`grep -c ' buf:' $out_file 2> /dev/null || true`" -lt 12 ]; do
+		wait_iter=$((wait_iter + 1))
+		if [ $wait_iter -ge 10 ]; then
+			kill $reader_pid 2> /dev/null || true
+			wait $reader_pid 2> /dev/null || true
+			if [ -n "$pause_on_trace" ]; then
+				echo $pause_on_trace > options/pause-on-trace
+			fi
+			return 1
+		fi
+		sleep 1
 	done
 
 	# add a little buffer
 	echo stop > trace_marker
+	sleep 1
+	kill $reader_pid 2> /dev/null || true
+	wait $reader_pid 2> /dev/null || true
+	if [ -n "$pause_on_trace" ]; then
+		echo $pause_on_trace > options/pause-on-trace
+	fi
 
-	# Check to make sure the number of entries is the id (rounded up by 4)
-	awk '/.*: # [0-9a-f]* / {
-			print;
-			cnt = -1;
-			for (i = 0; i < NF; i++) {
-				# The counter is after the "#" marker
-				if ( $i == "#" ) {
-					i++;
-					cnt = strtonum("0x" $i);
-					num = NF - (i + 1);
-					# The number of items is always rounded up by 4
-					cnt2 = int((cnt + 3) / 4) * 4;
-					if (cnt2 != num) {
-						exit 1;
-					}
-					break;
-				}
-			}
-		}
-	// { if (NR > 30) { exit 0; } } ' trace_pipe;
+	grep ' buf:' $out_file > $match_file || return 1
+	if [ "`wc -l < $match_file`" -ne 12 ]; then
+		cat $match_file
+		return 1
+	fi
+
+	# Check to make sure the number of byte values matches the id exactly.
+	for expected in `seq 1 12`; do
+		line=`sed -n "${expected}p" $match_file`
+		if [ -z "$line" ]; then
+			return 1
+		fi
+		rest=${line#* buf: }
+		set -- $rest
+		if [ "$#" -ne "$expected" ]; then
+			echo "$line"
+			return 1
+		fi
+	done
 }
 
 
@@ -107,13 +139,6 @@ test_buffer() {
 
 ORIG=`cat buffer_size_kb`
 
-# test_multiple_writes test needs at least 12KB buffer
-NEW_SIZE=12
-
-if [ ${ORIG} -lt ${NEW_SIZE} ]; then
-	echo ${NEW_SIZE} > buffer_size_kb
-fi
-
 test_buffer
 if ! test_multiple_writes; then
 	echo ${ORIG} > buffer_size_kb
-- 
2.39.5 (Apple Git-154)
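
For comparison, the od-based endian probe in the patch above interprets
the bytes 01 00 00 00 as a 4-byte unsigned integer and checks whether the
result is 1. A minimal C equivalent of that idea, assuming nothing beyond
the standard library (the function name simply mirrors the shell helper):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Same idea as: printf '\001\000\000\000' | od -An -tu4
 * On a little-endian host the first byte is the least significant one,
 * so the four bytes read back as the value 1.
 */
static int is_little_endian(void)
{
	const unsigned char bytes[4] = { 1, 0, 0, 0 };
	uint32_t value;

	memcpy(&value, bytes, sizeof(value));
	return value == 1;
}
```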



* [PATCH 1/2] tracing: Store trace_marker_raw payload length in events
From: Cao Ruichuang @ 2026-04-08 15:32 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, shuah
  Cc: linux-kernel, linux-trace-kernel, linux-kselftest

trace_marker_raw currently records its bytes in TRACE_RAW_DATA events,
but the event output path derives the byte count from the padded record
size in the ring buffer. As a result, the printed raw-data payload is
rounded up and small writes do not preserve their true length.

Keep the true payload length in the TRACE_RAW_DATA event itself and use
that field when printing the bytes. This leaves the ring buffer record
size semantics unchanged while letting trace_marker_raw report the exact
payload that was written.

Signed-off-by: Cao Ruichuang <create0818@163.com>
---
 kernel/trace/trace.c         | 11 ++++++-----
 kernel/trace/trace_entries.h |  1 +
 kernel/trace/trace_output.c  |  4 ++--
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a626211ce..d9cb643b8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6906,11 +6906,13 @@ static ssize_t write_raw_marker_to_buffer(struct trace_array *tr,
 	struct ring_buffer_event *event;
 	struct trace_buffer *buffer;
 	struct raw_data_entry *entry;
+	size_t payload_len;
 	ssize_t written;
 	size_t size;
 
 	/* cnt includes both the entry->id and the data behind it. */
-	size = struct_offset(entry, id) + cnt;
+	payload_len = cnt - sizeof(entry->id);
+	size = struct_offset(entry, buf) + payload_len;
 
 	buffer = tr->array_buffer.buffer;
 
@@ -6924,10 +6926,9 @@ static ssize_t write_raw_marker_to_buffer(struct trace_array *tr,
 		return -EBADF;
 
 	entry = ring_buffer_event_data(event);
-	unsafe_memcpy(&entry->id, buf, cnt,
-		      "id and content already reserved on ring buffer"
-		      "'buf' includes the 'id' and the data."
-		      "'entry' was allocated with cnt from 'id'.");
+	memcpy(&entry->id, buf, sizeof(entry->id));
+	entry->len = payload_len;
+	memcpy(entry->buf, buf + sizeof(entry->id), payload_len);
 	written = cnt;
 
 	__buffer_unlock_commit(buffer, event);
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468f..5f867a144 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -288,6 +288,7 @@ FTRACE_ENTRY(raw_data, raw_data_entry,
 
 	F_STRUCT(
 		__field(	unsigned int,	id	)
+		__field(	unsigned int,	len	)
 		__dynamic_array(	char,	buf	)
 	),
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 1996d7aba..4e1edfa05 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1817,13 +1817,13 @@ static enum print_line_t trace_raw_data(struct trace_iterator *iter, int flags,
 					 struct trace_event *event)
 {
 	struct raw_data_entry *field;
-	int i;
+	unsigned int i;
 
 	trace_assign_type(field, iter->ent);
 
 	trace_seq_printf(&iter->seq, "# %x buf:", field->id);
 
-	for (i = 0; i < iter->ent_size - offsetof(struct raw_data_entry, buf); i++)
+	for (i = 0; i < field->len; i++)
 		trace_seq_printf(&iter->seq, " %02x",
 				 (unsigned char)field->buf[i]);
 
-- 
2.39.5 (Apple Git-154)
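
The length bookkeeping in the patch above can be approximated in
userspace as follows. This is a sketch under stated assumptions: struct
raw_entry is a hypothetical fixed-size stand-in, not the kernel's
raw_data_entry layout, and the explicit short-write check is an addition
for the demo rather than something taken from the patch. padded_len()
mirrors the "rounded up by 4" rule that the old selftest checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the raw-data entry; not the kernel layout. */
struct raw_entry {
	unsigned int id;
	unsigned int len;	/* exact payload length, as the patch adds */
	unsigned char buf[64];
};

/* Old printing rule: byte count taken from a record padded to 4 bytes. */
static size_t padded_len(size_t payload_len)
{
	return (payload_len + 3) / 4 * 4;
}

/*
 * Split a write of cnt bytes (id followed by payload) into the entry.
 * Returns 0 on success, -1 if the write is too short to hold an id or
 * the payload does not fit this sketch's fixed buffer.
 */
static int fill_entry(struct raw_entry *e, const unsigned char *data,
		      size_t cnt)
{
	size_t payload_len;

	if (cnt < sizeof(e->id))
		return -1;
	payload_len = cnt - sizeof(e->id);
	if (payload_len > sizeof(e->buf))
		return -1;

	memcpy(&e->id, data, sizeof(e->id));
	e->len = (unsigned int)payload_len;
	memcpy(e->buf, data + sizeof(e->id), payload_len);
	return 0;
}
```

With a 5-byte payload the stored len is 5, whereas the padded rule would
have reported 8 bytes.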



* Re: [PATCH v2 1/2] module/kallsyms: fix nextval for data symbol lookup
From: Petr Pavlu @ 2026-04-08 15:24 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-modules, Sami Tolvanen, Luis Chamberlain, linux-kernel,
	linux-trace-kernel, live-patching, Daniel Gomez, Aaron Tomlin,
	Steven Rostedt, Masami Hiramatsu, Jordan Rome, Viktor Malik
In-Reply-To: <20260327110005.16499-1-stf_xl@wp.pl>

On 3/27/26 12:00 PM, Stanislaw Gruszka wrote:
> The symbol lookup code assumes the queried address resides in either
> MOD_TEXT or MOD_INIT_TEXT. This breaks for addresses in other module
> memory regions (e.g. rodata or data), resulting in incorrect upper
> bounds and wrong symbol size.
> 
> Select the module memory region the address belongs to instead of
> hardcoding text sections. Also initialize the lower bound to the start
> of that region, as searching from address 0 is unnecessary.
> 
> Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>

Looks ok to me. Feel free to add:

Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>

As a side note, I wonder if manually determining symbol sizes this way
is the best approach for modules, instead of simply returning the
st_size of the symbol. The logic comes from the original implementation
in "[PATCH] kallsyms for new modules" [1]. Unfortunately, the
description doesn't explain this aspect but considering that the patch
rewrote both the main and module kallsyms code, I expect it was done
this way for consistency between vmlinux and modules.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=d069cf94ca296b7fb4c7e362e8f27e2c8aca70f1

-- 
Thanks,
Petr
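
To make the quoted change concrete, here is a small sketch of picking the
search bounds from the region that contains the address, rather than
assuming a text region. The struct and function names are hypothetical
and only model the idea in the commit message; they are not the module
loader's types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one module memory region (text, rodata, ...). */
struct region {
	uintptr_t base;
	size_t size;
};

/* Return the region containing addr, or NULL if none does. */
static const struct region *find_region(const struct region *r, size_t n,
					uintptr_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= r[i].base && addr - r[i].base < r[i].size)
			return &r[i];
	return NULL;
}

/*
 * Lower bound starts at the region base instead of 0; the upper bound
 * (the initial "next value") is the end of that same region.
 */
static int lookup_bounds(const struct region *r, size_t n, uintptr_t addr,
			 uintptr_t *lo, uintptr_t *hi)
{
	const struct region *m = find_region(r, n, addr);

	if (!m)
		return -1;
	*lo = m->base;
	*hi = m->base + m->size;
	return 0;
}
```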


* [PATCH v2] seq_buf: export seq_buf_putmem_hex() and add KUnit tests
From: Shuvam Pandey @ 2026-04-08 14:44 UTC (permalink / raw)
  To: Andrew Morton, Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, shuvampandey1
In-Reply-To: <20260406033728.25998-1-shuvampandey1@gmail.com>

The seq_buf KUnit suite does not exercise seq_buf_putmem_hex().

Add one test for the len > 8 chunking path and one overflow test
where a later chunk no longer fits in the buffer.

Export seq_buf_putmem_hex() as well so SEQ_BUF_KUNIT_TEST=m links
cleanly. Without the export, modpost reports seq_buf_putmem_hex as
undefined when seq_buf_kunit is built as a module.

Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
---
v2:
- export seq_buf_putmem_hex() so SEQ_BUF_KUNIT_TEST=m links cleanly
- validate with a fresh arm64 build using CONFIG_KUNIT=y and CONFIG_SEQ_BUF_KUNIT_TEST=m

 lib/seq_buf.c             |  1 +
 lib/tests/seq_buf_kunit.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index f3f3436d60a9403eae5b1ef9b091b027881f14fb..b59488fa8135cdb0340fbeb43d8d74db8ae13146 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -298,6 +298,7 @@ int seq_buf_putmem_hex(struct seq_buf *s, const void *mem,
 	}
 	return 0;
 }
+EXPORT_SYMBOL_GPL(seq_buf_putmem_hex);
 
 /**
  * seq_buf_path - copy a path into the sequence buffer
diff --git a/lib/tests/seq_buf_kunit.c b/lib/tests/seq_buf_kunit.c
index 8a01579a978e655cd09024d0ea9c4c9cd095263f..eb466386bbefb1c81773cdae65a8ac3df91cd8ea 100644
--- a/lib/tests/seq_buf_kunit.c
+++ b/lib/tests/seq_buf_kunit.c
@@ -184,6 +184,38 @@ static void seq_buf_get_buf_commit_test(struct kunit *test)
 	KUNIT_EXPECT_TRUE(test, seq_buf_has_overflowed(&s));
 }
 
+static void seq_buf_putmem_hex_test(struct kunit *test)
+{
+	DECLARE_SEQ_BUF(s, 24);
+	const u8 data[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
+#ifdef __BIG_ENDIAN
+	const char *expected = "0001020304050607 0809 ";
+#else
+	const char *expected = "0706050403020100 0908 ";
+#endif
+
+	KUNIT_EXPECT_EQ(test, seq_buf_putmem_hex(&s, data, sizeof(data)), 0);
+	KUNIT_EXPECT_FALSE(test, seq_buf_has_overflowed(&s));
+	KUNIT_EXPECT_EQ(test, seq_buf_used(&s), strlen(expected));
+	KUNIT_EXPECT_STREQ(test, seq_buf_str(&s), expected);
+}
+
+static void seq_buf_putmem_hex_overflow_test(struct kunit *test)
+{
+	DECLARE_SEQ_BUF(s, 20);
+	const u8 data[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
+#ifdef __BIG_ENDIAN
+	const char *expected = "0001020304050607 ";
+#else
+	const char *expected = "0706050403020100 ";
+#endif
+
+	KUNIT_EXPECT_EQ(test, seq_buf_putmem_hex(&s, data, sizeof(data)), -1);
+	KUNIT_EXPECT_TRUE(test, seq_buf_has_overflowed(&s));
+	KUNIT_EXPECT_EQ(test, seq_buf_used(&s), 20);
+	KUNIT_EXPECT_STREQ(test, seq_buf_str(&s), expected);
+}
+
 static struct kunit_case seq_buf_test_cases[] = {
 	KUNIT_CASE(seq_buf_init_test),
 	KUNIT_CASE(seq_buf_declare_test),
@@ -194,6 +226,8 @@ static struct kunit_case seq_buf_test_cases[] = {
 	KUNIT_CASE(seq_buf_printf_test),
 	KUNIT_CASE(seq_buf_printf_overflow_test),
 	KUNIT_CASE(seq_buf_get_buf_commit_test),
+	KUNIT_CASE(seq_buf_putmem_hex_test),
+	KUNIT_CASE(seq_buf_putmem_hex_overflow_test),
 	{}
 };
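
The expected strings in the new tests follow from dumping the input in
chunks of up to 8 bytes, reversing each chunk on little-endian hosts and
appending a space after each chunk. A userspace sketch of that behaviour
is below; hex_chunks() and its explicit little_endian parameter are
assumptions made for illustration and are not the seq_buf API, and the
overflow accounting is simplified to "stop and return -1".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES 8	/* chunk size implied by the "len > 8" test path */

/*
 * Write data as hex into out, one space-terminated group per chunk.
 * Returns 0 on success, -1 as soon as a chunk no longer fits; chunks
 * written before that point are left in out.
 */
static int hex_chunks(const unsigned char *data, size_t len,
		      int little_endian, char *out, size_t out_size)
{
	static const char digits[] = "0123456789abcdef";
	size_t used = 0;

	if (out_size == 0)
		return -1;
	out[0] = '\0';

	while (len) {
		size_t n = len < CHUNK_BYTES ? len : CHUNK_BYTES;
		size_t need = n * 2 + 1;	/* two digits per byte + space */
		size_t i;

		if (used + need + 1 > out_size)	/* +1 for the NUL */
			return -1;

		for (i = 0; i < n; i++) {
			unsigned char b = little_endian ? data[n - 1 - i]
							: data[i];

			out[used++] = digits[b >> 4];
			out[used++] = digits[b & 0x0f];
		}
		out[used++] = ' ';
		out[used] = '\0';

		data += n;
		len -= n;
	}
	return 0;
}
```

Feeding it the ten bytes 0..9 reproduces both expected strings from the
patch, depending on the little_endian argument, and a 20-byte output
buffer stops after the first chunk as in the overflow test.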
 


* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Jeff Layton @ 2026-04-08 14:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Christian Brauner, Chuck Lever, Alexander Aring,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker,
	Amir Goldstein, Calum Mackay, linux-fsdevel, linux-kernel,
	linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <snnggefctfffpb3rsyhjdwmxozqdklqmweiojmxy7owettksgz@6vud2iacgeqc>

On Wed, 2026-04-08 at 15:45 +0200, Jan Kara wrote:
> On Tue 07-04-26 09:21:14, Jeff Layton wrote:
> > If a NFS client requests a directory delegation with a notification
> > bitmask covering directory change events, the server shouldn't recall
> > the delegation. Instead the client will be notified of the change after
> > the fact.
> > 
> > Add support for ignoring lease breaks on directory changes. Add a new
> > flags parameter to try_break_deleg() and teach __break_lease how to
> > ignore certain types of delegation break events.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Looks good. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > @@ -222,6 +225,10 @@ struct file_lease *locks_alloc_lease(void);
> >  #define LEASE_BREAK_LAYOUT		BIT(2)	// break layouts only
> >  #define LEASE_BREAK_NONBLOCK		BIT(3)	// non-blocking break
> >  #define LEASE_BREAK_OPEN_RDONLY		BIT(4)	// readonly open event
> > +#define LEASE_BREAK_DIR_CREATE		BIT(6)  // dir deleg create event
> > +#define LEASE_BREAK_DIR_DELETE		BIT(7)  // dir deleg delete event
> > +#define LEASE_BREAK_DIR_RENAME		BIT(8)  // dir deleg rename event
> 
> Just curious why you've left out bit 5 here... :)
> 
> 								Honza

No reason. I've had this series for a couple of years now, and I think
bit 5 got removed at some point after I originally did this patch, and
I didn't notice when I fixed up the conflict. I'll plan to renumber
this for neatness' sake.

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 00/24] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Jan Kara @ 2026-04-08 13:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:13, Jeff Layton wrote:
> This patchset builds on the directory delegation work we did a few
> months ago, to add support for CB_NOTIFY callbacks for some events. In
> particular, creates, unlinks and renames. The server also sends updated
> directory attributes in the notifications. With this support, the client
> can register interest in a directory and get notifications about changes
> within it without losing its lease.
> 
> The series starts with patches to allow the vfs to ignore certain types
> of events on directories. nfsd can then request these sorts of
> delegations on directories, and then set up inotify watches on the
> directory to trigger sending CB_NOTIFY events.
> 
> This has mainly been tested with pynfs, with some new testcases that
> I'll be posting soon. They seem to work fine with those tests, but I
> don't think we'll want to merge these until we have a complete
> client-side implementation to test against.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

The fsnotify changes and generic file locking changes look OK to me. I
don't feel confident enough with NFSD stuff to really review that :)

								Honza

> ---
> Jeff Layton (24):
>       filelock: add support for ignoring deleg breaks for dir change events
>       filelock: add a tracepoint to start of break_lease()
>       filelock: add an inode_lease_ignore_mask helper
>       nfsd: add protocol support for CB_NOTIFY
>       nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
>       nfsd: allow nfsd to get a dir lease with an ignore mask
>       vfs: add fsnotify_modify_mark_mask()
>       nfsd: update the fsnotify mark when setting or removing a dir delegation
>       nfsd: make nfsd4_callback_ops->prepare operation bool return
>       nfsd: add callback encoding and decoding linkages for CB_NOTIFY
>       nfsd: use RCU to protect fi_deleg_file
>       nfsd: add data structures for handling CB_NOTIFY
>       nfsd: add notification handlers for dir events
>       nfsd: add tracepoint to dir_event handler
>       nfsd: apply the notify mask to the delegation when requested
>       nfsd: add helper to marshal a fattr4 from completed args
>       nfsd: allow nfsd4_encode_fattr4_change() to work with no export
>       nfsd: send basic file attributes in CB_NOTIFY
>       nfsd: allow encoding a filehandle into fattr4 without a svc_fh
>       nfsd: add a fi_connectable flag to struct nfs4_file
>       nfsd: add the filehandle to returned attributes in CB_NOTIFY
>       nfsd: properly track requested child attributes
>       nfsd: track requested dir attributes
>       nfsd: add support to CB_NOTIFY for dir attribute changes
> 
>  Documentation/sunrpc/xdr/nfs4_1.x    | 264 ++++++++++++++-
>  fs/attr.c                            |   2 +-
>  fs/locks.c                           |  89 +++++-
>  fs/namei.c                           |  31 +-
>  fs/nfsd/filecache.c                  |  57 +++-
>  fs/nfsd/nfs4callback.c               |  60 +++-
>  fs/nfsd/nfs4layouts.c                |   5 +-
>  fs/nfsd/nfs4proc.c                   |  15 +
>  fs/nfsd/nfs4state.c                  | 524 ++++++++++++++++++++++++++----
>  fs/nfsd/nfs4xdr.c                    | 300 ++++++++++++++---
>  fs/nfsd/nfs4xdr_gen.c                | 601 ++++++++++++++++++++++++++++++++++-
>  fs/nfsd/nfs4xdr_gen.h                |  20 +-
>  fs/nfsd/state.h                      |  70 +++-
>  fs/nfsd/trace.h                      |  21 ++
>  fs/nfsd/xdr4.h                       |   5 +
>  fs/nfsd/xdr4cb.h                     |  12 +
>  fs/notify/mark.c                     |  29 ++
>  fs/posix_acl.c                       |   4 +-
>  fs/xattr.c                           |   4 +-
>  include/linux/filelock.h             |  54 +++-
>  include/linux/fsnotify_backend.h     |   1 +
>  include/linux/nfs4.h                 | 127 --------
>  include/linux/sunrpc/xdrgen/nfs4_1.h | 291 ++++++++++++++++-
>  include/trace/events/filelock.h      |  38 ++-
>  include/uapi/linux/nfs4.h            |   2 -
>  25 files changed, 2321 insertions(+), 305 deletions(-)
> ---
> base-commit: bd5b9fd5e3d55bc412cec4bebe5a11da2151de4a
> change-id: 20260325-dir-deleg-339066dd1017
> 
> Best regards,
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Jan Kara @ 2026-04-08 13:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-8-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:21, Jeff Layton wrote:
> Add a new helper function that will update the mask on the nfsd_file's
> fsnotify_mark to be a union of all current directory delegations on an
> inode. Call that when directory delegations are added or removed.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/nfsd/nfs4state.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c8fb84c38637..9a4cff08c67d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -1258,6 +1258,37 @@ static void nfsd4_finalize_deleg_timestamps(struct nfs4_delegation *dp, struct f
>  	}
>  }
>  
> +static void nfsd_fsnotify_recalc_mask(struct nfsd_file *nf)
> +{
> +	struct fsnotify_mark *mark = &nf->nf_mark->nfm_mark;
> +	struct inode *inode = file_inode(nf->nf_file);
> +	u32 lease_mask, set = 0, clear = 0;
> +
> +	/* This is only needed when adding or removing dir delegs */
> +	if (!S_ISDIR(inode->i_mode))
> +		return;
> +
> +	/* Set up notifications for any ignored delegation events */
> +	lease_mask = inode_lease_ignore_mask(inode);
> +
> +	if (lease_mask & FL_IGN_DIR_CREATE)
> +		set |= FS_CREATE;
> +	else
> +		clear |= FS_CREATE;
> +
> +	if (lease_mask & FL_IGN_DIR_DELETE)
> +		set |= FS_DELETE;
> +	else
> +		clear |= FS_DELETE;
> +
> +	if (lease_mask & FL_IGN_DIR_RENAME)
> +		set |= FS_RENAME;
> +	else
> +		clear |= FS_RENAME;
> +
> +	fsnotify_modify_mark_mask(mark, set, clear);
> +}
> +
>  static void nfs4_unlock_deleg_lease(struct nfs4_delegation *dp)
>  {
>  	struct nfs4_file *fp = dp->dl_stid.sc_file;
> @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct nfs4_delegation *dp)
>  	WARN_ON_ONCE(!fp->fi_delegees);
>  
>  	nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> +	nfsd_fsnotify_recalc_mask(nf);
>  	kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
>  	put_deleg_file(fp);
>  }
> @@ -9652,6 +9684,7 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
>  
>  	if (!status) {
>  		put_nfs4_file(fp);
> +		nfsd_fsnotify_recalc_mask(nf);
>  		return dp;
>  	}
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
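
As a reading aid for the nfsd_fsnotify_recalc_mask() hunk quoted above, here
is a minimal userspace sketch of the same set/clear derivation. The DEMO_*
constants and struct mask_update are hypothetical stand-ins, not the kernel's
FL_IGN_DIR_* or FS_* values. The real function additionally checks S_ISDIR()
and hands the result to fsnotify_modify_mark_mask().

```c
#include <stdint.h>

/* Hypothetical stand-ins for the FL_IGN_DIR_* lease flags. */
#define DEMO_IGN_DIR_CREATE (1u << 0)
#define DEMO_IGN_DIR_DELETE (1u << 1)
#define DEMO_IGN_DIR_RENAME (1u << 2)

/* Hypothetical stand-ins for the FS_* fsnotify event bits. */
#define DEMO_FS_CREATE (1u << 8)
#define DEMO_FS_DELETE (1u << 9)
#define DEMO_FS_RENAME (1u << 10)

struct mask_update {
	uint32_t set;	/* event bits to turn on in the mark */
	uint32_t clear;	/* event bits to turn off in the mark */
};

/*
 * Every event lands in exactly one of the two masks: "set" when some
 * lease ignores breaks for that event (so a notification is wanted),
 * "clear" otherwise.
 */
static struct mask_update lease_to_mark_update(uint32_t lease_mask)
{
	struct mask_update u = { 0, 0 };

	if (lease_mask & DEMO_IGN_DIR_CREATE)
		u.set |= DEMO_FS_CREATE;
	else
		u.clear |= DEMO_FS_CREATE;

	if (lease_mask & DEMO_IGN_DIR_DELETE)
		u.set |= DEMO_FS_DELETE;
	else
		u.clear |= DEMO_FS_DELETE;

	if (lease_mask & DEMO_IGN_DIR_RENAME)
		u.set |= DEMO_FS_RENAME;
	else
		u.clear |= DEMO_FS_RENAME;

	return u;
}
```

Because each event goes to one mask or the other, set and clear never
overlap, which is what the WARN_ON_ONCE(clear & set) in patch 07/24 expects.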

^ permalink raw reply

* Re: [PATCH 03/24] filelock: add an inode_lease_ignore_mask helper
From: Jan Kara @ 2026-04-08 13:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-3-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:16, Jeff Layton wrote:
> Add a new routine that returns a mask of all dir change events that are
> currently ignored by any leases. nfsd will use this to determine how to
> configure the fsnotify_mark mask.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/locks.c               | 32 ++++++++++++++++++++++++++++++++
>  include/linux/filelock.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 5af6dca2d46c..04980b065734 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1597,6 +1597,38 @@ any_leases_conflict(struct inode *inode, struct file_lease *breaker)
>  	return false;
>  }
>  
> +#define IGNORE_MASK	(FL_IGN_DIR_CREATE | FL_IGN_DIR_DELETE | FL_IGN_DIR_RENAME)
> +
> +/**
> + * inode_lease_ignore_mask - return union of all ignored inode events for this inode
> + * @inode: inode of which to get ignore mask
> + *
> + * Walk the list of leases, and return the result of all of
> + * their FL_IGN_DIR_* bits or'ed together.
> + */
> +u32
> +inode_lease_ignore_mask(struct inode *inode)
> +{
> +	struct file_lock_context *ctx;
> +	struct file_lock_core *flc;
> +	u32 mask = 0;
> +
> +	ctx = locks_inode_context(inode);
> +	if (!ctx)
> +		return 0;
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> +		mask |= flc->flc_flags & IGNORE_MASK;
> +		/* If we already have everything, we can stop */
> +		if (mask == IGNORE_MASK)
> +			break;
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +	return mask;
> +}
> +EXPORT_SYMBOL_GPL(inode_lease_ignore_mask);
> +
>  static bool
>  ignore_dir_deleg_break(struct file_lease *fl, unsigned int flags)
>  {
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 5a19cdb047da..416483b136f1 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -236,6 +236,7 @@ int generic_setlease(struct file *, int, struct file_lease **, void **priv);
>  int kernel_setlease(struct file *, int, struct file_lease **, void **);
>  int vfs_setlease(struct file *, int, struct file_lease **, void **);
>  int lease_modify(struct file_lease *, int, struct list_head *);
> +u32 inode_lease_ignore_mask(struct inode *inode);
>  
>  struct notifier_block;
>  int lease_register_notifier(struct notifier_block *);
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
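
The lease walk in inode_lease_ignore_mask() quoted above can be modeled in
userspace roughly as follows. This is only a sketch: a plain array stands in
for the ctx->flc_lease list, the DEMO_* bit values are made up, and the
flc_lock spinlock the kernel code takes is not modeled at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FL_IGN_DIR_* lease flags. */
#define DEMO_IGN_DIR_CREATE (1u << 0)
#define DEMO_IGN_DIR_DELETE (1u << 1)
#define DEMO_IGN_DIR_RENAME (1u << 2)
#define DEMO_IGNORE_MASK \
	(DEMO_IGN_DIR_CREATE | DEMO_IGN_DIR_DELETE | DEMO_IGN_DIR_RENAME)

/*
 * OR together the ignore bits of every lease. Bits outside
 * DEMO_IGNORE_MASK are filtered out, and the walk stops early once all
 * ignore bits have been seen, since further leases cannot add anything.
 */
static uint32_t union_ignore_mask(const uint32_t *lease_flags, size_t n)
{
	uint32_t mask = 0;

	for (size_t i = 0; i < n; i++) {
		mask |= lease_flags[i] & DEMO_IGNORE_MASK;
		if (mask == DEMO_IGNORE_MASK)
			break;
	}
	return mask;
}
```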

^ permalink raw reply

* Re: [PATCH 07/24] vfs: add fsnotify_modify_mark_mask()
From: Jan Kara @ 2026-04-08 13:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-7-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:20, Jeff Layton wrote:
> nfsd needs to be able to modify the mask on an existing mark when new
> directory delegations are set or unset. Add an exported function that
> allows the caller to set and clear bits in the mark->mask, and does
> the recalculation if something changed.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  fs/notify/mark.c                 | 29 +++++++++++++++++++++++++++++
>  include/linux/fsnotify_backend.h |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index c2ed5b11b0fe..b1e73c6fd382 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -310,6 +310,35 @@ void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  		fsnotify_conn_set_children_dentry_flags(conn);
>  }
>  
> +/**
> + * fsnotify_modify_mark_mask - set and/or clear flags in a mark's mask
> + * @mark: mark to be modified
> + * @set: bits to be set in mask
> + * @clear: bits to be cleared in mask
> + *
> + * Modify a fsnotify_mark mask as directed, and update its associated conn.
> + * The caller is expected to hold a reference to the mark.
> + */
> +void fsnotify_modify_mark_mask(struct fsnotify_mark *mark, u32 set, u32 clear)
> +{
> +	bool recalc = false;
> +	u32 mask;
> +
> +	WARN_ON_ONCE(clear & set);
> +
> +	spin_lock(&mark->lock);
> +	mask = mark->mask;
> +	mark->mask |= set;
> +	mark->mask &= ~clear;
> +	if (mark->mask != mask)
> +		recalc = true;
> +	spin_unlock(&mark->lock);
> +
> +	if (recalc)
> +		fsnotify_recalc_mask(mark->connector);
> +}
> +EXPORT_SYMBOL_GPL(fsnotify_modify_mark_mask);
> +
>  /* Free all connectors queued for freeing once SRCU period ends */
>  static void fsnotify_connector_destroy_workfn(struct work_struct *work)
>  {
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 95985400d3d8..66e185bd1b1b 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -917,6 +917,7 @@ extern void fsnotify_get_mark(struct fsnotify_mark *mark);
>  extern void fsnotify_put_mark(struct fsnotify_mark *mark);
>  extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info);
>  extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
> +extern void fsnotify_modify_mark_mask(struct fsnotify_mark *mark, u32 set, u32 clear);
>  
>  static inline void fsnotify_init_event(struct fsnotify_event *event)
>  {
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
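
For readers following the thread, the core of fsnotify_modify_mark_mask()
quoted above is a set/clear update with change detection. A minimal userspace
sketch, assuming a bare uint32_t in place of mark->mask and leaving out the
mark->lock spinlock (modify_mask is a hypothetical name):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Apply "set" and then "clear" to *mask. Returns true only when the
 * stored value actually changed, which is the condition under which the
 * kernel helper goes on to call fsnotify_recalc_mask().
 */
static bool modify_mask(uint32_t *mask, uint32_t set, uint32_t clear)
{
	uint32_t old = *mask;

	*mask |= set;
	*mask &= ~clear;
	return *mask != old;
}
```

Returning the "changed" result lets a caller skip the more expensive
recalculation when the requested bits were already in the desired state.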

^ permalink raw reply

* Re: [PATCH 02/24] filelock: add a tracepoint to start of break_lease()
From: Jan Kara @ 2026-04-08 13:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-2-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:15, Jeff Layton wrote:
> ...mostly to show the LEASE_BREAK_* flags.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

OK. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/locks.c                      |  2 ++
>  include/trace/events/filelock.h | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index dafa0752fdce..5af6dca2d46c 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1654,6 +1654,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
>  	bool want_write = !(flags & LEASE_BREAK_OPEN_RDONLY);
>  	int error = 0;
>  
> +	trace_break_lease(inode, flags);
> +
>  	if (flags & LEASE_BREAK_LEASE)
>  		type = FL_LEASE;
>  	else if (flags & LEASE_BREAK_DELEG)
> diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> index ef4bb0afb86a..fff0ee2d452d 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -120,6 +120,39 @@ DEFINE_EVENT(filelock_lock, flock_lock_inode,
>  		TP_PROTO(struct inode *inode, struct file_lock *fl, int ret),
>  		TP_ARGS(inode, fl, ret));
>  
> +#define show_lease_break_flags(val)					\
> +	__print_flags(val, "|",						\
> +		{ LEASE_BREAK_LEASE,		"LEASE" },		\
> +		{ LEASE_BREAK_DELEG,		"DELEG" },		\
> +		{ LEASE_BREAK_LAYOUT,		"LAYOUT" },		\
> +		{ LEASE_BREAK_NONBLOCK,		"NONBLOCK" },		\
> +		{ LEASE_BREAK_OPEN_RDONLY,	"OPEN_RDONLY" },	\
> +		{ LEASE_BREAK_DIR_CREATE,	"DIR_CREATE" },		\
> +		{ LEASE_BREAK_DIR_DELETE,	"DIR_DELETE" },		\
> +		{ LEASE_BREAK_DIR_RENAME,	"DIR_RENAME" })
> +
> +TRACE_EVENT(break_lease,
> +	TP_PROTO(struct inode *inode, unsigned int flags),
> +
> +	TP_ARGS(inode, flags),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, i_ino)
> +		__field(dev_t, s_dev)
> +		__field(unsigned int, flags)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->s_dev = inode->i_sb->s_dev;
> +		__entry->i_ino = inode->i_ino;
> +		__entry->flags = flags;
> +	),
> +
> +	TP_printk("dev=0x%x:0x%x ino=0x%lx flags=%s",
> +		  MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
> +		  __entry->i_ino, show_lease_break_flags(__entry->flags))
> +);
> +
>  DECLARE_EVENT_CLASS(filelock_lease,
>  	TP_PROTO(struct inode *inode, struct file_lease *fl),
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
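
To illustrate what the show_lease_break_flags() table above produces, here is
a rough userspace sketch of flag-to-name decoding with a "|" delimiter. It is
not the tracing implementation: format_flags and demo_break_flags are
hypothetical names, bits 0 and 1 for LEASE and DELEG are an assumption (the
thread only shows bits 2-4 and 6-8), and bits without a name are simply
ignored here.

```c
#include <stddef.h>
#include <stdio.h>

struct flag_name {
	unsigned int bit;
	const char *name;
};

/* Bits 0 and 1 are assumed; the rest follow the quoted filelock.h hunk. */
static const struct flag_name demo_break_flags[] = {
	{ 1u << 0, "LEASE" },
	{ 1u << 1, "DELEG" },
	{ 1u << 2, "LAYOUT" },
	{ 1u << 3, "NONBLOCK" },
	{ 1u << 4, "OPEN_RDONLY" },
	{ 1u << 6, "DIR_CREATE" },
	{ 1u << 7, "DIR_DELETE" },
	{ 1u << 8, "DIR_RENAME" },
};

/* Write the names of all set bits into buf, separated by '|'. */
static void format_flags(unsigned int val, char *buf, size_t len)
{
	size_t used = 0;
	size_t count = sizeof(demo_break_flags) / sizeof(demo_break_flags[0]);

	if (len == 0)
		return;
	buf[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int n;

		if (!(val & demo_break_flags[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", demo_break_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output error or truncation: stop here */
		used += (size_t)n;
	}
}
```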

^ permalink raw reply

* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Jan Kara @ 2026-04-08 13:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-1-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:14, Jeff Layton wrote:
> If a NFS client requests a directory delegation with a notification
> bitmask covering directory change events, the server shouldn't recall
> the delegation. Instead the client will be notified of the change after
> the fact.
> 
> Add support for ignoring lease breaks on directory changes. Add a new
> flags parameter to try_break_deleg() and teach __break_lease how to
> ignore certain types of delegation break events.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> @@ -222,6 +225,10 @@ struct file_lease *locks_alloc_lease(void);
>  #define LEASE_BREAK_LAYOUT		BIT(2)	// break layouts only
>  #define LEASE_BREAK_NONBLOCK		BIT(3)	// non-blocking break
>  #define LEASE_BREAK_OPEN_RDONLY		BIT(4)	// readonly open event
> +#define LEASE_BREAK_DIR_CREATE		BIT(6)  // dir deleg create event
> +#define LEASE_BREAK_DIR_DELETE		BIT(7)  // dir deleg delete event
> +#define LEASE_BREAK_DIR_RENAME		BIT(8)  // dir deleg rename event

Just curious why you've left out bit 5 here... :)

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH] perf: enable unprivileged syscall tracing with perf trace
From: Anubhav Shelat @ 2026-04-08 12:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-perf-users,
	linux-kernel, linux-trace-kernel
  Cc: Anubhav Shelat

Allow unprivileged users to trace their own processes' syscalls using
perf trace, similar to strace without the intrusive overhead of ptrace().

Currently, perf trace requires CAP_PERFMON or paranoid level ≤ 1 even
though the kernel has existing infrastructure (TRACE_EVENT_FL_CAP_ANY)
specifically designed to mark syscall tracepoints as safe for
unprivileged access. To fix this:

1. Loosen the condition in perf_event_open() which requires privileges
   for all events with exclude_kernel=0. This allows perf_event_open() to
   bypass the paranoid check for task-attached tracepoint events.

2. Make the format and id tracefs files world-readable only for tracepoints
   with TRACE_EVENT_FL_CAP_ANY, allowing unprivileged users to see
   syscall tracepoint ids without exposing sensitive information.

Example usage after this change:
  $ perf trace ls          # works as unprivileged user
  $ perf trace             # system-wide, still requires privileges
  $ perf trace -p 1234     # requires ptrace permission on pid 1234

Assisted-by: Claude:claude-sonnet-4.5
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 kernel/events/core.c        | 2 +-
 kernel/trace/trace_events.c | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..71d99ea4bea4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13833,7 +13833,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (err)
 		return err;
 
-	if (!attr.exclude_kernel) {
+	if (!attr.exclude_kernel && !(attr.type == PERF_TYPE_TRACEPOINT && pid != -1)) {
 		err = perf_allow_kernel();
 		if (err)
 			return err;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 249d1cba72c0..6250b2529376 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3051,7 +3051,9 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	struct trace_event_call *call = file->event_call;
 
 	if (strcmp(name, "format") == 0) {
-		*mode = TRACE_MODE_READ;
+		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+			(TRACE_MODE_READ | 0004) :
+			TRACE_MODE_READ;
 		*fops = &ftrace_event_format_fops;
 		return 1;
 	}
@@ -3087,7 +3089,9 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 #ifdef CONFIG_PERF_EVENTS
 	if (call->event.type && call->class->reg &&
 	    strcmp(name, "id") == 0) {
-		*mode = TRACE_MODE_READ;
+		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+		(TRACE_MODE_READ | 0004) :
+		TRACE_MODE_READ;
 		*data = (void *)(long)call->event.type;
 		*fops = &ftrace_event_id_fops;
 		return 1;
-- 
2.53.0
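
The new condition in the kernel/events/core.c hunk above can be read as a
small predicate. The sketch below is a userspace model of that one line only:
demo_needs_kernel_check and enum demo_perf_type are hypothetical names, and
any permission checks performed elsewhere in the kernel are out of scope.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the perf event types relevant here. */
enum demo_perf_type {
	DEMO_TYPE_HARDWARE,
	DEMO_TYPE_SOFTWARE,
	DEMO_TYPE_TRACEPOINT,
};

/*
 * Mirrors the patched condition:
 *   !attr.exclude_kernel &&
 *   !(attr.type == PERF_TYPE_TRACEPOINT && pid != -1)
 * A true result means perf_allow_kernel() would still be consulted.
 */
static bool demo_needs_kernel_check(bool exclude_kernel,
				    enum demo_perf_type type, int pid)
{
	bool task_tracepoint = (type == DEMO_TYPE_TRACEPOINT && pid != -1);

	return !exclude_kernel && !task_tracepoint;
}
```

As written in the hunk, the check is skipped for every task-attached
tracepoint event, not only for tracepoints marked TRACE_EVENT_FL_CAP_ANY.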


^ permalink raw reply related

* Re: [GIT PULL] rv changes for v7.1 (for-next)
From: Gabriele Monaco @ 2026-04-08 12:27 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <20260401151706.415601-2-gmonaco@redhat.com>

Hi Steve,

gentle ping: could you have a look at this pull request?

Thanks,
Gabriele

On Wed, 2026-04-01 at 17:17 +0200, Gabriele Monaco wrote:
> Steve,
> 
> The following changes since commit 7aaa8047eafd0bd628065b15757d9b48c5f9c07d:
> 
>   Linux 7.0-rc6 (2026-03-29 15:40:00 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/gmonaco/linux.git rv-7.1-next
> 
> for you to fetch changes up to 00f0dadde8c5036fe6462621a6920549036dce70:
> 
>   rv: Allow epoll in rtapp-sleep monitor (2026-04-01 15:18:30 +0200)
> 
> ----------------------------------------------------------------
> rv changes for v7.1 (for-next)
> 
> Summary of changes:
> 
> - Refactor da_monitor header to share handlers across monitor types
> 
>   No functional changes, only less code duplication.
> 
> - Add Hybrid Automata model class
> 
>   Add a new model class that extends deterministic automata by adding
>   constraints on transitions and states. Those constraints can take into
>   account wall-clock time and as such allow RV monitor to make
>   assertions on real time. Add documentation and code generation
>   scripts.
> 
> - Add stall monitor as hybrid automaton example
> 
>   Add a monitor that triggers a violation when a task is stalling as an
>   example of automaton working with real time variables.
> 
> - Convert the opid monitor to a hybrid automaton
> 
>   The opid monitor can be heavily simplified if written as a hybrid
>   automaton: instead of tracking preempt and interrupt enable/disable
>   events, it can just run constraints on the preemption/interrupt
>   states when events like wakeup and need_resched verify.
> 
> - Add support for per-object monitors in DA/HA
> 
>   Allow writing deterministic and hybrid automata monitors for generic
>   objects (e.g. any struct), by exploiting a hash table where objects
>   are saved. This allows tracking more than just tasks in RV. For
>   instance it will be used to track deadline entities in deadline
>   monitors.
> 
> - Add deadline tracepoints and move some deadline utilities
> 
>   Prepare the ground for deadline monitors by defining events and
>   exporting helpers.
> 
> - Add nomiss deadline monitor
> 
>   Add first example of deadline monitor asserting all entities complete
>   before their deadline.
> 
> - Improve rvgen error handling
> 
>   Introduce AutomataError exception class and better handle expected
>   exceptions while showing a backtrace for unexpected ones.
> 
> - Improve python code quality in rvgen
> 
>   Refactor the rvgen generation scripts to align with python best
>   practices: use f-strings instead of %, use len() instead of __len__(),
>   remove semicolons, use context managers for file operations, fix
>   whitespace violations, extract magic strings into constants, remove
>   unused imports and methods.
> 
> - Fix small bugs in rvgen
> 
>   The generator scripts presented some corner case bugs: logical error in
>   validating what a correct dot file looks like, fix an isinstance()
>   check, enforce a dot file has an initial state, fix type annotations
>   and typos in comments.
> 
> - rvgen refactoring
> 
>   Refactor automata.py to use iterator-based parsing and handle required
>   arguments directly in argparse.
> 
> - Allow epoll in rtapp-sleep monitor
> 
>   The epoll_wait call is now rt-friendly so it should be allowed in the
>   sleep monitor as a valid sleep method.
> 
> ----------------------------------------------------------------
> Gabriele Monaco (12):
>       rv: Unify DA event handling functions across monitor types
>       rv: Add Hybrid Automata monitor type
>       verification/rvgen: Allow spaces in and events strings
>       verification/rvgen: Add support for Hybrid Automata
>       Documentation/rv: Add documentation about hybrid automata
>       rv: Add sample hybrid monitor stall
>       rv: Convert the opid monitor to a hybrid automaton
>       rv: Add support for per-object monitors in DA/HA
>       verification/rvgen: Add support for per-obj monitors
>       sched: Add deadline tracepoints
>       sched/deadline: Move some utility functions to deadline.h
>       rv: Add nomiss deadline monitor
> 
> Nam Cao (1):
>       rv: Allow epoll in rtapp-sleep monitor
> 
> Wander Lairson Costa (19):
>       rv/rvgen: introduce AutomataError exception class
>       rv/rvgen: remove bare except clauses in generator
>       rv/rvgen: replace % string formatting with f-strings
>       rv/rvgen: replace __len__() calls with len()
>       rv/rvgen: remove unnecessary semicolons
>       rv/rvgen: use context managers for file operations
>       rv/rvgen: fix typos in automata and generator docstring and comments
>       rv/rvgen: fix PEP 8 whitespace violations
>       rv/rvgen: fix DOT file validation logic error
>       rv/rvgen: use class constant for init marker
>       rv/rvgen: refactor automata.py to use iterator-based parsing
>       rv/rvgen: remove unused sys import from dot2c
>       rv/rvgen: remove unused __get_main_name method
>       rv/rvgen: make monitor arguments required in rvgen
>       rv/rvgen: fix isinstance check in Variable.expand()
>       rv/rvgen: extract node marker string to class constant
>       rv/rvgen: enforce presence of initial state
>       rv/rvgen: fix unbound loop variable warning
>       rv/rvgen: fix _fill_states() return type annotation
> 
>  Documentation/tools/rv/index.rst                   |   1 +
>  Documentation/tools/rv/rv-mon-stall.rst            |  44 ++
>  Documentation/trace/rv/deterministic_automata.rst  |   2 +-
>  Documentation/trace/rv/hybrid_automata.rst         | 341 +++++++++++
>  Documentation/trace/rv/index.rst                   |   3 +
>  Documentation/trace/rv/monitor_deadline.rst        |  84 +++
>  Documentation/trace/rv/monitor_sched.rst           |  62 +-
>  Documentation/trace/rv/monitor_stall.rst           |  43 ++
>  Documentation/trace/rv/monitor_synthesis.rst       | 117 +++-
>  include/linux/rv.h                                 |  39 ++
>  include/linux/sched/deadline.h                     |  27 +
>  include/rv/da_monitor.h                            | 644 +++++++++++++++------
>  include/rv/ha_monitor.h                            | 478 +++++++++++++++
>  include/trace/events/sched.h                       |  26 +
>  kernel/sched/core.c                                |   5 +
>  kernel/sched/deadline.c                            |  51 +-
>  kernel/trace/rv/Kconfig                            |  18 +
>  kernel/trace/rv/Makefile                           |   3 +
>  kernel/trace/rv/monitors/deadline/Kconfig          |  10 +
>  kernel/trace/rv/monitors/deadline/deadline.c       |  44 ++
>  kernel/trace/rv/monitors/deadline/deadline.h       | 202 +++++++
>  kernel/trace/rv/monitors/nomiss/Kconfig            |  15 +
>  kernel/trace/rv/monitors/nomiss/nomiss.c           | 293 ++++++++++
>  kernel/trace/rv/monitors/nomiss/nomiss.h           | 123 ++++
>  kernel/trace/rv/monitors/nomiss/nomiss_trace.h     |  19 +
>  kernel/trace/rv/monitors/opid/Kconfig              |  11 +-
>  kernel/trace/rv/monitors/opid/opid.c               | 111 ++--
>  kernel/trace/rv/monitors/opid/opid.h               |  86 +--
>  kernel/trace/rv/monitors/opid/opid_trace.h         |   4 +
>  kernel/trace/rv/monitors/sleep/sleep.c             |   8 +
>  kernel/trace/rv/monitors/sleep/sleep.h             |  98 ++--
>  kernel/trace/rv/monitors/stall/Kconfig             |  13 +
>  kernel/trace/rv/monitors/stall/stall.c             | 150 +++++
>  kernel/trace/rv/monitors/stall/stall.h             |  81 +++
>  kernel/trace/rv/monitors/stall/stall_trace.h       |  19 +
>  kernel/trace/rv/rv_trace.h                         |  67 ++-
>  tools/verification/models/deadline/nomiss.dot      |  41 ++
>  tools/verification/models/rtapp/sleep.ltl          |   1 +
>  tools/verification/models/sched/opid.dot           |  36 +-
>  tools/verification/models/stall.dot                |  22 +
>  tools/verification/rvgen/__main__.py               |  27 +-
>  tools/verification/rvgen/dot2c                     |   1 -
>  tools/verification/rvgen/rvgen/automata.py         | 294 +++++++---
>  tools/verification/rvgen/rvgen/dot2c.py            | 105 +++-
>  tools/verification/rvgen/rvgen/dot2k.py            | 524 ++++++++++++++++-
>  tools/verification/rvgen/rvgen/generator.py        |  93 ++-
>  tools/verification/rvgen/rvgen/ltl2ba.py           |  11 +-
>  tools/verification/rvgen/rvgen/ltl2k.py            |  54 +-
>  .../rvgen/rvgen/templates/dot2k/main.c             |   2 +-
>  .../rvgen/rvgen/templates/dot2k/trace_hybrid.h     |  16 +
>  50 files changed, 3883 insertions(+), 686 deletions(-)
>  create mode 100644 Documentation/tools/rv/rv-mon-stall.rst
>  create mode 100644 Documentation/trace/rv/hybrid_automata.rst
>  create mode 100644 Documentation/trace/rv/monitor_deadline.rst
>  create mode 100644 Documentation/trace/rv/monitor_stall.rst
>  create mode 100644 include/rv/ha_monitor.h
>  create mode 100644 kernel/trace/rv/monitors/deadline/Kconfig
>  create mode 100644 kernel/trace/rv/monitors/deadline/deadline.c
>  create mode 100644 kernel/trace/rv/monitors/deadline/deadline.h
>  create mode 100644 kernel/trace/rv/monitors/nomiss/Kconfig
>  create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss.c
>  create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss.h
>  create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss_trace.h
>  create mode 100644 kernel/trace/rv/monitors/stall/Kconfig
>  create mode 100644 kernel/trace/rv/monitors/stall/stall.c
>  create mode 100644 kernel/trace/rv/monitors/stall/stall.h
>  create mode 100644 kernel/trace/rv/monitors/stall/stall_trace.h
>  create mode 100644 tools/verification/models/deadline/nomiss.dot
>  create mode 100644 tools/verification/models/stall.dot
>  create mode 100644 tools/verification/rvgen/rvgen/templates/dot2k/trace_hybrid.h
> 
> To: Steven Rostedt <rostedt@goodmis.org>
> Cc: Gabriele Monaco <gmonaco@redhat.com>
> Cc: Nam Cao <namcao@linutronix.de>
> Cc: Wander Lairson Costa <wander@redhat.com>


^ permalink raw reply

* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Petr Mladek @ 2026-04-08 11:43 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Song Liu, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
	rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
	yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
	bpf
In-Reply-To: <CALOAHbDG9mq1iJv5suct=cqJ+2r8VvJ-dXN=nuvMw0XYqnUjxA@mail.gmail.com>

On Wed 2026-04-08 10:40:10, Yafang Shao wrote:
> On Tue, Apr 7, 2026 at 11:08 PM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Tue 2026-04-07 17:45:31, Yafang Shao wrote:
> > > On Tue, Apr 7, 2026 at 11:16 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
> > > > >
> > > > > On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> > > > > [...]
> > > > > > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > > > > >   are replaceable.
> > > > > > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > > > > >   and are not replaceable.
> > > > > > > > >
> > > > > > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > > > > > livepatch. Is this the expected use case?
> > > > > > > >
> > > > > > > > That matches our expected use case.
> > > > > > >
> > > > > > > If we really want to serve use cases like this, I think we can introduce
> > > > > > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > > > > > Newly loaded livepatch will only replace existing livepatch with the
> > > > > > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > > > > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > > > > > replace tag.
> > > > > > >
> > > > > > > For current users of cumulative patches, all the livepatch will have the
> > > > > > > same tag, say 1. For your use case, you can assign each user a
> > > > > > > unique tag. Then all these users can do atomic upgrades of their
> > > > > > > own livepatches.
> > > > > > >
> > > > > > > We may also need to check whether two livepatches of different tags
> > > > > > > touch the same kernel function. When that happens, the later
> > > > > > > livepatch should fail to load.
> >
> > I am still thinking about how to make the hybrid mode more secure:
> >
> >     + The isolated sets of livepatched functions look like a good rule.
> >     + What about isolating the shadow variables/states as well?
> 
> We might consider extending the klp_shadow_* API to support the new
> livepatch tag.

It would be nice to associate shadow variables with states so that
we could check which shadow variables are used by each livepatch.

It is partially implemented in my earlier RFC, see
https://lore.kernel.org/all/20250115082431.5550-3-pmladek@suse.com/


> > > > That sounds like a viable solution. I'll look into it and see how we
> > > > can implement it.
> > >
> > > Does the following change look good to you ?
> > >
> > > Subject: [PATCH] livepatch: Support scoped atomic replace using replace tags
> > >
> > > Extend the replace attribute from a boolean to a u32 to act as a replace
> > > tag. This introduces the following semantics:
> > >
> > >   replace = 0: Atomic replace is disabled. However, this patch remains
> > >                eligible to be superseded by others.
> > >   replace > 0: Enables tagged replace (default is 1). A newly loaded
> > >                livepatch will only replace existing patches that share the
> > >                same tag.
> > >
> > > To maintain backward compatibility, a patch with replace == 0 does not
> > > trigger an outgoing atomic replace, but remains eligible to be superseded
> > > by any incoming patch with a valid replace tag.
> > >
> > > diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> > > index ba9e3988c07c..417c67a17b99 100644
> > > --- a/include/linux/livepatch.h
> > > +++ b/include/linux/livepatch.h
> > > @@ -123,7 +123,11 @@ struct klp_state {
> > >   * @mod:       reference to the live patch module
> > >   * @objs:      object entries for kernel objects to be patched
> > >   * @states:    system states that can get modified
> > > - * @replace:   replace all actively used patches
> > > + * @replace:   replace tag:
> > > + *             = 0: Atomic replace is disabled; however, this patch remains
> > > + *                  eligible to be superseded by others.
> >
> > This is weird semantic. Which livepatch tag would be allowed to
> > supersede it, please?
> >
> > Do we still need this category?
> 
> It can be superseded by any livepatch that has a non-zero tag set.

And this is exactly the weird thing.

A patch with the .replace flag set is supposed to obsolete all already
installed livepatches. It means that it should provide all existing
fixes and features.

Now, we want to introduce a replace flag/set which would allow a
livepatch to replace/obsolete only the livepatches with the same
tag/set number. And we want to prevent conflicts by making sure that
livepatches with different tag/set numbers never livepatch the same
function.

Obviously, livepatches with different tag/set numbers could not
obsolete the same no-replace livepatch. They would need to livepatch
the same functions touched by the no-replace livepatch and would
conflict.

So, I suggest removing the no-replace mode completely. It should
not be needed. A livepatch which should be installed in parallel
will simply use another unique tag/set number.
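
The rule described above can be sketched in plain userspace C. This is
only an illustration under stated assumptions, not the proposed kernel
implementation: struct demo_patch, its fields, and both helper names are
hypothetical and do not exist in the livepatch code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct klp_patch. */
struct demo_patch {
	unsigned int tag;		/* replace tag/set number (assumed field) */
	const char *const *funcs;	/* NULL-terminated names of patched functions */
};

/* A new livepatch obsoletes an installed one only when the tags match. */
static bool demo_replaces(const struct demo_patch *new_patch,
			  const struct demo_patch *old_patch)
{
	return new_patch->tag == old_patch->tag;
}

/* Livepatches with different tags must never patch the same function. */
static bool demo_conflicts(const struct demo_patch *a,
			   const struct demo_patch *b)
{
	size_t i, j;

	if (a->tag == b->tag)
		return false;	/* same set: this is a replacement, not a conflict */

	for (i = 0; a->funcs[i]; i++)
		for (j = 0; b->funcs[j]; j++)
			if (!strcmp(a->funcs[i], b->funcs[j]))
				return true;
	return false;
}
```

With only these two checks there is no separate no-replace case: a
livepatch that should stay installed in parallel just carries a tag that
no other livepatch uses.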

> This ensures backward compatibility: while a non-atomic-replace
> livepatch can be superseded by an atomic-replace one, the reverse is
> not permitted—an atomic-replace livepatch cannot be superseded by a
> non-atomic one.

IMHO, the backward compatibility would just create complexity and mess
in this case.

> > > + *             > 0: Atomic replace is enabled. Only existing patches with a
> > > + *                  matching replace tag will be superseded.
> > >   * @list:      list node for global list of actively used patches
> > >   * @kobj:      kobject for sysfs resources
> > >   * @obj_list:  dynamic list of the object entries
> > > @@ -137,7 +141,7 @@ struct klp_patch {
> > >         struct module *mod;
> > >         struct klp_object *objs;
> > >         struct klp_state *states;
> > > -       bool replace;
> > > +       unsigned int replace;
> >
> > This already breaks the backward compatibility
> 
> It doesn't break backward compatibility.

It does. Livepatches with .replace flag set would need to define:

	struct klp_patch patch = {
		.replace = <number>,
	};

instead of

	struct klp_patch patch = {
		.replace = true,
	};

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH] seq_buf: add KUnit tests for seq_buf_putmem_hex()
From: kernel test robot @ 2026-04-08 11:39 UTC (permalink / raw)
  To: Shuvam Pandey, Andrew Morton, Steven Rostedt
  Cc: llvm, oe-kbuild-all, Linux Memory Management List,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Shuvam Pandey
In-Reply-To: <20260406033728.25998-1-shuvampandey1@gmail.com>

Hi Shuvam,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-nonmm-unstable]
[also build test ERROR on akpm-mm/mm-everything linus/master v7.0-rc7 next-20260407]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shuvam-Pandey/seq_buf-add-KUnit-tests-for-seq_buf_putmem_hex/20260408-090623
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-nonmm-unstable
patch link:    https://lore.kernel.org/r/20260406033728.25998-1-shuvampandey1%40gmail.com
patch subject: [PATCH] seq_buf: add KUnit tests for seq_buf_putmem_hex()
config: arm-randconfig-002-20260408 (https://download.01.org/0day-ci/archive/20260408/202604081907.Ko14RCaV-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project c80443cd37b2e2788cba67ffa180a6331e5f0791)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260408/202604081907.Ko14RCaV-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604081907.Ko14RCaV-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm/probes/kprobes/test-kprobes.o
>> ERROR: modpost: "seq_buf_putmem_hex" [lib/tests/seq_buf_kunit.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [RFC v6 7/7] ext4: fast commit: export snapshot stats in fc_info
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Start consuming locked_ns in fc_info, so this patch intentionally moves
  lock_updates_ns_{total,max,samples} accounting here.
- Guard the tracepoint call with trace_ext4_fc_lock_updates_enabled() and
  use trace_call__ext4_fc_lock_updates() to avoid the double static_branch
  at the guarded call site.
- Keep the stats unconditional while avoiding extra tracepoint
  overhead when ext4_fc_lock_updates is disabled.
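
For readers following the accounting, here is a minimal userspace sketch
of how the total/max/samples counters yield the average and maximum
reported in fc_info. The struct and function names are hypothetical
stand-ins; only the arithmetic mirrors the patch below (guarded division,
then nanoseconds to microseconds).

```c
#include <stdint.h>

/* Hypothetical stand-in for the lock_updates_* fields of the new stats. */
struct demo_lock_stats {
	uint64_t ns_total;	/* sum of all locked windows, in ns */
	uint64_t ns_max;	/* longest locked window seen, in ns */
	uint64_t samples;	/* number of locked windows recorded */
};

/* Record one updates-locked window. Best effort: no locking, as in the patch. */
static void demo_lock_stats_add(struct demo_lock_stats *s, uint64_t locked_ns)
{
	s->ns_total += locked_ns;
	s->samples++;
	if (locked_ns > s->ns_max)
		s->ns_max = locked_ns;
}

/* Average window in microseconds; 0 when nothing has been recorded yet. */
static uint64_t demo_lock_stats_avg_us(const struct demo_lock_stats *s)
{
	if (!s->samples)
		return 0;
	return (s->ns_total / s->samples) / 1000;
}
```

The zero-sample guard matters because fc_info can be read before the
first fast commit has run.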

 fs/ext4/ext4.h        | 31 +++++++++++++++++++
 fs/ext4/fast_commit.c | 72 +++++++++++++++++++++++++++++++++++++------
 fs/ext4/super.c       |  1 +
 3 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1ff6ea1bde3e..c9ed7ceca982 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1554,6 +1554,36 @@ struct ext4_orphan_info {
 						 * file blocks */
 };
 
+/*
+ * Ext4 fast commit snapshot statistics.
+ *
+ * These are best-effort counters intended for debugging / performance
+ * introspection; they are not exact under concurrent updates.
+ */
+struct ext4_fc_snap_stats {
+	u64 lock_updates_ns_total;
+	u64 lock_updates_ns_max;
+	u64 lock_updates_samples;
+
+	u64 snap_inodes;
+	u64 snap_ranges;
+
+	u64 snap_fail_es_miss;
+	u64 snap_fail_es_delayed;
+	u64 snap_fail_es_other;
+
+	u64 snap_fail_inodes_cap;
+	u64 snap_fail_ranges_cap;
+	u64 snap_fail_nomem;
+	u64 snap_fail_inode_loc;
+
+	/*
+	 * Missing inode snapshots during log writing should never happen.
+	 * Keep this counter to help catch unexpected regressions.
+	 */
+	u64 snap_fail_no_snap;
+};
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1828,6 +1858,7 @@ struct ext4_sb_info {
 	struct mutex s_fc_lock;
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
+	struct ext4_fc_snap_stats s_fc_snap_stats;
 	tid_t s_fc_ineligible_tid;
 #ifdef CONFIG_EXT4_DEBUG
 	int s_fc_debug_max_replay;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 6ac6ebe79d7b..3c6ace2b0b94 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -890,13 +890,17 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	int inode_len;
 	int ret;
 
-	if (!snap)
+	if (!snap) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	src = snap->inode_buf;
 	inode_len = snap->inode_len;
-	if (!src || inode_len == 0)
+	if (!src || inode_len == 0) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -931,8 +935,10 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 	struct ext4_extent *ex;
 	struct ext4_fc_range *range;
 
-	if (!snap)
+	if (!snap) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	list_for_each_entry(range, &snap->data_list, list) {
 		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
@@ -993,6 +999,8 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
 	unsigned int nr_ranges = 0;
 
@@ -1019,11 +1027,13 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		ext4_lblk_t len;
 
 		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			stats->snap_fail_es_miss++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
 		}
 
 		if (ext4_es_is_delayed(&es)) {
+			stats->snap_fail_es_delayed++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
@@ -1038,6 +1048,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		}
 
 		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			stats->snap_fail_ranges_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
@@ -1045,6 +1056,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range) {
+			stats->snap_fail_nomem++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
 		}
@@ -1072,6 +1084,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			stats->snap_fail_es_other++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
@@ -1092,6 +1105,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
@@ -1102,6 +1117,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
 	if (ret) {
+		stats->snap_fail_inode_loc++;
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
 	}
@@ -1113,6 +1129,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		stats->snap_fail_nomem++;
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
@@ -1137,6 +1154,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	stats->snap_inodes++;
+	stats->snap_ranges += nr_ranges;
 	if (nr_rangesp)
 		*nr_rangesp = nr_ranges;
 	return 0;
@@ -1246,6 +1265,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			sbi->s_fc_snap_stats.snap_fail_inodes_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1271,6 +1291,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			sbi->s_fc_snap_stats.snap_fail_inodes_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1314,6 +1335,7 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
@@ -1376,8 +1398,13 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 		return ret;
 
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
-	if (ret)
+	if (ret) {
+		if (ret == -E2BIG)
+			snap_stats->snap_fail_inodes_cap++;
+		else if (ret == -ENOMEM)
+			snap_stats->snap_fail_nomem++;
 		return ret;
+	}
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
@@ -1398,12 +1425,15 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
 				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
-	if (trace_ext4_fc_lock_updates_enabled()) {
-		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
-		trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
-					   snap_inodes, snap_ranges, ret,
-					   snap_err);
-	}
+	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	snap_stats->lock_updates_ns_total += locked_ns;
+	snap_stats->lock_updates_samples++;
+	if (locked_ns > snap_stats->lock_updates_ns_max)
+		snap_stats->lock_updates_ns_max = locked_ns;
+	if (trace_ext4_fc_lock_updates_enabled())
+		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+						 snap_inodes, snap_ranges,
+						 ret, snap_err);
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -2704,11 +2734,17 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 {
 	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
 	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
+	u64 lock_avg_ns = 0;
 	int i;
 
 	if (v != SEQ_START_TOKEN)
 		return 0;
 
+	if (snap_stats->lock_updates_samples)
+		lock_avg_ns = div_u64(snap_stats->lock_updates_ns_total,
+				      snap_stats->lock_updates_samples);
+
 	seq_printf(seq,
 		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
 		   stats->fc_num_commits, stats->fc_ineligible_commits,
@@ -2719,6 +2755,22 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
 			stats->fc_ineligible_reason_count[i]);
 
+	seq_printf(seq,
+		   "Snapshot stats:\n%llu inodes\n%llu ranges\n%lluus lock_updates_avg\n%lluus lock_updates_max\n",
+		   snap_stats->snap_inodes, snap_stats->snap_ranges,
+		   div_u64(lock_avg_ns, 1000),
+		   div_u64(snap_stats->lock_updates_ns_max, 1000));
+	seq_printf(seq,
+		   "Snapshot failures:\n%llu es_miss\n%llu es_delayed\n%llu es_other\n%llu inodes_cap\n%llu ranges_cap\n%llu nomem\n%llu inode_loc\n%llu no_snap\n",
+		   snap_stats->snap_fail_es_miss,
+		   snap_stats->snap_fail_es_delayed,
+		   snap_stats->snap_fail_es_other,
+		   snap_stats->snap_fail_inodes_cap,
+		   snap_stats->snap_fail_ranges_cap,
+		   snap_stats->snap_fail_nomem,
+		   snap_stats->snap_fail_inode_loc,
+		   snap_stats->snap_fail_no_snap);
+
 	return 0;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 286f05834900..9ae68a223ea6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4538,6 +4538,7 @@ static void ext4_fast_commit_init(struct super_block *sb)
 	sbi->s_fc_ineligible_tid = 0;
 	mutex_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	memset(&sbi->s_fc_snap_stats, 0, sizeof(sbi->s_fc_snap_stats));
 	sbi->s_fc_replay_state.fc_regions = NULL;
 	sbi->s_fc_replay_state.fc_regions_size = 0;
 	sbi->s_fc_replay_state.fc_regions_used = 0;
-- 
2.53.0

^ permalink raw reply related

* [RFC v6 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-ext4, linux-kernel,
	linux-trace-kernel
  Cc: Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Drop explicit ext4_fc_snap_err assignments and rely on enum
  auto-increment.
- Treat locked_ns as trace-only in this patch and calculate it only when
  ext4_fc_lock_updates is enabled, as suggested by Steven Rostedt.
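
The "first snapshot failure reason" behaviour can be shown with a small
userspace sketch. It mirrors the shape of ext4_fc_set_snap_err() in the
diff below, but the enum and function names here are hypothetical and
cover only a subset of the reasons.

```c
#include <stddef.h>

/* Hypothetical subset of the failure reasons, auto-numbered from 0. */
enum demo_snap_err {
	DEMO_SNAP_ERR_NONE = 0,
	DEMO_SNAP_ERR_ES_MISS,
	DEMO_SNAP_ERR_NOMEM,
};

/*
 * Record a failure reason only if none has been recorded yet, so the
 * value seen by the caller is always the first failure. A NULL
 * pointer means the caller does not want the reason.
 */
static void demo_set_snap_err(int *snap_err, int err)
{
	if (snap_err && *snap_err == DEMO_SNAP_ERR_NONE)
		*snap_err = err;
}
```

Later failures on the same commit leave the recorded value untouched,
which is what keeps the snap_err field stable for tooling.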

 fs/ext4/ext4.h              | 15 ++++++++
 fs/ext4/fast_commit.c       | 74 +++++++++++++++++++++++++++++--------
 include/trace/events/ext4.h | 61 ++++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 13fe4fdf9bda..1ff6ea1bde3e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1028,6 +1028,21 @@ enum {
 
 struct ext4_fc_inode_snap;
 
+/*
+ * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
+ * Keep these stable for tooling.
+ */
+enum ext4_fc_snap_err {
+	EXT4_FC_SNAP_ERR_NONE = 0,
+	EXT4_FC_SNAP_ERR_ES_MISS,
+	EXT4_FC_SNAP_ERR_ES_DELAYED,
+	EXT4_FC_SNAP_ERR_ES_OTHER,
+	EXT4_FC_SNAP_ERR_INODES_CAP,
+	EXT4_FC_SNAP_ERR_RANGES_CAP,
+	EXT4_FC_SNAP_ERR_NOMEM,
+	EXT4_FC_SNAP_ERR_INODE_LOC,
+};
+
 /*
  * fourth extended file system inode data in memory
  */
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index ab751b855afa..6ac6ebe79d7b 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -193,6 +193,12 @@ static struct kmem_cache *ext4_fc_range_cachep;
 #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
 #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
+static inline void ext4_fc_set_snap_err(int *snap_err, int err)
+{
+	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
+		*snap_err = err;
+}
+
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
 	BUFFER_TRACE(bh, "");
@@ -983,11 +989,12 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       struct list_head *ranges,
 				       unsigned int nr_ranges_total,
-				       unsigned int *nr_rangesp)
+				       unsigned int *nr_rangesp,
+				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	unsigned int nr_ranges = 0;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
@@ -1011,11 +1018,16 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		struct ext4_fc_range *range;
 		ext4_lblk_t len;
 
-		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
+		}
 
-		if (ext4_es_is_delayed(&es))
+		if (ext4_es_is_delayed(&es)) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
+		}
 
 		len = es.es_len - (cur_lblk - es.es_lblk);
 		if (len > end_lblk - cur_lblk + 1)
@@ -1025,12 +1037,17 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 			continue;
 		}
 
-		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
+		}
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
-		if (!range)
+		if (!range) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
+		}
 		nr_ranges++;
 
 		range->lblk = cur_lblk;
@@ -1055,6 +1072,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
 
@@ -1071,7 +1089,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int nr_ranges_total,
-				  unsigned int *nr_rangesp)
+				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
@@ -1083,8 +1101,10 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	int alloc_ctx;
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
-	if (ret)
+	if (ret) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
+	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
 		inode_len = EXT4_INODE_SIZE(inode->i_sb);
@@ -1093,6 +1113,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
 	}
@@ -1103,7 +1124,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	brelse(iloc.bh);
 
 	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
-					  &nr_ranges);
+					  &nr_ranges, snap_err);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1204,7 +1225,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
 					 unsigned int *nr_inodesp);
 
 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
-				   unsigned int inodes_size)
+				   unsigned int inodes_size,
+				   unsigned int *nr_inodesp,
+				   unsigned int *nr_rangesp,
+				   int *snap_err)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1222,6 +1246,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1245,6 +1271,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1269,16 +1297,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 		unsigned int inode_ranges = 0;
 
 		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
-					     &inode_ranges);
+					     &inode_ranges, snap_err);
 		if (ret)
 			break;
 		nr_ranges += inode_ranges;
 	}
 
+	if (nr_inodesp)
+		*nr_inodesp = i;
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return ret;
 }
 
-static int ext4_fc_perform_commit(journal_t *journal)
+static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1287,10 +1319,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct inode *inode;
 	struct inode **inodes;
 	unsigned int inodes_size;
+	unsigned int snap_inodes = 0;
+	unsigned int snap_ranges = 0;
+	int snap_err = EXT4_FC_SNAP_ERR_NONE;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
+	ktime_t lock_start;
+	u64 locked_ns;
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1338,13 +1375,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	if (ret)
 		return ret;
 
-
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
 	if (ret)
 		return ret;
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
+	lock_start = ktime_get();
 	/*
 	 * The journal is now locked. No more handles can start and all the
 	 * previous handles are now drained. Snapshotting happens in this
@@ -1358,8 +1395,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
+				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
+	if (trace_ext4_fc_lock_updates_enabled()) {
+		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+		trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+					   snap_inodes, snap_ranges, ret,
+					   snap_err);
+	}
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -1564,7 +1608,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
-	ret = ext4_fc_perform_commit(journal);
+	ret = ext4_fc_perform_commit(journal, commit_tid);
 	if (ret < 0) {
 		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
 			status = EXT4_FC_STATUS_INELIGIBLE;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f493642cf121..7028a28316fa 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
+#undef EM
+#undef EMe
+#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+
+#define TRACE_SNAP_ERR						\
+	EM(NONE)						\
+	EM(ES_MISS)						\
+	EM(ES_DELAYED)						\
+	EM(ES_OTHER)						\
+	EM(INODES_CAP)						\
+	EM(RANGES_CAP)						\
+	EM(NOMEM)						\
+	EMe(INODE_LOC)
+
+TRACE_SNAP_ERR
+
+#undef EM
+#undef EMe
+
 #define show_fc_reason(reason)						\
 	__print_symbolic(reason,					\
 		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
@@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
 		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 );
 
+#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
+#define EMe(a)	{ EXT4_FC_SNAP_ERR_##a, #a }
+
+TRACE_EVENT(ext4_fc_lock_updates,
+	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
+		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
+		     int snap_err),
+
+	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
+
+	TP_STRUCT__entry(/* entry */
+		__field(dev_t, dev)
+		__field(tid_t, tid)
+		__field(u64, locked_ns)
+		__field(unsigned int, nr_inodes)
+		__field(unsigned int, nr_ranges)
+		__field(int, err)
+		__field(int, snap_err)
+	),
+
+	TP_fast_assign(/* assign */
+		__entry->dev = sb->s_dev;
+		__entry->tid = commit_tid;
+		__entry->locked_ns = locked_ns;
+		__entry->nr_inodes = nr_inodes;
+		__entry->nr_ranges = nr_ranges;
+		__entry->err = err;
+		__entry->snap_err = snap_err;
+	),
+
+	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
+		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
+		  __entry->err, __print_symbolic(__entry->snap_err,
+						 TRACE_SNAP_ERR))
+);
+
+#undef EM
+#undef EMe
+#undef TRACE_SNAP_ERR
+
 #define FC_REASON_NAME_STAT(reason)					\
 	show_fc_reason(reason),						\
 	__entry->fc_ineligible_rc[reason]
-- 
2.53.0

^ permalink raw reply related

* [RFC v6 5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c | 253 +++++++++++++++++++++++++++++-------------
 1 file changed, 177 insertions(+), 76 deletions(-)
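
As a reading aid for the range-building loop, the length computation can
be sketched in userspace C. The struct and function names are
hypothetical. The sketch assumes the cached extent covers cur and that
cur <= end. Unlike the patch, which applies the per-extent maximum only
to ADD_RANGE records, the sketch applies it to every range for brevity.

```c
#include <stdint.h>

/* Hypothetical stand-in for the fields used from struct extent_status. */
struct demo_extent {
	uint32_t lblk;	/* first logical block covered by the cached extent */
	uint32_t len;	/* number of blocks covered */
};

/*
 * Length of the range starting at cur: the remainder of the cached
 * extent, clamped to the tracked range [cur, end] and to max blocks.
 */
static uint32_t demo_range_len(const struct demo_extent *es,
			       uint32_t cur, uint32_t end, uint32_t max)
{
	uint32_t len = es->len - (cur - es->lblk);

	if (len > end - cur + 1)
		len = end - cur + 1;
	if (len > max)
		len = max;
	return len;
}
```

Because each range is cut at the end of one cached extent, a heavily
fragmented file produces many ranges, which is why the total is capped.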

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index f28e732e9be7..ab751b855afa 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -183,6 +183,15 @@
 
 #include <trace/events/ext4.h>
 static struct kmem_cache *ext4_fc_dentry_cachep;
+static struct kmem_cache *ext4_fc_range_cachep;
+
+/*
+ * Avoid spending unbounded time/memory snapshotting highly fragmented files
+ * under jbd2_journal_lock_updates(). If we exceed this limit, fall back to
+ * full commit.
+ */
+#define EXT4_FC_SNAPSHOT_MAX_INODES	1024
+#define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
@@ -954,7 +963,7 @@ static void ext4_fc_free_ranges(struct list_head *head)
 
 	list_for_each_entry_safe(range, range_n, head, list) {
 		list_del(&range->list);
-		kfree(range);
+		kmem_cache_free(ext4_fc_range_cachep, range);
 	}
 }
 
@@ -972,16 +981,19 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 }
 
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
-				       struct list_head *ranges)
+				       struct list_head *ranges,
+				       unsigned int nr_ranges_total,
+				       unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
-	struct ext4_map_blocks map;
-	int ret;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
 		spin_unlock(&ei->i_fc_lock);
+		if (nr_rangesp)
+			*nr_rangesp = 0;
 		return 0;
 	}
 	start_lblk = ei->i_fc_lblk_start;
@@ -995,61 +1007,78 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		   (unsigned long long)inode->i_ino);
 
 	while (cur_lblk <= end_lblk) {
+		struct extent_status es;
 		struct ext4_fc_range *range;
+		ext4_lblk_t len;
 
-		map.m_lblk = cur_lblk;
-		map.m_len = end_lblk - cur_lblk + 1;
-		ret = ext4_map_blocks(NULL, inode, &map,
-				      EXT4_GET_BLOCKS_IO_SUBMIT |
-				      EXT4_EX_NOCACHE);
-		if (ret < 0)
-			return -ECANCELED;
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+			return -EAGAIN;
+
+		if (ext4_es_is_delayed(&es))
+			return -EAGAIN;
 
-		if (map.m_len == 0) {
+		len = es.es_len - (cur_lblk - es.es_lblk);
+		if (len > end_lblk - cur_lblk + 1)
+			len = end_lblk - cur_lblk + 1;
+		if (len == 0) {
 			cur_lblk++;
 			continue;
 		}
 
-		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+			return -E2BIG;
+
+		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range)
 			return -ENOMEM;
+		nr_ranges++;
 
-		range->lblk = map.m_lblk;
-		range->len = map.m_len;
+		range->lblk = cur_lblk;
+		range->len = len;
 		range->pblk = 0;
 		range->unwritten = false;
 
-		if (ret == 0) {
+		if (ext4_es_is_hole(&es)) {
 			range->tag = EXT4_FC_TAG_DEL_RANGE;
-		} else {
-			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
-				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
-
-			/* Limit the number of blocks in one extent */
-			map.m_len = min(max, map.m_len);
+		} else if (ext4_es_is_written(&es) ||
+			   ext4_es_is_unwritten(&es)) {
+			unsigned int max;
 
 			range->tag = EXT4_FC_TAG_ADD_RANGE;
-			range->len = map.m_len;
-			range->pblk = map.m_pblk;
-			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
+			range->pblk = ext4_es_pblock(&es) +
+				      (cur_lblk - es.es_lblk);
+			range->unwritten = ext4_es_is_unwritten(&es);
+
+			max = range->unwritten ? EXT_UNWRITTEN_MAX_LEN :
+						 EXT_INIT_MAX_LEN;
+			if (range->len > max)
+				range->len = max;
+		} else {
+			kmem_cache_free(ext4_fc_range_cachep, range);
+			return -EAGAIN;
 		}
 
 		INIT_LIST_HEAD(&range->list);
 		list_add_tail(&range->list, ranges);
 
-		cur_lblk += map.m_len;
+		cur_lblk += range->len;
 	}
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-static int ext4_fc_snapshot_inode(struct inode *inode)
+static int ext4_fc_snapshot_inode(struct inode *inode,
+				  unsigned int nr_ranges_total,
+				  unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
 	LIST_HEAD(ranges);
+	unsigned int nr_ranges = 0;
 	int ret;
 	int alloc_ctx;
 
@@ -1073,7 +1102,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
 	brelse(iloc.bh);
 
-	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
+					  &nr_ranges);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1086,10 +1116,11 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
 {
@@ -1168,49 +1199,32 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
-static int ext4_fc_snapshot_inodes(journal_t *journal)
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp);
+
+static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
+				   unsigned int inodes_size)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_inode_info *iter;
 	struct ext4_fc_dentry_update *fc_dentry;
-	struct inode **inodes;
-	unsigned int nr_inodes = 0;
 	unsigned int i = 0;
+	unsigned int idx;
+	unsigned int nr_ranges = 0;
 	int ret = 0;
 	int alloc_ctx;
 
-	alloc_ctx = ext4_fc_lock(sb);
-	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
-		nr_inodes++;
-
-	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
-		struct ext4_inode_info *ei;
-
-		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
-			continue;
-		if (list_empty(&fc_dentry->fcd_dilist))
-			continue;
-
-		/* See the comment in ext4_fc_commit_dentry_updates(). */
-		ei = list_first_entry(&fc_dentry->fcd_dilist,
-				      struct ext4_inode_info, i_fc_dilist);
-		if (!list_empty(&ei->i_fc_list))
-			continue;
-
-		nr_inodes++;
-	}
-	ext4_fc_unlock(sb, alloc_ctx);
-
-	if (!nr_inodes)
+	if (!inodes_size)
 		return 0;
 
-	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
-	if (!inodes)
-		return -ENOMEM;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		inodes[i++] = &iter->vfs_inode;
 	}
 
@@ -1230,6 +1244,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		/*
 		 * Create-only inodes may only be referenced via fcd_dilist and
 		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
@@ -1241,15 +1259,22 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 		inodes[i++] = inode;
 	}
+unlock:
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+	if (ret)
+		return ret;
+
+	for (idx = 0; idx < i; idx++) {
+		unsigned int inode_ranges = 0;
+
+		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
+					     &inode_ranges);
 		if (ret)
 			break;
+		nr_ranges += inode_ranges;
 	}
 
-	kvfree(inodes);
 	return ret;
 }
 
@@ -1260,6 +1285,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
+	struct inode **inodes;
+	unsigned int inodes_size;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
@@ -1312,6 +1339,10 @@ static int ext4_fc_perform_commit(journal_t *journal)
 		return ret;
 
 
+	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
+	if (ret)
+		return ret;
+
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
 	/*
@@ -1327,8 +1358,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
 	jbd2_journal_unlock_updates(journal);
+	kvfree(inodes);
 	if (ret)
 		return ret;
 
@@ -1384,6 +1416,64 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	return ret;
 }
 
+static unsigned int ext4_fc_count_snapshot_inodes(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	unsigned int nr_inodes = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	return nr_inodes;
+}
+
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp)
+{
+	unsigned int nr_inodes = ext4_fc_count_snapshot_inodes(sb);
+	struct inode **inodes;
+
+	*inodesp = NULL;
+	*nr_inodesp = 0;
+
+	if (!nr_inodes)
+		return 0;
+
+	if (nr_inodes > EXT4_FC_SNAPSHOT_MAX_INODES)
+		return -E2BIG;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	*inodesp = inodes;
+	*nr_inodesp = nr_inodes;
+	return 0;
+}
+
 static void ext4_fc_update_stats(struct super_block *sb, int status,
 				 u64 commit_time, int nblks, tid_t commit_tid)
 {
@@ -1476,7 +1566,10 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
 	ret = ext4_fc_perform_commit(journal);
 	if (ret < 0) {
-		status = EXT4_FC_STATUS_FAILED;
+		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
+			status = EXT4_FC_STATUS_INELIGIBLE;
+		else
+			status = EXT4_FC_STATUS_FAILED;
 		goto fallback;
 	}
 	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
@@ -1560,34 +1653,35 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
 		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
-					     struct ext4_fc_dentry_update,
-					     fcd_list);
+						 struct ext4_fc_dentry_update,
+						 fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
 		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
-		    !list_empty(&fc_dentry->fcd_dilist)) {
+			!list_empty(&fc_dentry->fcd_dilist)) {
 			/* See the comment in ext4_fc_commit_dentry_updates(). */
 			ei = list_first_entry(&fc_dentry->fcd_dilist,
-					      struct ext4_inode_info,
-					      i_fc_dilist);
+						  struct ext4_inode_info,
+						  i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
 			spin_lock(&ei->i_fc_lock);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_REQUEUE);
+						   EXT4_STATE_FC_REQUEUE);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_COMMITTING);
+						   EXT4_STATE_FC_COMMITTING);
 			spin_unlock(&ei->i_fc_lock);
 			/*
 			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
-			 * visible before we send the wakeup. Pairs with implicit
-			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 * visible before we send the wakeup. Pairs with
+			 * implicit barrier in prepare_to_wait() in
+			 * ext4_fc_del().
 			 */
 			smp_mb();
 #if (BITS_PER_LONG < 64)
 			wake_up_bit(&ei->i_state_flags,
-				    EXT4_STATE_FC_COMMITTING);
+					EXT4_STATE_FC_COMMITTING);
 #else
 			wake_up_bit(&ei->i_flags,
-				    EXT4_STATE_FC_COMMITTING);
+					EXT4_STATE_FC_COMMITTING);
 #endif
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
@@ -2589,13 +2683,20 @@ int __init ext4_fc_init_dentry_cache(void)
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
 					   SLAB_RECLAIM_ACCOUNT);
 
-	if (ext4_fc_dentry_cachep == NULL)
+	if (!ext4_fc_dentry_cachep)
 		return -ENOMEM;
 
+	ext4_fc_range_cachep = KMEM_CACHE(ext4_fc_range, SLAB_RECLAIM_ACCOUNT);
+	if (!ext4_fc_range_cachep) {
+		kmem_cache_destroy(ext4_fc_dentry_cachep);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void ext4_fc_destroy_dentry_cache(void)
 {
+	kmem_cache_destroy(ext4_fc_range_cachep);
 	kmem_cache_destroy(ext4_fc_dentry_cachep);
 }
-- 
2.53.0



* [RFC v6 4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.

Signed-off-by: Li Chen <me@linux.beauty>
---
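Illustration only, not part of the patch: a minimal single-threaded
userspace model of the pinning scheme described above. All names here
(inode_model, fc_collect(), fc_try_evict(), fc_cleanup_model()) are
hypothetical. The model folds both places where the bit gets set into one
function, and it returns false where the real ext4_fc_del() would sleep on
a bit waitqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; only the state needed here. */
struct inode_model {
	bool fc_committing;	/* models EXT4_STATE_FC_COMMITTING */
	bool evicted;
};

/*
 * Snapshot side: record bare pointers and mark each inode as committing.
 * No reference is taken, so there is no matching "put" that could start
 * an eviction from the commit thread.
 */
static size_t fc_collect(struct inode_model **queue, size_t n,
			 struct inode_model **out, size_t out_size)
{
	size_t used = 0;

	for (size_t i = 0; i < n && used < out_size; i++) {
		queue[i]->fc_committing = true;
		out[used++] = queue[i];
	}
	return used;
}

/*
 * Eviction side: the real code waits for the bit to clear. The model
 * only reports whether eviction could proceed right now.
 */
static bool fc_try_evict(struct inode_model *inode)
{
	if (inode->fc_committing)
		return false;
	inode->evicted = true;
	return true;
}

/* Cleanup side: clear the bit (the real code also wakes up waiters). */
static void fc_cleanup_model(struct inode_model **inodes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		inodes[i]->fc_committing = false;
}

/* Usage: eviction is held off between collect and cleanup. */
static void fc_pin_demo(void)
{
	struct inode_model a = { false, false };
	struct inode_model *queue[] = { &a };
	struct inode_model *snap[1];
	size_t n = fc_collect(queue, 1, snap, 1);

	assert(n == 1 && !fc_try_evict(&a));
	fc_cleanup_model(snap, n);
	assert(fc_try_evict(&a) && a.evicted);
}
```

The point of the model is that the pointers in snap[] stay valid only
because eviction cannot complete while the bit is set, which is why
cleanup has to clear the bit for every inode that was marked.
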
 fs/ext4/fast_commit.c | 47 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 5ed884cc4b5c..f28e732e9be7 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1211,13 +1211,12 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
-		inodes[i] = igrab(&iter->vfs_inode);
-		if (inodes[i])
-			i++;
+		inodes[i++] = &iter->vfs_inode;
 	}
 
 	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
 		struct ext4_inode_info *ei;
+		struct inode *inode;
 
 		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
 			continue;
@@ -1227,12 +1226,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		/* See the comment in ext4_fc_commit_dentry_updates(). */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				      struct ext4_inode_info, i_fc_dilist);
+		inode = &ei->vfs_inode;
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
-		inodes[i] = igrab(&ei->vfs_inode);
-		if (inodes[i])
-			i++;
+		/*
+		 * Create-only inodes may only be referenced via fcd_dilist and
+		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
+		 * we are snapshotting, but inode eviction calls ext4_fc_del(),
+		 * which waits for FC_COMMITTING to clear. Mark them FC_COMMITTING
+		 * so the inode stays pinned and the snapshot stays valid until
+		 * ext4_fc_cleanup().
+		 */
+		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+		inodes[i++] = inode;
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
@@ -1242,10 +1249,6 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 			break;
 	}
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		if (inodes[nr_inodes])
-			iput(inodes[nr_inodes]);
-	}
 	kvfree(inodes);
 	return ret;
 }
@@ -1313,8 +1316,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	jbd2_journal_lock_updates(journal);
 	/*
 	 * The journal is now locked. No more handles can start and all the
-	 * previous handles are now drained. We now mark the inodes on the
-	 * commit queue as being committed.
+	 * previous handles are now drained. Snapshotting happens in this
+	 * window so log writing can consume only stable snapshots without
+	 * doing logical-to-physical mapping.
 	 */
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
@@ -1566,6 +1570,25 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					      struct ext4_inode_info,
 					      i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
+			spin_lock(&ei->i_fc_lock);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_REQUEUE);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_COMMITTING);
+			spin_unlock(&ei->i_fc_lock);
+			/*
+			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+			 * visible before we send the wakeup. Pairs with implicit
+			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 */
+			smp_mb();
+#if (BITS_PER_LONG < 64)
+			wake_up_bit(&ei->i_state_flags,
+				    EXT4_STATE_FC_COMMITTING);
+#else
+			wake_up_bit(&ei->i_flags,
+				    EXT4_STATE_FC_COMMITTING);
+#endif
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
-- 
2.53.0



* [RFC v6 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.

Testing: traced ext4:ext4_fc_commit_stop across two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).

Signed-off-by: Li Chen <me@linux.beauty>
---
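Illustration only, not part of the patch: the requeue decision that
ext4_fc_cleanup() makes after this change, reduced to a pure userspace
function. struct fc_inode_model and fc_should_requeue() are hypothetical
names. tid_geq() is modeled on the jbd2 helper of the same name, and the
sketch assumes a 32-bit tid_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;		/* assumed width of jbd2's tid_t */

/* Wraparound-safe "x >= y", modeled on jbd2's tid_geq(). */
static bool tid_geq(tid_t x, tid_t y)
{
	return (int32_t)(x - y) >= 0;
}

/* Hypothetical stand-in for the per-inode state the decision reads. */
struct fc_inode_model {
	tid_t i_sync_tid;	/* tid of the last tracked update */
	bool fc_requeue;	/* models EXT4_STATE_FC_REQUEUE */
};

/*
 * Full commit: requeue if the inode was updated in a transaction newer
 * than the one that was committed.
 * Fast commit: tid is 0 and carries no information, so use the flag
 * that the tracking path set while the commit was running.
 */
static bool fc_should_requeue(const struct fc_inode_model *ei, bool full,
			      tid_t tid)
{
	if (full)
		return !tid_geq(tid, ei->i_sync_tid);
	return ei->fc_requeue;
}

/* Usage: the same inode under a fast commit and under a full commit. */
static void fc_requeue_demo(void)
{
	struct fc_inode_model idle = { .i_sync_tid = 7, .fc_requeue = false };
	struct fc_inode_model busy = { .i_sync_tid = 7, .fc_requeue = true };

	assert(!fc_should_requeue(&idle, false, 0));	/* untouched inode */
	assert(fc_should_requeue(&busy, false, 0));	/* modified inode */
	assert(!fc_should_requeue(&idle, true, 7));	/* tid caught up */
	assert(fc_should_requeue(&idle, true, 6));	/* newer update */
}
```

With tid == 0, tid_geq(0, 7) is false, so a tid comparison alone cannot
tell an untouched inode from a modified one. The flag supplies that
distinction on the fast commit path.
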
 fs/ext4/ext4.h        |   1 +
 fs/ext4/fast_commit.c | 111 ++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 66de888ae411..13fe4fdf9bda 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1995,6 +1995,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_FC_REQUEUE,		/* Inode modified during fast commit */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index dc19795dacdd..5ed884cc4b5c 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -61,9 +61,8 @@
  *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
  *     needed for log writing.
  * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
- *     starting of new handles. If new handles try to start an update on
- *     any of the inodes that are being committed, ext4_fc_track_inode()
- *     will block until those inodes have finished the fast commit.
+ *     starting of new handles. Updates to inodes being fast committed are
+ *     tracked for requeue rather than blocking.
  * [6] Commit all the directory entry updates in the fast commit space.
  * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
@@ -217,6 +216,7 @@ void ext4_fc_init_inode(struct inode *inode)
 
 	ext4_fc_reset_inode(inode);
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	ei->i_fc_snap = NULL;
@@ -251,22 +251,30 @@ void ext4_fc_del(struct inode *inode)
 	}
 
 	/*
-	 * Since ext4_fc_del is called from ext4_evict_inode while having a
-	 * handle open, there is no need for us to wait here even if a fast
-	 * commit is going on. That is because, if this inode is being
-	 * committed, ext4_mark_inode_dirty would have waited for inode commit
-	 * operation to finish before we come here. So, by the time we come
-	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
-	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
-	 * here.
-	 *
-	 * We may come here without any handles open in the "no_delete" case of
-	 * ext4_evict_inode as well. However, if that happens, we first mark the
-	 * file system as fast commit ineligible anyway. So, even in that case,
-	 * it is okay to remove the inode from the fc list.
+	 * Wait for ongoing fast commit to finish. We cannot remove the inode
+	 * from fast commit lists while it is being committed.
 	 */
-	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
-		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			ext4_fc_unlock(inode->i_sb, alloc_ctx);
+			schedule();
+			alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
 #if (BITS_PER_LONG < 64)
 		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
@@ -287,19 +295,22 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+
 	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
-	 * Since this inode is getting removed, let's also remove all FC
-	 * dentry create references, since it is not needed to log it anyways.
+	 * Since this inode is getting removed, let's also remove all FC dentry
+	 * create references, since it is not needed to log it anyways.
 	 */
 	if (list_empty(&ei->i_fc_dilist)) {
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
 
-	fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist);
+	fc_dentry = list_first_entry(&ei->i_fc_dilist,
+				     struct ext4_fc_dentry_update,
+				     fcd_dilist);
 	WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT);
 	list_del_init(&fc_dentry->fcd_list);
 	list_del_init(&fc_dentry->fcd_dilist);
@@ -371,6 +382,8 @@ static int ext4_fc_track_template(
 
 	tid = handle->h_transaction->t_tid;
 	spin_lock(&ei->i_fc_lock);
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -557,8 +570,6 @@ static int __track_inode(handle_t *handle, struct inode *inode, void *arg,
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	wait_queue_head_t *wq;
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -577,29 +588,11 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		return;
 
 	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
+	 * Fast commit snapshots inode state at commit time, so there's no need
+	 * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already
+	 * on the commit queue, ext4_fc_cleanup() will requeue it for the new
+	 * transaction once the current commit finishes.
 	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-#if (BITS_PER_LONG < 64)
-		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_state_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#else
-		DEFINE_WAIT_BIT(wait, &ei->i_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#endif
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
-			schedule();
-		finish_wait(wq, &wait.wq_entry);
-	}
 
 	/*
 	 * From this point on, this inode will not be committed either
@@ -1526,32 +1519,32 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) {
+		bool requeue;
+
 		ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN],
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
 		ext4_fc_free_inode_snap(&ei->vfs_inode);
+		spin_lock(&ei->i_fc_lock);
+		if (full)
+			requeue = !tid_geq(tid, ei->i_sync_tid);
+		else
+			requeue = ext4_test_inode_state(&ei->vfs_inode,
+							EXT4_STATE_FC_REQUEUE);
+		if (!requeue)
+			ext4_fc_reset_inode(&ei->vfs_inode);
+		ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
-		if (tid_geq(tid, ei->i_sync_tid)) {
-			ext4_fc_reset_inode(&ei->vfs_inode);
-		} else if (full) {
-			/*
-			 * We are called after a full commit, inode has been
-			 * modified while the commit was running. Re-enqueue
-			 * the inode into STAGING, which will then be splice
-			 * back into MAIN. This cannot happen during
-			 * fastcommit because the journal is locked all the
-			 * time in that case (and tid doesn't increase so
-			 * tid check above isn't reliable).
-			 */
+		spin_unlock(&ei->i_fc_lock);
+		if (requeue)
 			list_add_tail(&ei->i_fc_list,
 				      &sbi->s_fc_q[FC_Q_STAGING]);
-		}
 		/*
 		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
 		 * visible before we send the wakeup. Pairs with implicit
-		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
 #if (BITS_PER_LONG < 64)
-- 
2.53.0



* [RFC v6 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode's i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode's i_data_sem from a regular inode's
i_data_sem. The journal inode is not tracked by fast commit and no FC
waiters ever depend on it, so this is not a real ABBA deadlock. Assign the
journal inode a dedicated i_data_sem lockdep subclass to avoid the false
positive.

Inode cache objects can be recycled, so also reset the i_data_sem lockdep
subclass to I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a
new inode may inherit an old subclass (journal/quota/ea) and trigger
lockdep warnings.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
  without changing the subclassing logic.
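
Illustration only, not part of the patch: a toy userspace model of why the
subclass is reset on allocation. It does not model lockdep itself. The
one-slot cache, struct inode_model and the model_*() helpers are
hypothetical, and the journal helper calls the allocator directly, which
simplifies the real lookup path.

```c
#include <assert.h>

/* Mirrors the order of the I_DATA_SEM_* enum after this patch. */
enum sem_subclass {
	SEM_NORMAL = 0,
	SEM_OTHER,
	SEM_QUOTA,
	SEM_EA,
	SEM_JOURNAL
};

/* Hypothetical stand-in for an ext4 inode; only the subclass is kept. */
struct inode_model {
	enum sem_subclass subclass;
};

/* One-slot "cache": a freed object is handed out again without zeroing. */
static struct inode_model cache_slot;

/* Models ext4_alloc_inode(): reset the subclass on every allocation. */
static struct inode_model *model_alloc_inode(void)
{
	cache_slot.subclass = SEM_NORMAL;
	return &cache_slot;
}

/* Models ext4_get_journal_inode(): tag the journal inode's lock. */
static struct inode_model *model_get_journal_inode(void)
{
	struct inode_model *inode = model_alloc_inode();

	inode->subclass = SEM_JOURNAL;
	return inode;
}

/* Usage: a recycled object must not keep the journal subclass. */
static void subclass_demo(void)
{
	struct inode_model *journal = model_get_journal_inode();

	assert(journal->subclass == SEM_JOURNAL);
	/* "Free" the journal inode, then reuse the same object. */
	assert(model_alloc_inode()->subclass == SEM_NORMAL);
}
```

Without the reset in model_alloc_inode(), the second allocation would
still carry SEM_JOURNAL, which is the stale-subclass case the commit
message describes.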

 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 98857292c707..66de888ae411 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1016,12 +1016,14 @@ do {										\
  *			  than the first
  *  I_DATA_SEM_QUOTA  - Used for quota inodes only
  *  I_DATA_SEM_EA     - Used for ea_inodes only
+ *  I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL = 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
 
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 578508eb4f1a..286f05834900 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1425,6 +1425,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
 
@@ -5904,6 +5907,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		  journal_inode, journal_inode->i_size);
 	return journal_inode;
-- 
2.53.0
