Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-07 22:09 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAGtprH-kgRByFvvCYeWMXtsvpb6qpaWAo8k-3PEnioyPg-LEvA@mail.gmail.com>

On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > > TLDR:
> > >
> > > + Think of populate ioctls not as KVM touching memory, but platform
> > >   handling population.
> > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > > + post_populate is platform-specific code that handles loading into
> > >   private destination memory just to support legacy non-in-place
> > >   conversion.
> > > + Don't complicate populate ioctls by doing conversion just to support
> > >   legacy use-cases where platform-specific code has to do copying on
> > >   the host.
> >
> > That's a good point: these are only considerations in the context of
> > actually copying from src->dst, but with in-place conversion the
> > primary/more-performant approach will be for userspace to initial
> > directly. I.e. if we enforced that, then gmem could right ascertain that
> > it isn't even writing to private pages via these hooks and any
> > manipulation of that memory is purely on the part of the trusted entity
> > handling initial encryption/etc.
> >
> > I understand that we decided to keep the option of allowing separate
> > src/dst even with in-place conversion, but it doesn't seem worthwhile if
> > that necessarily means we need to glue population+conversion together in
> > 1 clumsy interface that needs to handle partial return/error responses to
> > userspace (or potentially get stuck forever in the conversion path).
> 
> I think ARM needs userspace to specify separate source and destination
> memory ranges for initial population as ARM doesn't support in-place
> memory encryption. [1]
> 
> [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
> 
> >
> > So I agree with Ackerley's proposal (which I guess is the same as what's
> > in this series).
> >
> > However, 1 other alternative would be to do what was suggested on the
> > call, but require userspace to subsequently handle the shared->private
> > conversion. I think that would be workable too.
> 
> IIUC, Converting memory ranges to private after it essentially is
> treated as private by the KVM CC backend will expose the
> implementation to the same risk of userspace being able to access
> private memory and compromise host safety which guest_memfd was
> invented to address.

Doh, fair point. Doing conversion as part of the populate call would allow
us to use the filemap write-lock to avoid userspace being able to fault
in private (as tracked by trusted entity) pages before they are
transitioned to private (as tracked by KVM), so it's safer than having
userspace drive it.

But obviously I still think Ackerley's original proposal has more
upsides than the alternatives mentioned so far.

-Mike

> 
> >
> > One other benefit to Ackerley's/current approach however is that it allows
> > us to potentially keep hugepages intact in the populate path, since
> > prep'ing/encrypting everything while it's in a shared state means gmem will
> > split the hugepage and all the firmware/RMP/etc. data structures will only
> > be able to handle individual 4K pages. I still suspect doing things like
> > encoding the initial 2MB OVMF image as a single hugepage might yield
> > enough benefit to explore this (at some point). So there's some niceness
> > in knowing that Ackerley's approach would allow for that eventually and
> > not require a complete rethink on these same topics.
> >
> > Thanks,
> >
> > Mike
> >
> > >
> > > >>>
> > > >>> [...snip...]
> > > >>>

^ permalink raw reply

* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Song Liu @ 2026-04-07 23:09 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Yafang Shao, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
	rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
	yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
	bpf
In-Reply-To: <adUd0Mojbtrwmeod@pathway.suse.cz>

On Tue, Apr 7, 2026 at 8:08 AM Petr Mladek <pmladek@suse.com> wrote:
[...]
> > + * @replace:   replace tag:
> > + *             = 0: Atomic replace is disabled; however, this patch remains
> > + *                  eligible to be superseded by others.
>
> This is weird semantic. Which livepatch tag would be allowed to
> supersede it, please?
>
> Do we still need this category?
>
> > + *             > 0: Atomic replace is enabled. Only existing patches with a
> > + *                  matching replace tag will be superseded.
> >   * @list:      list node for global list of actively used patches
> >   * @kobj:      kobject for sysfs resources
> >   * @obj_list:  dynamic list of the object entries
> > @@ -137,7 +141,7 @@ struct klp_patch {
> >         struct module *mod;
> >         struct klp_object *objs;
> >         struct klp_state *states;
> > -       bool replace;
> > +       unsigned int replace;
>
> This already breaks the backward compatibility by changing the type
> and semantic of this field.

I was thinking if replace=0 means no replace, it is still backward
compatible. Did I miss something?

Thanks,
Song

> I would also change the name to better
> match the new semantic. What about using:
>
>
>   * @replace_set: Livepatch using the same @replace_set will get
>                 atomically replaced, see also conflicts[*].
>
>         unsigned int replace_set;
>
> [*] A livepatch module, livepatching an already livepatches function,
>     can be loaded only when it has the same @replace_set number.
>
>     By other words, two livepatches conflict when they have a different
>     @replace_set number and have at least one livepatched version
>     in common.

[...]

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-08  0:30 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Michael Roth, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgEtigp7+PVDkyu_DH947CUqDt312d+P+hWjjd2fHONiag@mail.gmail.com>

On Fri, Apr 03, 2026, Ackerley Tng wrote:
> Currently, in TDX's populate flow, KVM doesn't do any copying, it only
> instructs TDX to do the copying.

I disagree with this statement.  For all intents and purposes, the TDX-Module is
firmware.  If Intel had elected to implement TDX via XuCode, and presented it to
software as ISA (see SGX), then under the hood "firmware" would still be doing the
actual copy, but KVM would be execute some form of "copy" instruction.

Saying "KVM doesn't do any copying" is (very loosely) analogous to saying that
KVM doesn't copy anything when it does REP MOVSQ.  It wasn't me your honor, Intel's
string engine did it!

I don't think it changes anything in practice, but I don't want to treat TDX
SEAMCALLs (or SNP PSP commands) as something completely different than what we
usually think of as "hardware".

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-08  0:33 UTC (permalink / raw)
  To: Michael Roth
  Cc: Vishal Annapurve, Ackerley Tng, aik, andrew.jones, binbin.wu,
	brauner, chao.p.peng, david, ira.weiny, jmattson, jthoughton,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <nmy5polcxfnn3hoircsiqarxmzlulwwq7w34okanccntp32v56@h2eac44agovv>

On Tue, Apr 07, 2026, Michael Roth wrote:
> On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> > On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
> > >
> > > > TLDR:
> > > >
> > > > + Think of populate ioctls not as KVM touching memory, but platform
> > > >   handling population.
> > > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > > > + post_populate is platform-specific code that handles loading into
> > > >   private destination memory just to support legacy non-in-place
> > > >   conversion.
> > > > + Don't complicate populate ioctls by doing conversion just to support
> > > >   legacy use-cases where platform-specific code has to do copying on
> > > >   the host.
> > >
> > > That's a good point: these are only considerations in the context of
> > > actually copying from src->dst, but with in-place conversion the
> > > primary/more-performant approach will be for userspace to initial
> > > directly. I.e. if we enforced that, then gmem could right ascertain that
> > > it isn't even writing to private pages via these hooks and any
> > > manipulation of that memory is purely on the part of the trusted entity
> > > handling initial encryption/etc.
> > >
> > > I understand that we decided to keep the option of allowing separate
> > > src/dst even with in-place conversion, but it doesn't seem worthwhile if
> > > that necessarily means we need to glue population+conversion together in
> > > 1 clumsy interface that needs to handle partial return/error responses to
> > > userspace (or potentially get stuck forever in the conversion path).
> > 
> > I think ARM needs userspace to specify separate source and destination
> > memory ranges for initial population as ARM doesn't support in-place
> > memory encryption. [1]
> > 
> > [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
> > 
> > >
> > > So I agree with Ackerley's proposal (which I guess is the same as what's
> > > in this series).
> > >
> > > However, 1 other alternative would be to do what was suggested on the
> > > call, but require userspace to subsequently handle the shared->private
> > > conversion. I think that would be workable too.
> > 
> > IIUC, Converting memory ranges to private after it essentially is
> > treated as private by the KVM CC backend will expose the
> > implementation to the same risk of userspace being able to access
> > private memory and compromise host safety which guest_memfd was
> > invented to address.
> 
> Doh, fair point. Doing conversion as part of the populate call would allow
> us to use the filemap write-lock to avoid userspace being able to fault
> in private (as tracked by trusted entity) pages before they are
> transitioned to private (as tracked by KVM), so it's safer than having
> userspace drive it.
> 
> But obviously I still think Ackerley's original proposal has more
> upsides than the alternatives mentioned so far.

I'm a bit lost.  What exactly is/was Ackerley's original proposal?  If the answer
is "convert pages from shared=>private when populating via in-place conversion",
then I agree, because AFAICT, that's the only sane option.

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: Steven Rostedt @ 2026-04-08  0:37 UTC (permalink / raw)
  To: CaoRuichuang, Shuah Khan
  Cc: mhiramat, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407102613.81419-1-create0818@163.com>


Shuah,

Care to take this through your tree. Probably could even add:

Cc: stable@vger.kernel.org
Fixes: 8b55572e51805 ("tracing/selftests: Add tracefs mount options test")

As well as:

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve


On Tue,  7 Apr 2026 18:26:13 +0800
CaoRuichuang <create0818@163.com> wrote:

> From: Cao Ruichuang <create0818@163.com>
> 
> test_ownership.tc is sourced by ftracetest under /bin/sh.
> 
> The script currently declares mount_point with local at file scope,
> which makes /bin/sh abort with "local: not in a function" before the
> test can reach the eventfs ownership checks.
> 
> Replace the top-level local declaration with a normal shell variable so
> kernels that support the gid= tracefs mount option can run the test at
> all.
> 
> Signed-off-by: Cao Ruichuang <create0818@163.com>
> ---
>  tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> index e71cc3ad0..6d00d3c0f 100644
> --- a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> +++ b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> @@ -6,7 +6,7 @@
>  original_group=`stat -c "%g" .`
>  original_owner=`stat -c "%u" .`
>  
> -local mount_point=$(get_mount_point)
> +mount_point=$(get_mount_point)
>  
>  mount_options=$(get_mnt_options "$mount_point")
>  


^ permalink raw reply

* Re: [PATCH 14/24] nfsd: add tracepoint to dir_event handler
From: Steven Rostedt @ 2026-04-08  0:41 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker,
	Amir Goldstein, Calum Mackay, linux-fsdevel, linux-kernel,
	linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-14-aaf68c478abd@kernel.org>

On Tue, 07 Apr 2026 09:21:27 -0400
Jeff Layton <jlayton@kernel.org> wrote:

> Add some extra visibility around the fsnotify handlers.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/nfs4state.c |  2 ++
>  fs/nfsd/trace.h     | 20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+)

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Account for fprobe attachment at creation
From: Masami Hiramatsu @ 2026-04-08  0:49 UTC (permalink / raw)
  To: Cao Ruichuang
  Cc: rostedt, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407115751.96184-1-create0818@163.com>

On Tue,  7 Apr 2026 19:57:51 +0800
Cao Ruichuang <create0818@163.com> wrote:

> add_remove_fprobe.tc assumes that enabling an fprobe event is what adds
> its target function to enabled_functions.
> 
> On the current kernel, the fprobe target already appears in
> enabled_functions as soon as the event is created, and enabling the
> event does not change that count again. That makes the test fail even
> though the event lifecycle itself works.
> 
> Record the attachment baseline after creating the probe events and only
> check that enabling them keeps the expected functions attached. The
> cleanup checks still verify that removing the events returns
> enabled_functions to its original state.
> 

Hmm, did you run this test with not clean environment?
It is not designed to do so.
(I think we should not check ocnt, or if there are any entries,
 we should stop this test as unresolved.)

Thank you,

> Signed-off-by: Cao Ruichuang <create0818@163.com>
> ---
>  .../test.d/dynevent/add_remove_fprobe.tc      | 28 +++++++++++++------
>  1 file changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
> index 47067a5e3..ff08bd1ac 100644
> --- a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
> +++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
> @@ -26,23 +26,29 @@ grep -q myevent2 dynamic_events
>  grep -q myevent3 dynamic_events
>  test -d events/fprobes/myevent1
>  test -d events/fprobes/myevent2
> -
> -echo 1 > events/fprobes/myevent1/enable
> -# Make sure the event is attached.
>  grep -q $PLACE enabled_functions
> +grep -q $PLACE2 enabled_functions
>  cnt=`cat enabled_functions | wc -l`
> -if [ $cnt -eq $ocnt ]; then
> +if [ $cnt -le $ocnt ]; then
> +	exit_fail
> +fi
> +
> +echo 1 > events/fprobes/myevent1/enable
> +cnt1=`cat enabled_functions | wc -l`
> +if [ $cnt1 -ne $cnt ]; then
>  	exit_fail
>  fi
>  
>  echo 1 > events/fprobes/myevent2/enable
>  cnt2=`cat enabled_functions | wc -l`
> +if [ $cnt2 -ne $cnt1 ]; then
> +	exit_fail
> +fi
>  
>  echo 1 > events/fprobes/myevent3/enable
> -# If the function is different, the attached function should be increased
>  grep -q $PLACE2 enabled_functions
>  cnt=`cat enabled_functions | wc -l`
> -if [ $cnt -eq $cnt2 ]; then
> +if [ $cnt -ne $cnt2 ]; then
>  	exit_fail
>  fi
>  
> @@ -62,11 +68,15 @@ if [ $cnt -ne $ocnt ]; then
>  fi
>  
>  echo "f:myevent4 $PLACE" >> dynamic_events
> +grep -q $PLACE enabled_functions
> +cnt=`cat enabled_functions | wc -l`
> +if [ $cnt -le $ocnt ]; then
> +	exit_fail
> +fi
>  
>  echo 1 > events/fprobes/myevent4/enable
> -# Should only have one enabled
> -cnt=`cat enabled_functions | wc -l`
> -if [ $cnt -ne $((ocnt + 1)) ]; then
> +cnt2=`cat enabled_functions | wc -l`
> +if [ $cnt2 -ne $cnt ]; then
>  	exit_fail
>  fi
>  
> -- 
> 2.39.5 (Apple Git-154)
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Steven Rostedt @ 2026-04-08  1:05 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	Tom Zanussi
In-Reply-To: <20260401112224.85582-1-pengpeng@iscas.ac.cn>


Tom,

On Wed,  1 Apr 2026 19:22:23 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:

> hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
> qualified variable-reference names, but it currently appends into that
> buffer with strcat() without rebuilding it first. As a result, repeated
> calls append a new "system.event.field" name onto the previous one,
> which can eventually run past the end of full_name.
> 
> Build the name with snprintf() on each call and return NULL if the fully
> qualified name does not fit in MAX_FILTER_STR_VAL.
> 
> Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> ---
> Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/
> 
> - rebuild full_name on each call instead of falling back to field->name
> - return NULL on overflow as suggested
> - split out the snprintf() length check instead of using an inline if
> 
>  kernel/trace/trace_events_hist.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> index 73ea180cad55..f9c8a4f078ea 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -1361,12 +1361,14 @@ static const char *hist_field_name(struct hist_field *field,
>  		 field->flags & HIST_FIELD_FL_VAR_REF) {
>  		if (field->system) {
>  			static char full_name[MAX_FILTER_STR_VAL];
> +			int len;
> +
> +			len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
> +				       field->system, field->event_name,
> +				       field->name);
> +			if (len >= sizeof(full_name))
> +				return NULL;
>  
> -			strcat(full_name, field->system);
> -			strcat(full_name, ".");
> -			strcat(full_name, field->event_name);
> -			strcat(full_name, ".");
> -			strcat(full_name, field->name);
>  			field_name = full_name;

I wanted to test this but I can't find anything that triggers this path.
How does a field here get its ->system set?

If there's no way to hit this path, I much rather remove it than "fix" it.

-- Steve


>  		} else
>  			field_name = field->name;


^ permalink raw reply

* Re: [PATCH] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu @ 2026-04-08  1:07 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Menglong Dong, Mathieu Desnoyers, jiang.biao,
	linux-kernel, linux-trace-kernel
In-Reply-To: <177555745233.1441308.6535024692186959381.stgit@mhiramat.tok.corp.google.com>

Hi,

Sashiko reported another concern in module callback [1].
Let me fix it.

[1] https://sashiko.dev/#/patchset/177555745233.1441308.6535024692186959381.stgit%40mhiramat.tok.corp.google.com

Thanks,

On Tue,  7 Apr 2026 19:24:12 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
> introduced a different ftrace_ops for entry-only fprobes.
> 
> However, when unregistering an fprobe, the kernel only checks if another
> fprobe exists at the same address, without checking which type of fprobe
> it is.
> If different fprobes are registered at the same address, the same address
> will be registered in both fgraph_ops and ftrace_ops, but only one of
> them will be deleted when unregistering. (the one removed first will not
> be deleted from the ops).
> 
> This results in junk entries remaining in either fgraph_ops or ftrace_ops.
> For example:
>  =======
>  cd /sys/kernel/tracing
> 
>  # 'Add entry and exit events on the same place'
>  echo 'f:event1 vfs_read' >> dynamic_events
>  echo 'f:event2 vfs_read%return' >> dynamic_events
> 
>  # 'Enable both of them'
>  echo 1 > events/fprobes/enable
>  cat enabled_functions
> vfs_read (2)            ->arch_ftrace_ops_list_func+0x0/0x210
> 
>  # 'Disable and remove exit event'
>  echo 0 > events/fprobes/event2/enable
>  echo -:event2 >> dynamic_events
> 
>  # 'Disable and remove all events'
>  echo 0 > events/fprobes/enable
>  echo > dynamic_events
> 
>  # 'Add another event'
>  echo 'f:event3 vfs_open%return' > dynamic_events
>  cat dynamic_events
> f:fprobes/event3 vfs_open%return
> 
>  echo 1 > events/fprobes/enable
>  cat enabled_functions
> vfs_open (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
> vfs_read (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
>  =======
> 
> As you can see, an entry for the vfs_read remains.
> 
> To fix this issue, when unregistering, the kernel should also check if
> there is the same type of fprobes still exist at the same address, and
> if not, delete its entry from either fgraph_ops or ftrace_ops.
> 
> Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  kernel/trace/fprobe.c |   77 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 62 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index dcadf1d23b8a..7f75e6e4462c 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -85,11 +85,9 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node)
>  	return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
>  }
>  
> -/* Return true if there are synonims */
> -static bool delete_fprobe_node(struct fprobe_hlist_node *node)
> +static void delete_fprobe_node(struct fprobe_hlist_node *node)
>  {
>  	lockdep_assert_held(&fprobe_mutex);
> -	bool ret;
>  
>  	/* Avoid double deleting */
>  	if (READ_ONCE(node->fp) != NULL) {
> @@ -97,13 +95,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
>  		rhltable_remove(&fprobe_ip_table, &node->hlist,
>  				fprobe_rht_params);
>  	}
> -
> -	rcu_read_lock();
> -	ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
> -				fprobe_rht_params);
> -	rcu_read_unlock();
> -
> -	return ret;
>  }
>  
>  /* Check existence of the fprobe */
> @@ -337,6 +328,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
>  	return !fp->exit_handler;
>  }
>  
> +static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
> +{
> +	struct rhlist_head *head, *pos;
> +	struct fprobe_hlist_node *node;
> +	struct fprobe *fp;
> +
> +	guard(rcu)();
> +	head = rhltable_lookup(&fprobe_ip_table, &ip,
> +				fprobe_rht_params);
> +	if (!head)
> +		return false;
> +	/* We have to check the same type on the list. */
> +	rhl_for_each_entry_rcu(node, pos, head, hlist) {
> +		if (node->addr != ip)
> +			break;
> +		fp = READ_ONCE(node->fp);
> +		if (likely(fp)) {
> +			if ((!ftrace && fp->exit_handler) ||
> +			    (ftrace && !fp->exit_handler))
> +				return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_MODULES
>  static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
>  			   int reset)
> @@ -360,6 +377,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
>  	return false;
>  }
>  
> +static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
> +{
> +	struct rhlist_head *head, *pos;
> +	struct fprobe_hlist_node *node;
> +	struct fprobe *fp;
> +
> +	guard(rcu)();
> +	head = rhltable_lookup(&fprobe_ip_table, &ip,
> +				fprobe_rht_params);
> +	if (!head)
> +		return false;
> +	/* We only need to check fp is there. */
> +	rhl_for_each_entry_rcu(node, pos, head, hlist) {
> +		if (node->addr != ip)
> +			break;
> +		fp = READ_ONCE(node->fp);
> +		if (likely(fp))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_MODULES
>  static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
>  			   int reset)
> @@ -574,15 +614,20 @@ static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long ad
>  static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
>  					 struct fprobe_addr_list *alist)
>  {
> +	lockdep_assert_in_rcu_read_lock();
> +
>  	if (!within_module(node->addr, mod))
>  		return;
> -	if (delete_fprobe_node(node))
> -		return;
> +
> +	delete_fprobe_node(node);
>  	/*
> -	 * If failed to update alist, just continue to update hlist.
> +	 * Ignore failure of updating alist, but continue to update hlist.
>  	 * Therefore, at list user handler will not hit anymore.
> +	 * And don't care the type here, because all fprobes on the same
> +	 * address must be removed eventually.
>  	 */
> -	fprobe_addr_list_add(alist, node->addr);
> +	if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params))
> +		fprobe_addr_list_add(alist, node->addr);
>  }
>  
>  /* Handle module unloading to manage fprobe_ip_table. */
> @@ -943,7 +988,9 @@ int unregister_fprobe(struct fprobe *fp)
>  	/* Remove non-synonim ips from table and hash */
>  	count = 0;
>  	for (i = 0; i < hlist_array->size; i++) {
> -		if (!delete_fprobe_node(&hlist_array->array[i]))
> +		delete_fprobe_node(&hlist_array->array[i]);
> +		if (!fprobe_exists_on_hash(hlist_array->array[i].addr,
> +					   fprobe_is_ftrace(fp)))
>  			addrs[count++] = hlist_array->array[i].addr;
>  	}
>  	del_fprobe_hash(fp);
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next v7 0/2] tracing: Fix kprobe attachment when module shadows vmlinux symbol
From: patchwork-bot+netdevbpf @ 2026-04-08  1:20 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: bpf, linux-trace-kernel, ast, daniel, andrii, jolsa, rostedt,
	mhiramat, ihor.solodrai, emil, linux-open-source
In-Reply-To: <20260407203912.1787502-1-andrey.grodzovsky@crowdstrike.com>

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 7 Apr 2026 16:39:10 -0400 you wrote:
> When a kernel module exports a symbol with the same name as an existing
> vmlinux symbol, kprobe attachment fails with -EADDRNOTAVAIL because
> number_of_same_symbols() counts matches across both vmlinux and all
> loaded modules, returning a count greater than 1.
> 
> This series takes a different approach from v1-v4, which implemented a
> libbpf-side fallback parsing /proc/kallsyms and retrying with the
> absolute address. That approach was rejected (Andrii Nakryiko, Ihor
> Solodrai) because ambiguous symbol resolution does not belong in libbpf.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v7,1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
    https://git.kernel.org/bpf/bpf-next/c/1870ddcd94b0
  - [bpf-next,v7,2/2] selftests/bpf: Add tests for kprobe attachment with duplicate symbols
    https://git.kernel.org/bpf/bpf-next/c/cea4323f1cfe

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Pengpeng Hou @ 2026-04-08  2:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Tom Zanussi,
	linux-trace-kernel, linux-kernel, pengpeng
In-Reply-To: <20260401112224.85582-1-pengpeng@iscas.ac.cn>

Hi Steve,

Thanks for looking at this.

I believe this path is reachable.

field->system is set when a VAR_REF is created with an event-qualified
reference. The assignment happens in init_var_ref(), which is called
from create_var_ref().

The parser accepts fully qualified references of the form

  system.event.$var

and parse_expr() passes the parsed system/event strings into
create_var_ref(). trace_action_create() also accepts explicitly
qualified action parameters in the same form and passes those strings
into create_var_ref().

This is also documented in Documentation/trace/histogram.rst:

  fully-qualified name is of the form 'system.event_name.$var_name'
  or 'system.event_name.field'.

The path is user-visible because reading back the trigger goes through

  event_hist_trigger_print() -> hist_field_print() -> hist_field_name()

so creating a hist trigger that uses an inter-event variable reference
in fully qualified form is enough to exercise it.

So I don't think this is dead code. If you would prefer a different way
of handling the printed name, I can respin accordingly, but I don't
think the current branch is unreachable.

Thanks,
Pengpeng

^ permalink raw reply

* Re: [PATCH] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu @ 2026-04-08  2:02 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Menglong Dong, Mathieu Desnoyers, jiang.biao,
	linux-kernel, linux-trace-kernel
In-Reply-To: <177555745233.1441308.6535024692186959381.stgit@mhiramat.tok.corp.google.com>

Hmm,

I also found another problem when probing module with fprobe.

/sys/kernel/tracing # insmod xt_LOG.ko 
/sys/kernel/tracing # echo 'f:test log_tg*' > dynamic_events 
/sys/kernel/tracing # echo 1 > events/fprobes/test/enable 
/sys/kernel/tracing # cat enabled_functions 
log_tg [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1)               tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
/sys/kernel/tracing # wc -l enabled_functions 
3 enabled_functions
/sys/kernel/tracing # rmmod xt_LOG
/sys/kernel/tracing # wc -l enabled_functions 
34085 enabled_functions

It seems to reverse the selected functions if the hash is empty...

Thanks,

On Tue,  7 Apr 2026 19:24:12 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
> introduced a different ftrace_ops for entry-only fprobes.
> 
> However, when unregistering an fprobe, the kernel only checks if another
> fprobe exists at the same address, without checking which type of fprobe
> it is.
> If different fprobes are registered at the same address, the same address
> will be registered in both fgraph_ops and ftrace_ops, but only one of
> them will be deleted when unregistering. (the one removed first will not
> be deleted from the ops).
> 
> This results in junk entries remaining in either fgraph_ops or ftrace_ops.
> For example:
>  =======
>  cd /sys/kernel/tracing
> 
>  # 'Add entry and exit events on the same place'
>  echo 'f:event1 vfs_read' >> dynamic_events
>  echo 'f:event2 vfs_read%return' >> dynamic_events
> 
>  # 'Enable both of them'
>  echo 1 > events/fprobes/enable
>  cat enabled_functions
> vfs_read (2)            ->arch_ftrace_ops_list_func+0x0/0x210
> 
>  # 'Disable and remove exit event'
>  echo 0 > events/fprobes/event2/enable
>  echo -:event2 >> dynamic_events
> 
>  # 'Disable and remove all events'
>  echo 0 > events/fprobes/enable
>  echo > dynamic_events
> 
>  # 'Add another event'
>  echo 'f:event3 vfs_open%return' > dynamic_events
>  cat dynamic_events
> f:fprobes/event3 vfs_open%return
> 
>  echo 1 > events/fprobes/enable
>  cat enabled_functions
> vfs_open (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
> vfs_read (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
>  =======
> 
> As you can see, an entry for the vfs_read remains.
> 
> To fix this issue, when unregistering, the kernel should also check if
> there is the same type of fprobes still exist at the same address, and
> if not, delete its entry from either fgraph_ops or ftrace_ops.
> 
> Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  kernel/trace/fprobe.c |   77 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 62 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index dcadf1d23b8a..7f75e6e4462c 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -85,11 +85,9 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node)
>  	return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
>  }
>  
> -/* Return true if there are synonims */
> -static bool delete_fprobe_node(struct fprobe_hlist_node *node)
> +static void delete_fprobe_node(struct fprobe_hlist_node *node)
>  {
>  	lockdep_assert_held(&fprobe_mutex);
> -	bool ret;
>  
>  	/* Avoid double deleting */
>  	if (READ_ONCE(node->fp) != NULL) {
> @@ -97,13 +95,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
>  		rhltable_remove(&fprobe_ip_table, &node->hlist,
>  				fprobe_rht_params);
>  	}
> -
> -	rcu_read_lock();
> -	ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
> -				fprobe_rht_params);
> -	rcu_read_unlock();
> -
> -	return ret;
>  }
>  
>  /* Check existence of the fprobe */
> @@ -337,6 +328,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
>  	return !fp->exit_handler;
>  }
>  
> +static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
> +{
> +	struct rhlist_head *head, *pos;
> +	struct fprobe_hlist_node *node;
> +	struct fprobe *fp;
> +
> +	guard(rcu)();
> +	head = rhltable_lookup(&fprobe_ip_table, &ip,
> +				fprobe_rht_params);
> +	if (!head)
> +		return false;
> +	/* We have to check the same type on the list. */
> +	rhl_for_each_entry_rcu(node, pos, head, hlist) {
> +		if (node->addr != ip)
> +			break;
> +		fp = READ_ONCE(node->fp);
> +		if (likely(fp)) {
> +			if ((!ftrace && fp->exit_handler) ||
> +			    (ftrace && !fp->exit_handler))
> +				return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_MODULES
>  static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
>  			   int reset)
> @@ -360,6 +377,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
>  	return false;
>  }
>  
> +static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
> +{
> +	struct rhlist_head *head, *pos;
> +	struct fprobe_hlist_node *node;
> +	struct fprobe *fp;
> +
> +	guard(rcu)();
> +	head = rhltable_lookup(&fprobe_ip_table, &ip,
> +				fprobe_rht_params);
> +	if (!head)
> +		return false;
> +	/* We only need to check fp is there. */
> +	rhl_for_each_entry_rcu(node, pos, head, hlist) {
> +		if (node->addr != ip)
> +			break;
> +		fp = READ_ONCE(node->fp);
> +		if (likely(fp))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_MODULES
>  static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
>  			   int reset)
> @@ -574,15 +614,20 @@ static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long ad
>  static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
>  					 struct fprobe_addr_list *alist)
>  {
> +	lockdep_assert_in_rcu_read_lock();
> +
>  	if (!within_module(node->addr, mod))
>  		return;
> -	if (delete_fprobe_node(node))
> -		return;
> +
> +	delete_fprobe_node(node);
>  	/*
> -	 * If failed to update alist, just continue to update hlist.
> +	 * Ignore failure of updating alist, but continue to update hlist.
>  	 * Therefore, at list user handler will not hit anymore.
> +	 * And don't care the type here, because all fprobes on the same
> +	 * address must be removed eventually.
>  	 */
> -	fprobe_addr_list_add(alist, node->addr);
> +	if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params))
> +		fprobe_addr_list_add(alist, node->addr);
>  }
>  
>  /* Handle module unloading to manage fprobe_ip_table. */
> @@ -943,7 +988,9 @@ int unregister_fprobe(struct fprobe *fp)
>  	/* Remove non-synonim ips from table and hash */
>  	count = 0;
>  	for (i = 0; i < hlist_array->size; i++) {
> -		if (!delete_fprobe_node(&hlist_array->array[i]))
> +		delete_fprobe_node(&hlist_array->array[i]);
> +		if (!fprobe_exists_on_hash(hlist_array->array[i].addr,
> +					   fprobe_is_ftrace(fp)))
>  			addrs[count++] = hlist_array->array[i].addr;
>  	}
>  	del_fprobe_hash(fp);
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Steven Rostedt @ 2026-04-08  2:14 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Tom Zanussi,
	linux-trace-kernel, linux-kernel
In-Reply-To: <20260408101500.1-tracing-hist-reachability-reply-pengpeng@iscas.ac.cn>

On Wed, 8 Apr 2026 10:15:00 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:

> Hi Steve,
> 
> Thanks for looking at this.
> 
> I believe this path is reachable.
> 
> field->system is set when a VAR_REF is created with an event-qualified
> reference. The assignment happens in init_var_ref(), which is called
> from create_var_ref().
> 
> The parser accepts fully qualified references of the form
> 
>   system.event.$var
> 
> and parse_expr() passes the parsed system/event strings into
> create_var_ref(). trace_action_create() also accepts explicitly
> qualified action parameters in the same form and passes those strings
> into create_var_ref().
> 
> This is also documented in Documentation/trace/histogram.rst:
> 
>   fully-qualified name is of the form 'system.event_name.$var_name'
>   or 'system.event_name.field'.

Ah I read the document and missed this part. Thanks for pointing it out.

None of our tests use this, so that definitely needs to be fixed.

OK, yeah I can trigger that path now. I'll continue looking at the patch.

Thanks,

-- Steve


> 
> The path is user-visible because reading back the trigger goes through
> 
>   event_hist_trigger_print() -> hist_field_print() -> hist_field_name()
> 
> so creating a hist trigger that uses an inter-event variable reference
> in fully qualified form is enough to exercise it.
> 
> So I don't think this is dead code. If you would prefer a different way
> of handling the printed name, I can respin accordingly, but I don't
> think the current branch is unreachable.
> 
> Thanks,
> Pengpeng
> 


^ permalink raw reply

* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Yafang Shao @ 2026-04-08  2:40 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Song Liu, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
	rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
	yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
	bpf
In-Reply-To: <adUd0Mojbtrwmeod@pathway.suse.cz>

On Tue, Apr 7, 2026 at 11:08 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Tue 2026-04-07 17:45:31, Yafang Shao wrote:
> > On Tue, Apr 7, 2026 at 11:16 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
> > > >
> > > > On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> > > > [...]
> > > > > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > > > >   are replaceable.
> > > > > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > > > >   and are not replaceable.
> > > > > > > >
> > > > > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > > > > livepatch. Is this the expected use case?
> > > > > > >
> > > > > > > That matches our expected use case.
> > > > > >
> > > > > > If we really want to serve use cases like this, I think we can introduce
> > > > > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > > > > Newly loaded livepatch will only replace existing livepatch with the
> > > > > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > > > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > > > > replace tag.
> > > > > >
> > > > > > For current users of cumulative patches, all the livepatch will have the
> > > > > > same tag, say 1. For your use case, you can assign each user a
> > > > > > unique tag. Then all these users can do atomic upgrades of their
> > > > > > own livepatches.
> > > > > >
> > > > > > We may also need to check whether two livepatches of different tags
> > > > > > touch the same kernel function. When that happens, the later
> > > > > > livepatch should fail to load.
>
> I still think how to make the hybrid mode more secure:
>
>     + The isolated sets of livepatched functions look like a good rule.
>     + What about isolating the shadow variables/states as well?

We might consider extending the klp_shadow_* API to support the new
livepatch tag.

>
> > > That sounds like a viable solution. I'll look into it and see how we
> > > can implement it.
> >
> > Does the following change look good to you ?
> >
> > Subject: [PATCH] livepatch: Support scoped atomic replace using replace tags
> >
> > Extend the replace attribute from a boolean to a u32 to act as a replace
> > tag. This introduces the following semantics:
> >
> >   replace = 0: Atomic replace is disabled. However, this patch remains
> >                eligible to be superseded by others.
> >   replace > 0: Enables tagged replace (default is 1). A newly loaded
> >                livepatch will only replace existing patches that share the
> >                same tag.
> >
> > To maintain backward compatibility, a patch with replace == 0 does not
> > trigger an outgoing atomic replace, but remains eligible to be superseded
> > by any incoming patch with a valid replace tag.
> >
> > diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> > index ba9e3988c07c..417c67a17b99 100644
> > --- a/include/linux/livepatch.h
> > +++ b/include/linux/livepatch.h
> > @@ -123,7 +123,11 @@ struct klp_state {
> >   * @mod:       reference to the live patch module
> >   * @objs:      object entries for kernel objects to be patched
> >   * @states:    system states that can get modified
> > - * @replace:   replace all actively used patches
> > + * @replace:   replace tag:
> > + *             = 0: Atomic replace is disabled; however, this patch remains
> > + *                  eligible to be superseded by others.
>
> This is weird semantic. Which livepatch tag would be allowed to
> supersede it, please?
>
> Do we still need this category?

It can be superseded by any livepatch that has a non-zero tag set.

This ensures backward compatibility: while a non-atomic-replace
livepatch can be superseded by an atomic-replace one, the reverse is
not permitted—an atomic-replace livepatch cannot be superseded by a
non-atomic one.

>
> > + *             > 0: Atomic replace is enabled. Only existing patches with a
> > + *                  matching replace tag will be superseded.
> >   * @list:      list node for global list of actively used patches
> >   * @kobj:      kobject for sysfs resources
> >   * @obj_list:  dynamic list of the object entries
> > @@ -137,7 +141,7 @@ struct klp_patch {
> >         struct module *mod;
> >         struct klp_object *objs;
> >         struct klp_state *states;
> > -       bool replace;
> > +       unsigned int replace;
>
> This already breaks the backward compatibility

It doesn't break backward compatibility.

> by changing the type
> and semantic of this field. I would also change the name to better
> match the new semantic. What about using:
>
>
>   * @replace_set: Livepatch using the same @replace_set will get
>                 atomically replaced, see also conflicts[*].
>
>         unsigned int replace_set;
>
> [*] A livepatch module, livepatching an already livepatches function,
>     can be loaded only when it has the same @replace_set number.
>
>     By other words, two livepatches conflict when they have a different
>     @replace_set number and have at least one livepatched version
>     in common.

Renaming it to @replace_set makes sense to me.

>
> >
> >         /* internal */
> >         struct list_head list;
> > diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> > index 28d15ba58a26..e4e5c03b0724 100644
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -793,6 +793,8 @@ void klp_free_replaced_patches_async(struct
> > klp_patch *new_patch)
> >         klp_for_each_patch_safe(old_patch, tmp_patch) {
> >                 if (old_patch == new_patch)
> >                         return;
> > +               if (old_patch->replace && old_patch->replace !=
> > new_patch->replace)
> > +                       continue;
> >                 klp_free_patch_async(old_patch);
> >         }
> >  }
> > @@ -1194,6 +1196,8 @@ void klp_unpatch_replaced_patches(struct
> > klp_patch *new_patch)
> >         klp_for_each_patch(old_patch) {
> >                 if (old_patch == new_patch)
> >                         return;
> > +               if (old_patch->replace && old_patch->replace !=
> > new_patch->replace)
> > +                       continue;
> >
> >                 old_patch->enabled = false;
> >                 klp_unpatch_objects(old_patch);
>
> This handles only the freeing part. More changes will be
> necessary:
>
>    + klp_is_patch_compatible() must check also conflicts
>      between livepatches with different @replace_set.
>      The conflicts might be in the lists of:
>
>         + livepatched functions
>         + state IDs (aka callbacks and shadow variables IDs)
>
>    + klp_add_nops() must skip livepatches with another @replace_set
>
>    + klp_unpatch_replaced_patches() should unpatch only
>      patches with the same @replace_set

I appreciate the guidance on this.

>
> Finally, we would need to update existing selftests
> plus add new selftests.
>
> It is possible that I have missed something.
>
> Anyway, you should wait for more feedback before you do too much
> coding, especially the selftests are not needed at RFC stage.

Sure.

-- 
Regards
Yafang

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Account for fprobe attachment at creation
From: Cao Ruichuang @ 2026-04-08  4:09 UTC (permalink / raw)
  To: mhiramat; +Cc: rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260408094920.417cbb5e64220afc2bf1b2aa@kernel.org>

Thanks, that is a good point.

I need to separate a clean source tree from a clean tracing runtime
environment. I will rerun this in a fresh tracing environment and
re-evaluate the testcase assumption before pushing anything further.

If this testcase is only valid for a clean environment, then an
unresolved check or a different precondition will make more sense
than the current patch.

Thanks,
Cao Ruichuang

^ permalink raw reply

* [PATCH v2] selftests/ftrace: Account for fprobe attachment before enable
From: Cao Ruichuang @ 2026-04-08  4:25 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, shuah, linux-kernel, linux-trace-kernel,
	linux-kselftest
In-Reply-To: <20260407115751.96184-1-create0818@163.com>

add_remove_fprobe.tc assumes that enabling an fprobe event is what
adds its target function to enabled_functions.

After resetting tracing state with initialize_system(), the current
kernel still attaches the target functions as soon as the probe events
are created. Enabling the three events does not increase the
enabled_functions count further, and the original testcase fails on the
check after enabling myevent3.

Record the baseline after creating the fprobe events and verify that
subsequent enables keep that attachment set unchanged.

Signed-off-by: Cao Ruichuang <create0818@163.com>
---
v2:
- rerun in a reset tracing environment after initialize_system()
- keep the fix focused on the failing myevent3 assumption and drop the
  unnecessary myevent4 changes

 .../ftrace/test.d/dynevent/add_remove_fprobe.tc | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
index 47067a5e3..a55179f5d 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
@@ -27,22 +27,29 @@ grep -q myevent3 dynamic_events
 test -d events/fprobes/myevent1
 test -d events/fprobes/myevent2
 
-echo 1 > events/fprobes/myevent1/enable
-# Make sure the event is attached.
 grep -q $PLACE enabled_functions
+grep -q $PLACE2 enabled_functions
 cnt=`cat enabled_functions | wc -l`
-if [ $cnt -eq $ocnt ]; then
+if [ $cnt -le $ocnt ]; then
+	exit_fail
+fi
+
+echo 1 > events/fprobes/myevent1/enable
+cnt1=`cat enabled_functions | wc -l`
+if [ $cnt1 -ne $cnt ]; then
 	exit_fail
 fi
 
 echo 1 > events/fprobes/myevent2/enable
 cnt2=`cat enabled_functions | wc -l`
+if [ $cnt2 -ne $cnt1 ]; then
+	exit_fail
+fi
 
 echo 1 > events/fprobes/myevent3/enable
-# If the function is different, the attached function should be increased
 grep -q $PLACE2 enabled_functions
 cnt=`cat enabled_functions | wc -l`
-if [ $cnt -eq $cnt2 ]; then
+if [ $cnt -ne $cnt2 ]; then
 	exit_fail
 fi
 
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related

* [PATCH] selftests/ftrace: Quote check_requires comparisons
From: Cao Ruichuang @ 2026-04-08  4:32 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, shuah, linux-kernel, linux-trace-kernel,
	linux-kselftest

check_requires() compares requirement strings that can contain shell
pattern characters such as '[' and ']'. Under /bin/sh, the unquoted
test expressions can emit 'unexpected operator' warnings while parsing
README-backed requirements.

Quote the relevant comparisons and path checks so the helper handles
those patterns without spurious shell warnings.

Validated by rerunning fprobe_syntax_errors.tc and confirming the
previous '/bin/sh: unexpected operator' lines disappear from the
detailed ftracetest log.

Signed-off-by: Cao Ruichuang <create0818@163.com>
---
 tools/testing/selftests/ftrace/test.d/functions | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/functions b/tools/testing/selftests/ftrace/test.d/functions
index e8e718139..442aa28ff 100644
--- a/tools/testing/selftests/ftrace/test.d/functions
+++ b/tools/testing/selftests/ftrace/test.d/functions
@@ -145,13 +145,13 @@ check_requires() { # Check required files and tracers
 	p=${i%:program}
         r=${i%:README}
         t=${i%:tracer}
-	if [ $p != $i ]; then
-	    if ! which $p ; then
+	if [ "$p" != "$i" ]; then
+	    if ! which "$p" ; then
                 echo "Required program $p is not found."
                 exit_unresolved
 	    fi
-        elif [ $t != $i ]; then
-            if ! grep -wq $t available_tracers ; then
+        elif [ "$t" != "$i" ]; then
+            if ! grep -wq "$t" available_tracers ; then
                 echo "Required tracer $t is not configured."
                 exit_unsupported
             fi
@@ -162,11 +162,11 @@ check_requires() { # Check required files and tracers
 	    else
 		test=$TRACING_DIR
 	    fi
-            if ! grep -Fq "$r" $test/README ; then
+            if ! grep -Fq "$r" "$test"/README ; then
                 echo "Required feature pattern \"$r\" is not in README."
                 exit_unsupported
             fi
-        elif [ ! -e $i ]; then
+        elif [ ! -e "$i" ]; then
             echo "Required feature interface $i doesn't exist."
             exit_unsupported
         fi
@@ -223,4 +223,4 @@ get_mnt_options() {
 	local opts=$(mount | grep -m1 "$mnt_point" | sed -e 's/.*(\(.*\)).*/\1/')
 
 	echo "$opts"
-}
\ No newline at end of file
+}
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related

* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Song Liu @ 2026-04-08  6:51 UTC (permalink / raw)
  To: Yafang Shao
  Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
	andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
	linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CALOAHbAmTAfamStF9sZtO6efWYJ1sbXJp3PbsVapZf7dba91ig@mail.gmail.com>

On Mon, Apr 6, 2026 at 8:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > We can define the struct_ops in an OOT kernel module. Then we
> > can attach BPF programs to the struct_ops. We may need
> > livepatch to connect the new struct_ops to original kernel logic.
> >
> > I think kernel side of this solution is mostly available, but we may
> > need some work on the toolchain side.
> >
> > Does this make sense?
>
> Are there actual benefits to using struct_ops instead of
> bpf_override_return? So far, I’ve only found it adds complexity
> without much gain.

Yes, struct_ops can be more powerful. For example, we can use
something like the following to modify the content of /proc/cmdline
with a bpf program. AFAICT, this is not possible with
bpf_override_return.

Note that this example doesn't require kernel changes. This is
actually a fun example. I will send the full code as a new self test.

Thanks,
Song


======= key logic of the livepatch module =======

static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
        struct klp_bpf_cmdline_ops *ops = READ_ONCE(active_ops);

        if (ops && ops->set_cmdline)
                return ops->set_cmdline(m);

        seq_printf(m, "%s: no struct_ops attached\n", THIS_MODULE->name);
        return 0;
}

static struct klp_func funcs[] = {
        {
                .old_name = "cmdline_proc_show",
                .new_func = livepatch_cmdline_proc_show,
        }, { }
};


========== key logic of the bpf program =========

SEC("struct_ops/set_cmdline")
int BPF_PROG(set_cmdline, struct seq_file *m)
{
        char custom[] = "klp_bpf: custom cmdline\n";

        bpf_klp_seq_write(m, custom, sizeof(custom) - 1);
        return 0;
}

SEC(".struct_ops.link")
struct klp_bpf_cmdline_ops cmdline_ops = {
        .set_cmdline = (void *)set_cmdline,
};

^ permalink raw reply

* [PATCH] uprobes: clear extra_consumers before pooling return instances
From: Keenan Dong @ 2026-04-08 10:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mhiramat, oleg
  Cc: mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	james.clark, andrii, linux-perf-users, linux-kernel,
	linux-trace-kernel, Keenan Dong

ri_pool_push() returns a return_instance to the per-task pool for
later reuse. The pool reset clears cons_cnt, but it leaves
extra_consumers behind.

A reused return_instance can later grow a fresh extra_consumers
array and then reach the cleanup path with a stale pointer from its
previous lifetime, leading to a double free of the recycled object.

Free and clear extra_consumers before putting the instance back into
the pool so every reused entry starts from a clean state.

Fixes: 8622e45b5da1 ("uprobes: Reuse return_instances between multiple uretprobes within task")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
---
 kernel/events/uprobes.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 923b24b321cc..24b9884a2667 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1945,6 +1945,8 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 
 static void ri_pool_push(struct uprobe_task *utask, struct return_instance *ri)
 {
+	kfree(ri->extra_consumers);
+	ri->extra_consumers = NULL;
 	ri->cons_cnt = 0;
 	ri->next = utask->ri_pool;
 	utask->ri_pool = ri;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Steven Price @ 2026-04-08 11:01 UTC (permalink / raw)
  To: Vishal Annapurve, Michael Roth
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Jason Gunthorpe,
	Vlastimil Babka, kvm, linux-kernel, linux-trace-kernel, linux-doc,
	linux-kselftest, linux-mm
In-Reply-To: <CAGtprH-kgRByFvvCYeWMXtsvpb6qpaWAo8k-3PEnioyPg-LEvA@mail.gmail.com>

On 07/04/2026 22:50, Vishal Annapurve wrote:
> On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
>>
>>> TLDR:
>>>
>>> + Think of populate ioctls not as KVM touching memory, but platform
>>>   handling population.
>>> + KVM code (kvm_gmem_populate) still doesn't touch memory contents
>>> + post_populate is platform-specific code that handles loading into
>>>   private destination memory just to support legacy non-in-place
>>>   conversion.
>>> + Don't complicate populate ioctls by doing conversion just to support
>>>   legacy use-cases where platform-specific code has to do copying on
>>>   the host.
>>
>> That's a good point: these are only considerations in the context of
>> actually copying from src->dst, but with in-place conversion the
>> primary/more-performant approach will be for userspace to initial
>> directly. I.e. if we enforced that, then gmem could right ascertain that
>> it isn't even writing to private pages via these hooks and any
>> manipulation of that memory is purely on the part of the trusted entity
>> handling initial encryption/etc.
>>
>> I understand that we decided to keep the option of allowing separate
>> src/dst even with in-place conversion, but it doesn't seem worthwhile if
>> that necessarily means we need to glue population+conversion together in
>> 1 clumsy interface that needs to handle partial return/error responses to
>> userspace (or potentially get stuck forever in the conversion path).
> 
> I think ARM needs userspace to specify separate source and destination
> memory ranges for initial population as ARM doesn't support in-place
> memory encryption. [1]

Indeed - CCA requires KVM to first "delegate" the page (effectively the
shared->private conversion) which will destroy the contents. Then we can
populate the data (but that obviously has to come from elsewhere).

The closest CCA can do to an in-place conversion is for the kernel to
copy the data to another temporary buffer and then the firmware can copy
it back after the delegation. An early version of the CCA Linux patches
did this (long before guest_memfd). However this is slower than it needs
to be (two copies) and difficult to size the temporary buffer. Too small
and you round-trip to the firmware more than you need to, too large and
you waste memory. And, with increasing support for huge pages in
guest_memfd and the CCA firmware (aka RMM), it's also challenging to
preserve huge pages while doing this dance so I want to avoid it if
possible.

> [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
> 
>>
>> So I agree with Ackerley's proposal (which I guess is the same as what's
>> in this series).
>>
>> However, 1 other alternative would be to do what was suggested on the
>> call, but require userspace to subsequently handle the shared->private
>> conversion. I think that would be workable too.
> 
> IIUC, Converting memory ranges to private after it essentially is
> treated as private by the KVM CC backend will expose the
> implementation to the same risk of userspace being able to access
> private memory and compromise host safety which guest_memfd was
> invented to address.

At least in the Arm CCA case the "exposure" of the private memory is
only in terms of allowing population - and only before the guest has
run. The host isn't able to access the memory in any direct way after
the memory has been delegated. But the RMM provides this populate method
to copy data into memory (in a measured/controlled manner).

From a CCA perspective the logical flow is to mark the memory as private
and then call the platform-specific function to populate the memory. But
obviously we can fit in a KVM API which is different.

Note that CCA has a specific VM property called 'RIPAS' (Realm IPA
State). This is the guest's view of whether memory exists at a
particular physical address. My current series takes the view that all
guest_memfd memory is private RAM and the guest will have to
specifically request that it is converted to shared. I'm hoping this
series might provide a way for the VMM to configure this (before the
guest starts executing).

Thanks,
Steve

>>
>> One other benefit to Ackerley's/current approach however is that it allows
>> us to potentially keep hugepages intact in the populate path, since
>> prep'ing/encrypting everything while it's in a shared state means gmem will
>> split the hugepage and all the firmware/RMP/etc. data structures will only
>> be able to handle individual 4K pages. I still suspect doing things like
>> encoding the initial 2MB OVMF image as a single hugepage might yield
>> enough benefit to explore this (at some point). So there's some niceness
>> in knowing that Ackerley's approach would allow for that eventually and
>> not require a complete rethink on these same topics.
>>
>> Thanks,
>>
>> Mike
>>
>>>
>>>>>>
>>>>>> [...snip...]
>>>>>>

^ permalink raw reply

* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Petr Mladek @ 2026-04-08 11:10 UTC (permalink / raw)
  To: Song Liu
  Cc: Yafang Shao, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
	rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
	yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
	bpf
In-Reply-To: <CAPhsuW7JhtbniZHFWGMrzeqdS=-EjCySFPgiOBv0zKJNRwzONA@mail.gmail.com>

On Tue 2026-04-07 16:09:39, Song Liu wrote:
> On Tue, Apr 7, 2026 at 8:08 AM Petr Mladek <pmladek@suse.com> wrote:
> [...]
> > > + * @replace:   replace tag:
> > > + *             = 0: Atomic replace is disabled; however, this patch remains
> > > + *                  eligible to be superseded by others.
> >
> > This is weird semantic. Which livepatch tag would be allowed to
> > supersede it, please?
> >
> > Do we still need this category?
> >
> > > + *             > 0: Atomic replace is enabled. Only existing patches with a
> > > + *                  matching replace tag will be superseded.
> > >   * @list:      list node for global list of actively used patches
> > >   * @kobj:      kobject for sysfs resources
> > >   * @obj_list:  dynamic list of the object entries
> > > @@ -137,7 +141,7 @@ struct klp_patch {
> > >         struct module *mod;
> > >         struct klp_object *objs;
> > >         struct klp_state *states;
> > > -       bool replace;
> > > +       unsigned int replace;
> >
> > This already breaks the backward compatibility by changing the type
> > and semantic of this field.
> 
> I was thinking if replace=0 means no replace, it is still backward
> compatible. Did I miss something?

IMHO, the semantic of the no-replace mode would be strange if
we introduce the hybrid mode. Especially, it would be strange when
it can be replaced by any livepatch with random replace tag/set.
Also it would just complicate the definition and detection of conflicts.

I am going to provide more details in the reply to Yafang.

Best Regards,
Petr

^ permalink raw reply

* [RFC v6 0/7] ext4: fast commit: snapshot inode state for FC log
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4,
	linux-trace-kernel, linux-kernel

Hi,

(This RFC v6 series is rebased onto linux-next master as of 2026-04-08,
commit f3e6330d7fe4 ("Add linux-next specific files for 20260407").)

Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
masks the issue, and that sleeping in ext4_fc_track_inode() while holding
i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
i_data_sem while the inode is in FC_COMMITTING.

Zhang Yi suggested two possible directions to address the root cause:

1. "Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.

  Change ext4_fc_commit() to lookup logical to physical mapping using extent
  status tree. This would get rid of the need to call ext4_fc_track_inode()
  before acquiring i_data_sem. To do that we would need to ensure that
  modified extents from the extent status tree are not evicted from memory."

2. "Alternatively, recording the mapped range of tracking might also be
feasible."

This series implements a hybrid way: it implements approach 2 by snapshotting inode image
and mapped ranges at commit time, and consuming only snapshots during log
writing.

Approach 2 still needs a mapping source while building the snapshot
(logical-to-physical and unwritten/hole semantics). Calling ext4_map_blocks()
there would take i_data_sem and can block inside the
jbd2_journal_lock_updates() window, which risks deadlocks or unbounded stalls.
So the snapshot path uses approach 1's extent status lookups as a best-effort
mapping source to avoid ext4_map_blocks().

I did not fully implement approach 1 (making extent status lookups
authoritative by preventing reclaim of needed entries) because that would need
additional pinning/integration under memory pressure and a larger correctness
surface. Instead, the extent status tree is treated as a cache and the
snapshot path falls back to full commit on cache misses or unstable mappings
(e.g. delayed allocation).

Lock inversion / deadlock model (before):

CPU0 (metadata update)               CPU1 (fast commit)
--------------------               -----------------
... hold i_data_sem (A)             mutex_lock(s_fc_lock) (B)
    ext4_fc_track_inode()             ext4_fc_write_inode_data()
      mutex_lock(s_fc_lock) (B)         ext4_map_blocks()
      wait FC_COMMITTING (sleep)          down_read(i_data_sem) (A)

This creates i_data_sem (A) -> s_fc_lock (B) on update paths, and
s_fc_lock (B) -> i_data_sem (A) on commit paths. Once CPU0 sleeps while
holding (A), CPU1 can block on (A) while holding (B), completing the ABBA
cycle.

New model (this series):

CPU0 (metadata update)               CPU1 (fast commit)
--------------------               -----------------
... maybe hold i_data_sem (A)        jbd2_journal_lock_updates()
    ext4_fc_track_*()                 snapshot inode + ranges (no map_blocks)
      mutex_lock(s_fc_lock) (B)       jbd2_journal_unlock_updates()
      if FC_COMMITTING: set FC_REQUEUE s_fc_lock (B)
      no sleep                         write FC log from snapshots only
                                    cleanup: clear COMMITTING, requeue if set

The commit path no longer takes i_data_sem while holding s_fc_lock, and
tracking no longer sleeps waiting for FC_COMMITTING. If an inode is updated
during a fast commit, EXT4_STATE_FC_REQUEUE records that fact and the inode
is moved to FC_Q_STAGING for the next commit.
The only remaining FC_COMMITTING waiter is ext4_fc_del(), which drops
s_fc_lock before sleeping.

This series snapshots the on-disk inode and tracked data ranges while journal
updates are locked and existing handles are drained. The log writing phase then
serializes only snapshots, so it no longer needs to call ext4_map_blocks() and
take i_data_sem under s_fc_lock. This is done in two steps: patch 1 drops
ext4_map_blocks() from log writing by introducing commit-time snapshots, and
patch 5 drops ext4_map_blocks() from the snapshot path by using the extent
status cache. The snapshot also records whether a mapped extent is unwritten,
so the ADD_RANGE records (and replay) preserve unwritten semantics.

Snapshotting runs under jbd2_journal_lock_updates(). Since a cache miss in
ext4_get_inode_loc() can start synchronous inode table I/O and stall handle
starts for milliseconds, patch 1 uses ext4_get_inode_loc_noio() and falls back
to full commit if the inode table block is not present or not uptodate.

ext4_fc_track_inode() also stops waiting for FC_COMMITTING. Updates during an
ongoing fast commit are marked with EXT4_STATE_FC_REQUEUE and are replayed in
the next fast commit, while ext4_fc_del() waits for FC_COMMITTING so an inode
cannot be removed while the commit thread is still using it.

The extent status tree is a cache, not an authoritative source, so the snapshot
path falls back to full commit on cache misses or unstable mappings (e.g.
delayed allocation). This includes cases where extent status entries are not
present (or have been reclaimed) under memory pressure. The snapshot path does
not try to rebuild mappings by calling ext4_map_blocks(); instead it simply
marks the transaction fast commit ineligible.

To keep the updates-locked window bounded, the snapshot path caps the number of
snapshotted inodes and ranges per fast commit (currently 1024 inodes and 2048
ranges) and falls back to full commit when the cap is exceeded. The series also
handles the journal inode i_data_sem lockdep false positive via subclassing;
journal inode mapping may still take i_data_sem even when data inode mapping is
avoided.

Patch 6 adds the ext4_fc_lock_updates tracepoint to quantify the updates-locked
window and snapshot fallback reasons. Patch 7 extends
/proc/fs/ext4/<sb_id>/fc_info with best-effort snapshot counters. If the /proc
interface is undesirable, I can drop patch 7 and keep the tracepoint only, or
drop even both.

Testing and measurement were done on a QEMU/KVM guest with virtio-pmem + dax
(ext4 -O fast_commit, mounted dax,noatime). The workload does python3 500x
{4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)}.
Over 3 cold boots, ext4_fc_lock_updates reported locked_ns p50 2.88-2.92 us,
p99 <= 6.71 us, and max <= 102.71 us, with snap_err always 0. Under stress-ng
memory pressure (stress-ng --vm 4 --vm-bytes 75% --timeout 60s), locked_ns p50
2.94 us, p99 <= 4.97 us, and max <= 20.07 us. The fc_info snapshot failure
counters stayed at 0.
These hold times are in the low microseconds range, and the caps keep the
worst case bounded.

Comments and guidance are very welcome. Please let me know if there are any
concerns about correctness, corner cases, or better approaches.

RFC v5 -> RFC v6:
- Rebase onto linux-next master as of 2026-04-08.
- Address tracepoint review feedback by relying on enum auto-increment for
  snap_err values and by switching the guarded ext4_fc_lock_updates call site
  to trace_call__ext4_fc_lock_updates() to avoid the double static_branch. [1]
- Keep lock window accounting unconditional for fc_info while using the guarded
  direct tracepoint call.
- Fix the inode debug print format exposed by the rebase.

RFC v4 -> RFC v5:
- Patch 6: Make ext4_fc_lock_updates snap_err human readable via
  TRACE_DEFINE_ENUM() + __print_symbolic(), using a single TRACE_SNAP_ERR
  mapping while keeping the enum values stable for tooling.

RFC v3 -> RFC v4:
- Replace lockdep_assert movement with removing the wait in
  ext4_fc_track_inode() and using EXT4_STATE_FC_REQUEUE to capture updates
  during an ongoing fast commit.
- Replace dropping s_fc_lock around log writing with commit-time snapshots of
  inode image and mapped ranges (recording the mapped range of tracking as
  suggested by Zhang Yi) so log writing consumes only snapshots.
- Avoid inode table I/O under jbd2_journal_lock_updates() via
  ext4_get_inode_loc_noio() and fallback to full commit on cache misses.
- Use the extent status cache for snapshot mappings and fall back to full
  commit on cache misses or unstable mappings (e.g. delayed allocation).
- Add tracepoint and /proc snapshot stats to quantify the updates-locked window
  and snapshot fallback reasons.

RFC v2 -> RFC v3:
- rebase on top of
  https://lore.kernel.org/linux-ext4/20251223131342.287864-1-me@linux.beauty/T/#u

RFC v1 -> RFC v2:
- patch 1: move comments to correct place
- patch 2: add it to patchset.
- add missing RFC prefix

RFC v1: https://lore.kernel.org/linux-ext4/20251222032655.87056-1-me@linux.beauty/T/#u
RFC v2: https://lore.kernel.org/linux-ext4/20251222151906.24607-1-me@linux.beauty/T/#t
RFC v3: https://lore.kernel.org/linux-ext4/20251224032943.134063-1-me@linux.beauty/
RFC v4: https://lore.kernel.org/all/20260120112538.132774-1-me@linux.beauty/
RFC v5: https://lore.kernel.org/all/20260317084624.457185-1-me@linux.beauty/t/#u

[1]: https://lore.kernel.org/all/acZJl8QUYEq8voqQ@BLRRASHENOY1.amd.com/T/#u

Thanks,

Li Chen (7):
  ext4: fast commit: snapshot inode state before writing log
  ext4: lockdep: handle i_data_sem subclassing for special inodes
  ext4: fast commit: avoid waiting for FC_COMMITTING
  ext4: fast commit: avoid self-deadlock in inode snapshotting
  ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in
    snapshots
  ext4: fast commit: add lock_updates tracepoint
  ext4: fast commit: export snapshot stats in fc_info

 fs/ext4/ext4.h              |  73 +++-
 fs/ext4/fast_commit.c       | 706 +++++++++++++++++++++++++++++-------
 fs/ext4/inode.c             |  51 +++
 fs/ext4/super.c             |   9 +
 include/trace/events/ext4.h |  61 ++++
 5 files changed, 766 insertions(+), 134 deletions(-)

-- 
2.53.0

^ permalink raw reply

* [RFC v6 1/7] ext4: fast commit: snapshot inode state before writing log
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Fix the inode debug print format after rebasing.

 fs/ext4/ext4.h        |  22 ++-
 fs/ext4/fast_commit.c | 331 +++++++++++++++++++++++++++++++++++-------
 fs/ext4/inode.c       |  51 +++++++
 3 files changed, 352 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0cf68f85dfd1..98857292c707 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1024,6 +1024,7 @@ enum {
 	I_DATA_SEM_EA
 };
 
+struct ext4_fc_inode_snap;
 
 /*
  * fourth extended file system inode data in memory
@@ -1080,6 +1081,22 @@ struct ext4_inode_info {
 	/* End of lblk range that needs to be committed in this fast commit */
 	ext4_lblk_t i_fc_lblk_len;
 
+	/*
+	 * Commit-time fast commit snapshots.
+	 *
+	 * i_fc_snap is installed and freed under sbi->s_fc_lock. The fast
+	 * commit log writing path reads the snapshot under sbi->s_fc_lock while
+	 * serializing fast commit TLVs.
+	 *
+	 * The snapshot lifetime is bounded by EXT4_STATE_FC_COMMITTING and the
+	 * corresponding cleanup / eviction paths.
+	 *
+	 * i_fc_snap points to per-inode snapshot data for fast commit:
+	 * - a raw inode snapshot for EXT4_FC_TAG_INODE
+	 * - data range records for EXT4_FC_TAG_{ADD,DEL}_RANGE
+	 */
+	struct ext4_fc_inode_snap *i_fc_snap;
+
 	spinlock_t i_raw_lock;	/* protects updates to the raw inode */
 
 	/* Fast commit wait queue for this inode */
@@ -3083,8 +3100,9 @@ extern int  ext4_file_getattr(struct mnt_idmap *, const struct path *,
 			      struct kstat *, u32, unsigned int);
 extern void ext4_dirty_inode(struct inode *, int);
 extern int ext4_change_inode_journal_flag(struct inode *, int);
-extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
-extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 390f25dee2b1..dc19795dacdd 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -55,21 +55,23 @@
  *     deleted while it is being flushed.
  * [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA"
  *     state.
- * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures that
- *     all the exsiting handles finish and no new handles can start.
- * [4] Mark all the fast commit eligible inodes as undergoing fast commit
- *     by setting "EXT4_STATE_FC_COMMITTING" state.
- * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This allows
+ * [3] Lock the journal by calling jbd2_journal_lock_updates(). This ensures
+ *     that all the existing handles finish and no new handles can start.
+ * [4] Mark all the fast commit eligible inodes as undergoing fast commit by
+ *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
+ *     needed for log writing.
+ * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
  *     starting of new handles. If new handles try to start an update on
  *     any of the inodes that are being committed, ext4_fc_track_inode()
  *     will block until those inodes have finished the fast commit.
  * [6] Commit all the directory entry updates in the fast commit space.
- * [7] Commit all the changed inodes in the fast commit space and clear
- *     "EXT4_STATE_FC_COMMITTING" for these inodes.
+ * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
  *     section for more details).
+ * [9] Clear "EXT4_STATE_FC_COMMITTING" and wake up waiters in
+ *     ext4_fc_cleanup().
  *
- * All the inode updates must be enclosed within jbd2_jounrnal_start()
+ * All the inode updates must be enclosed within jbd2_journal_start()
  * and jbd2_journal_stop() similar to JBD2 journaling.
  *
  * Fast Commit Ineligibility
@@ -199,6 +201,8 @@ static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 	unlock_buffer(bh);
 }
 
+static void ext4_fc_free_inode_snap(struct inode *inode);
+
 static inline void ext4_fc_reset_inode(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -215,6 +219,7 @@ void ext4_fc_init_inode(struct inode *inode)
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
+	ei->i_fc_snap = NULL;
 	init_waitqueue_head(&ei->i_fc_wait);
 }
 
@@ -240,6 +245,7 @@ void ext4_fc_del(struct inode *inode)
 
 	alloc_ctx = ext4_fc_lock(inode->i_sb);
 	if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) {
+		ext4_fc_free_inode_snap(inode);
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
@@ -281,6 +287,7 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
@@ -845,6 +852,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 	return true;
 }
 
+struct ext4_fc_range {
+	struct list_head list;
+	u16 tag;
+	ext4_lblk_t lblk;
+	ext4_lblk_t len;
+	ext4_fsblk_t pblk;
+	bool unwritten;
+};
+
+struct ext4_fc_inode_snap {
+	struct list_head data_list;
+	unsigned int inode_len;
+	u8 inode_buf[];
+};
+
 /*
  * Writes inode in the fast commit space under TLV with tag @tag.
  * Returns 0 on success, error on failure.
@@ -852,21 +874,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
-	int ret;
-	struct ext4_iloc iloc;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_inode fc_inode;
 	struct ext4_fc_tl tl;
 	u8 *dst;
+	u8 *src;
+	int inode_len;
+	int ret;
 
-	ret = ext4_get_inode_loc(inode, &iloc);
-	if (ret)
-		return ret;
+	if (!snap)
+		return -ECANCELED;
 
-	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
-		inode_len = EXT4_INODE_SIZE(inode->i_sb);
-	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
-		inode_len += ei->i_extra_isize;
+	src = snap->inode_buf;
+	inode_len = snap->inode_len;
+	if (!src || inode_len == 0)
+		return -ECANCELED;
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -882,10 +904,9 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	dst += EXT4_FC_TAG_BASE_LEN;
 	memcpy(dst, &fc_inode, sizeof(fc_inode));
 	dst += sizeof(fc_inode);
-	memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	memcpy(dst, src, inode_len);
 	ret = 0;
 err:
-	brelse(iloc.bh);
 	return ret;
 }
 
@@ -895,12 +916,74 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
  */
 static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 {
-	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_map_blocks map;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_add_range fc_ext;
 	struct ext4_fc_del_range lrange;
 	struct ext4_extent *ex;
+	struct ext4_fc_range *range;
+
+	if (!snap)
+		return -ECANCELED;
+
+	list_for_each_entry(range, &snap->data_list, list) {
+		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
+			lrange.fc_ino = cpu_to_le32(inode->i_ino);
+			lrange.fc_lblk = cpu_to_le32(range->lblk);
+			lrange.fc_len = cpu_to_le32(range->len);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+					     sizeof(lrange), (u8 *)&lrange, crc))
+				return -ENOSPC;
+			continue;
+		}
+
+		fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+		ex = (struct ext4_extent *)&fc_ext.fc_ex;
+		ex->ee_block = cpu_to_le32(range->lblk);
+		ex->ee_len = cpu_to_le16(range->len);
+		ext4_ext_store_pblock(ex, range->pblk);
+		if (range->unwritten)
+			ext4_ext_mark_unwritten(ex);
+		else
+			ext4_ext_mark_initialized(ex);
+
+		if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+				     sizeof(fc_ext), (u8 *)&fc_ext, crc))
+			return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static void ext4_fc_free_ranges(struct list_head *head)
+{
+	struct ext4_fc_range *range, *range_n;
+
+	list_for_each_entry_safe(range, range_n, head, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+static void ext4_fc_free_inode_snap(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+
+	if (!snap)
+		return;
+
+	ext4_fc_free_ranges(&snap->data_list);
+	kfree(snap);
+	ei->i_fc_snap = NULL;
+}
+
+static int ext4_fc_snapshot_inode_data(struct inode *inode,
+				       struct list_head *ranges)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	struct ext4_map_blocks map;
 	int ret;
 
 	spin_lock(&ei->i_fc_lock);
@@ -908,18 +991,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 		spin_unlock(&ei->i_fc_lock);
 		return 0;
 	}
-	old_blk_size = ei->i_fc_lblk_start;
-	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+	start_lblk = ei->i_fc_lblk_start;
+	end_lblk = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
 	ei->i_fc_lblk_len = 0;
 	spin_unlock(&ei->i_fc_lock);
 
-	cur_lblk_off = old_blk_size;
-	ext4_debug("will try writing %d to %d for inode %llu\n",
-		   cur_lblk_off, new_blk_size, inode->i_ino);
+	cur_lblk = start_lblk;
+	ext4_debug("snapshot data ranges %u-%u for inode %llu\n",
+		   start_lblk, end_lblk,
+		   (unsigned long long)inode->i_ino);
+
+	while (cur_lblk <= end_lblk) {
+		struct ext4_fc_range *range;
 
-	while (cur_lblk_off <= new_blk_size) {
-		map.m_lblk = cur_lblk_off;
-		map.m_len = new_blk_size - cur_lblk_off + 1;
+		map.m_lblk = cur_lblk;
+		map.m_len = end_lblk - cur_lblk + 1;
 		ret = ext4_map_blocks(NULL, inode, &map,
 				      EXT4_GET_BLOCKS_IO_SUBMIT |
 				      EXT4_EX_NOCACHE);
@@ -927,17 +1013,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			return -ECANCELED;
 
 		if (map.m_len == 0) {
-			cur_lblk_off++;
+			cur_lblk++;
 			continue;
 		}
 
+		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (!range)
+			return -ENOMEM;
+
+		range->lblk = map.m_lblk;
+		range->len = map.m_len;
+		range->pblk = 0;
+		range->unwritten = false;
+
 		if (ret == 0) {
-			lrange.fc_ino = cpu_to_le32(inode->i_ino);
-			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
-			lrange.fc_len = cpu_to_le32(map.m_len);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
-					    sizeof(lrange), (u8 *)&lrange, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_DEL_RANGE;
 		} else {
 			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
 				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
@@ -945,26 +1035,67 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			/* Limit the number of blocks in one extent */
 			map.m_len = min(max, map.m_len);
 
-			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
-			ex = (struct ext4_extent *)&fc_ext.fc_ex;
-			ex->ee_block = cpu_to_le32(map.m_lblk);
-			ex->ee_len = cpu_to_le16(map.m_len);
-			ext4_ext_store_pblock(ex, map.m_pblk);
-			if (map.m_flags & EXT4_MAP_UNWRITTEN)
-				ext4_ext_mark_unwritten(ex);
-			else
-				ext4_ext_mark_initialized(ex);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
-					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_ADD_RANGE;
+			range->len = map.m_len;
+			range->pblk = map.m_pblk;
+			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
 		}
 
-		cur_lblk_off += map.m_len;
+		INIT_LIST_HEAD(&range->list);
+		list_add_tail(&range->list, ranges);
+
+		cur_lblk += map.m_len;
 	}
 
 	return 0;
 }
 
+static int ext4_fc_snapshot_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap;
+	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+	struct ext4_iloc iloc;
+	LIST_HEAD(ranges);
+	int ret;
+	int alloc_ctx;
+
+	ret = ext4_get_inode_loc_noio(inode, &iloc);
+	if (ret)
+		return ret;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
+		inode_len = EXT4_INODE_SIZE(inode->i_sb);
+	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+		inode_len += ei->i_extra_isize;
+
+	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
+	if (!snap) {
+		brelse(iloc.bh);
+		return -ENOMEM;
+	}
+	INIT_LIST_HEAD(&snap->data_list);
+	snap->inode_len = inode_len;
+
+	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	brelse(iloc.bh);
+
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	if (ret) {
+		kfree(snap);
+		ext4_fc_free_ranges(&ranges);
+		return ret;
+	}
+
+	alloc_ctx = ext4_fc_lock(inode->i_sb);
+	ext4_fc_free_inode_snap(inode);
+	ei->i_fc_snap = snap;
+	list_splice_tail_init(&ranges, &snap->data_list);
+	ext4_fc_unlock(inode->i_sb, alloc_ctx);
+
+	return 0;
+}
+
 
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
@@ -1015,6 +1146,11 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 		 */
 		if (list_empty(&fc_dentry->fcd_dilist))
 			continue;
+		/*
+		 * For EXT4_FC_TAG_CREAT, fcd_dilist is linked on the created
+		 * inode's i_fc_dilist list (kept singular), so we can recover the
+		 * inode through it.
+		 */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				struct ext4_inode_info, i_fc_dilist);
 		inode = &ei->vfs_inode;
@@ -1039,6 +1175,88 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
+static int ext4_fc_snapshot_inodes(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct inode **inodes;
+	unsigned int nr_inodes = 0;
+	unsigned int i = 0;
+	int ret = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	if (!nr_inodes)
+		return 0;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		inodes[i] = igrab(&iter->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		inodes[i] = igrab(&ei->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+		if (ret)
+			break;
+	}
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		if (inodes[nr_inodes])
+			iput(inodes[nr_inodes]);
+	}
+	kvfree(inodes);
+	return ret;
+}
+
 static int ext4_fc_perform_commit(journal_t *journal)
 {
 	struct super_block *sb = journal->j_private;
@@ -1111,7 +1329,11 @@ static int ext4_fc_perform_commit(journal_t *journal)
 				     EXT4_STATE_FC_COMMITTING);
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
+
+	ret = ext4_fc_snapshot_inodes(journal);
 	jbd2_journal_unlock_updates(journal);
+	if (ret)
+		return ret;
 
 	/*
 	 * Step 5: If file system device is different from journal device,
@@ -1308,6 +1530,7 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
+		ext4_fc_free_inode_snap(&ei->vfs_inode);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (tid_geq(tid, ei->i_sync_tid)) {
@@ -1343,6 +1566,14 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					     struct ext4_fc_dentry_update,
 					     fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
+		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
+		    !list_empty(&fc_dentry->fcd_dilist)) {
+			/* See the comment in ext4_fc_commit_dentry_updates(). */
+			ei = list_first_entry(&fc_dentry->fcd_dilist,
+					      struct ext4_inode_info,
+					      i_fc_dilist);
+			ext4_fc_free_inode_snap(&ei->vfs_inode);
+		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
 		release_dentry_name_snapshot(&fc_dentry->fcd_name);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 64dba7679245..478c81e6d08b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4939,6 +4939,57 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 	return ret;
 }
 
+/*
+ * ext4_get_inode_loc_noio() is a best-effort variant of ext4_get_inode_loc().
+ * It looks up the inode table block in the buffer cache and returns -EAGAIN if
+ * the block is not present or not uptodate, without starting any I/O.
+ */
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *bh;
+	ext4_fsblk_t block;
+	int inodes_per_block, inode_offset;
+	unsigned long ino = inode->i_ino;
+
+	iloc->bh = NULL;
+	if (ino < EXT4_ROOT_INO ||
+	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+		return -EFSCORRUPTED;
+
+	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/* Figure out the offset within the block group inode table. */
+	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	inode_offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb));
+	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+
+	block = ext4_inode_table(sb, gdp);
+	if (block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) ||
+	    block >= ext4_blocks_count(EXT4_SB(sb)->s_es)) {
+		ext4_error(sb,
+			   "Invalid inode table block %llu in block_group %u",
+			   block, iloc->block_group);
+		return -EFSCORRUPTED;
+	}
+	block += inode_offset / inodes_per_block;
+
+	bh = sb_find_get_block(sb, block);
+	if (!bh)
+		return -EAGAIN;
+	if (!ext4_buffer_uptodate(bh)) {
+		brelse(bh);
+		return -EAGAIN;
+	}
+
+	iloc->bh = bh;
+	return 0;
+}
+
 
 int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc)
-- 
2.53.0

^ permalink raw reply related

* [RFC v6 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.

Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
  without changing the subclassing logic.

 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 98857292c707..66de888ae411 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1016,12 +1016,14 @@ do {										\
  *			  than the first
  *  I_DATA_SEM_QUOTA  - Used for quota inodes only
  *  I_DATA_SEM_EA     - Used for ea_inodes only
+ *  I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL = 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
 
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 578508eb4f1a..286f05834900 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1425,6 +1425,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
 
@@ -5904,6 +5907,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		  journal_inode, journal_inode->i_size);
 	return journal_inode;
-- 
2.53.0

^ permalink raw reply related

* [RFC v6 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.

Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h        |   1 +
 fs/ext4/fast_commit.c | 111 ++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 66de888ae411..13fe4fdf9bda 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1995,6 +1995,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_FC_REQUEUE,		/* Inode modified during fast commit */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index dc19795dacdd..5ed884cc4b5c 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -61,9 +61,8 @@
  *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
  *     needed for log writing.
  * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
- *     starting of new handles. If new handles try to start an update on
- *     any of the inodes that are being committed, ext4_fc_track_inode()
- *     will block until those inodes have finished the fast commit.
+ *     starting of new handles. Updates to inodes being fast committed are
+ *     tracked for requeue rather than blocking.
  * [6] Commit all the directory entry updates in the fast commit space.
  * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
@@ -217,6 +216,7 @@ void ext4_fc_init_inode(struct inode *inode)
 
 	ext4_fc_reset_inode(inode);
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	ei->i_fc_snap = NULL;
@@ -251,22 +251,30 @@ void ext4_fc_del(struct inode *inode)
 	}
 
 	/*
-	 * Since ext4_fc_del is called from ext4_evict_inode while having a
-	 * handle open, there is no need for us to wait here even if a fast
-	 * commit is going on. That is because, if this inode is being
-	 * committed, ext4_mark_inode_dirty would have waited for inode commit
-	 * operation to finish before we come here. So, by the time we come
-	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
-	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
-	 * here.
-	 *
-	 * We may come here without any handles open in the "no_delete" case of
-	 * ext4_evict_inode as well. However, if that happens, we first mark the
-	 * file system as fast commit ineligible anyway. So, even in that case,
-	 * it is okay to remove the inode from the fc list.
+	 * Wait for ongoing fast commit to finish. We cannot remove the inode
+	 * from fast commit lists while it is being committed.
 	 */
-	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
-		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			ext4_fc_unlock(inode->i_sb, alloc_ctx);
+			schedule();
+			alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
 #if (BITS_PER_LONG < 64)
 		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
@@ -287,19 +295,22 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+
 	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
-	 * Since this inode is getting removed, let's also remove all FC
-	 * dentry create references, since it is not needed to log it anyways.
+	 * Since this inode is getting removed, let's also remove all FC dentry
+	 * create references, since it is not needed to log it anyways.
 	 */
 	if (list_empty(&ei->i_fc_dilist)) {
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
 
-	fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist);
+	fc_dentry = list_first_entry(&ei->i_fc_dilist,
+				     struct ext4_fc_dentry_update,
+				     fcd_dilist);
 	WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT);
 	list_del_init(&fc_dentry->fcd_list);
 	list_del_init(&fc_dentry->fcd_dilist);
@@ -371,6 +382,8 @@ static int ext4_fc_track_template(
 
 	tid = handle->h_transaction->t_tid;
 	spin_lock(&ei->i_fc_lock);
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -557,8 +570,6 @@ static int __track_inode(handle_t *handle, struct inode *inode, void *arg,
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	wait_queue_head_t *wq;
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -577,29 +588,11 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		return;
 
 	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
+	 * Fast commit snapshots inode state at commit time, so there's no need
+	 * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already
+	 * on the commit queue, ext4_fc_cleanup() will requeue it for the new
+	 * transaction once the current commit finishes.
 	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-#if (BITS_PER_LONG < 64)
-		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_state_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#else
-		DEFINE_WAIT_BIT(wait, &ei->i_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#endif
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
-			schedule();
-		finish_wait(wq, &wait.wq_entry);
-	}
 
 	/*
 	 * From this point on, this inode will not be committed either
@@ -1526,32 +1519,32 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) {
+		bool requeue;
+
 		ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN],
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
 		ext4_fc_free_inode_snap(&ei->vfs_inode);
+		spin_lock(&ei->i_fc_lock);
+		if (full)
+			requeue = !tid_geq(tid, ei->i_sync_tid);
+		else
+			requeue = ext4_test_inode_state(&ei->vfs_inode,
+							EXT4_STATE_FC_REQUEUE);
+		if (!requeue)
+			ext4_fc_reset_inode(&ei->vfs_inode);
+		ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
-		if (tid_geq(tid, ei->i_sync_tid)) {
-			ext4_fc_reset_inode(&ei->vfs_inode);
-		} else if (full) {
-			/*
-			 * We are called after a full commit, inode has been
-			 * modified while the commit was running. Re-enqueue
-			 * the inode into STAGING, which will then be splice
-			 * back into MAIN. This cannot happen during
-			 * fastcommit because the journal is locked all the
-			 * time in that case (and tid doesn't increase so
-			 * tid check above isn't reliable).
-			 */
+		spin_unlock(&ei->i_fc_lock);
+		if (requeue)
 			list_add_tail(&ei->i_fc_list,
 				      &sbi->s_fc_q[FC_Q_STAGING]);
-		}
 		/*
 		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
 		 * visible before we send the wakeup. Pairs with implicit
-		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
 #if (BITS_PER_LONG < 64)
-- 
2.53.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox