* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: Andrew Morton @ 2026-05-08 23:47 UTC (permalink / raw)
To: qiwu.chen
Cc: rostedt, mhiramat, hannes, david, mhocko, willy,
linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>
On Wed, 6 May 2026 16:36:52 +0800 "qiwu.chen" <qiwuchen55@gmail.com> wrote:
> From: "qiwu.chen" <qiwuchen55@gmail.com>
> Signed-off-by: qiwu.chen <qiwu.chen@transsion.com>
Which should we use? If it's the transsion.com address (which I
assumed) then this can be communicated by placing an explicit From:
line at start-of-changelog.
^ permalink raw reply
* Re: [PATCH v5 2/2] blk-mq: expose tag starvation counts via debugfs
From: kernel test robot @ 2026-05-09 0:12 UTC (permalink / raw)
To: Aaron Tomlin, axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: llvm, oe-kbuild-all, bvanassche, johannes.thumshirn, kch, dlemoal,
ritesh.list, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260427020142.358912-3-atomlin@atomlin.com>
Hi Aaron,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe/for-next]
[also build test ERROR on trace/for-next linus/master v7.1-rc2 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Tomlin/blk-mq-add-tracepoint-block_rq_tag_wait/20260428-013259
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link: https://lore.kernel.org/r/20260427020142.358912-3-atomlin%40atomlin.com
patch subject: [PATCH v5 2/2] blk-mq: expose tag starvation counts via debugfs
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260509/202605090233.BZVa3x9s-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260509/202605090233.BZVa3x9s-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605090233.BZVa3x9s-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from block/blk-core.c:45:
In file included from include/trace/events/block.h:726:
In file included from include/trace/define_trace.h:132:
In file included from include/trace/trace_events.h:468:
>> include/trace/events/block.h:255:47: error: expected ':'
255 | __entry->dev = q->disk ? disk_devt(q->disk);
| ^
| :
include/trace/stages/stage6_event_callback.h:133:33: note: expanded from macro 'TP_fast_assign'
133 | #define TP_fast_assign(args...) args
| ^
include/trace/trace_events.h:44:16: note: expanded from macro 'TRACE_EVENT'
44 | PARAMS(assign), \
| ^
include/linux/tracepoint.h:160:25: note: expanded from macro 'PARAMS'
160 | #define PARAMS(args...) args
| ^
include/trace/trace_events.h:435:16: note: expanded from macro 'DECLARE_EVENT_CLASS'
435 | PARAMS(assign), PARAMS(print)) \
| ^
include/linux/tracepoint.h:160:25: note: expanded from macro 'PARAMS'
160 | #define PARAMS(args...) args
| ^
include/trace/trace_events.h:427:4: note: expanded from macro '\
__DECLARE_EVENT_CLASS'
427 | { assign; } \
| ^
include/trace/events/block.h:255:27: note: to match this '?'
255 | __entry->dev = q->disk ? disk_devt(q->disk);
| ^
>> include/trace/events/block.h:255:47: error: expected expression
255 | __entry->dev = q->disk ? disk_devt(q->disk);
| ^
In file included from block/blk-core.c:45:
In file included from include/trace/events/block.h:726:
In file included from include/trace/define_trace.h:133:
In file included from include/trace/perf.h:110:
>> include/trace/events/block.h:255:47: error: expected ':'
255 | __entry->dev = q->disk ? disk_devt(q->disk);
| ^
| :
include/trace/stages/stage6_event_callback.h:133:33: note: expanded from macro 'TP_fast_assign'
133 | #define TP_fast_assign(args...) args
| ^
include/trace/trace_events.h:44:16: note: expanded from macro 'TRACE_EVENT'
44 | PARAMS(assign), \
| ^
include/linux/tracepoint.h:160:25: note: expanded from macro 'PARAMS'
160 | #define PARAMS(args...) args
| ^
include/trace/perf.h:67:16: note: expanded from macro 'DECLARE_EVENT_CLASS'
67 | PARAMS(assign), PARAMS(print)) \
| ^
include/linux/tracepoint.h:160:25: note: expanded from macro 'PARAMS'
160 | #define PARAMS(args...) args
| ^
include/trace/perf.h:51:4: note: expanded from macro '\
__DECLARE_EVENT_CLASS'
51 | { assign; } \
| ^
include/trace/events/block.h:255:27: note: to match this '?'
255 | __entry->dev = q->disk ? disk_devt(q->disk);
| ^
>> include/trace/events/block.h:255:47: error: expected expression
255 | __entry->dev = q->disk ? disk_devt(q->disk);
| ^
4 errors generated.
vim +255 include/trace/events/block.h
242
243 TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
244
245 TP_ARGS(q, hctx, is_sched_tag),
246
247 TP_STRUCT__entry(
248 __field( dev_t, dev )
249 __field( u32, hctx_id )
250 __field( u32, nr_tags )
251 __field( bool, is_sched_tag )
252 ),
253
254 TP_fast_assign(
> 255 __entry->dev = q->disk ? disk_devt(q->disk);
256 __entry->hctx_id = hctx->queue_num;
257 __entry->is_sched_tag = is_sched_tag;
258
259 if (is_sched_tag)
260 __entry->nr_tags = hctx->sched_tags->nr_tags;
261 else
262 __entry->nr_tags = hctx->tags->nr_tags;
263 ),
264
265 TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
266 MAJOR(__entry->dev), MINOR(__entry->dev),
267 __entry->hctx_id,
268 __entry->is_sched_tag ? "scheduler" : "hardware",
269 __entry->nr_tags)
270 );
271
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Andrii Nakryiko @ 2026-05-09 0:30 UTC (permalink / raw)
To: bpf
Cc: linux-trace-kernel, jolsa, oleg, peterz, mingo, mhiramat,
Andrii Nakryiko
The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
probe site with a CALL into a uprobe trampoline. CALL pushes a return
address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
user code may keep temporary data without adjusting rsp.
Use a 5-byte JMP instead. JMP does not write to the user stack, but it
also does not provide a return address. Replace the single trampoline
entry with a page of 16-byte slots. Each optimized probe jumps to its
assigned slot, the slot moves rsp below the red zone, saves the registers
clobbered by syscall, and invokes the uprobe syscall:
Probe site: jmp slot_N (5B, replaces nop5)
Slot N: lea -128(%rsp), %rsp (5B) skip red zone
push %rcx (1B) save (syscall clobbers)
push %r11 (2B) save (syscall clobbers)
push %rax (1B) save (syscall uses for nr)
mov $336, %eax (5B) uprobe syscall number
syscall (2B)
All slots contain identical code at different offsets, so the trampoline
page is generated once at boot and mapped read-execute into each process.
The syscall handler identifies the slot from regs->ip, which points just
after the syscall instruction, and uses a per-mm slot table to recover the
original probe address.
The uprobe syscall does not return to the trampoline slot. The handler
restores the probe-site register state, runs the uprobe consumers, sets
pt_regs to continue at probe_addr + 5 unless a consumer redirected
execution, and returns directly through the IRET path. This preserves
general purpose registers, including rcx and r11, without requiring any
post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
shadow stack concerns.
Protect the per-mm trampoline list with RCU and free trampoline metadata
with kfree_rcu(). This lets the syscall path resolve trampoline slots
without taking mmap_lock. The optimized-instruction detection path also
walks the trampoline list under an RCU read-side lock. Since that path
starts from the JMP target, it translates the slot start to the post-syscall
IP expected by the shared resolver before checking the trampoline mapping.
Each trampoline page provides 256 slots. Slots stay permanently assigned
to their first probe address and are reused only when the same address is
probed again. Reassigning detached slots is deliberately avoided because a
thread can remain in a trampoline for an unbounded time due to ptrace,
interrupts, or scheduling delays. If a reachable trampoline page runs out
of slots, probes that cannot allocate a slot fall back to the slower INT3
path.
Require the entire trampoline page to be reachable by a rel32 JMP before
reusing it for a probe. This keeps every slot in the page within the range
that can be encoded at the probe site.
Change the error code returned when the uprobe syscall is invoked outside
a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
similar libraries distinguish fixed kernels from kernels with the
red-zone-clobbering implementation and enable nop5 optimization only on
fixed kernels.
Performance (usdt single-thread, M/s):
usdt-nop usdt-nop5-base usdt-nop5-fix nop5-change iret%
Skylake 3.149 6.422 4.865 -24.3% 39.1%
Milan 2.910 3.443 3.820 +11.0% 24.3%
Sapphire Rapids 1.896 4.023 3.693 -8.2% 24.9%
Bergamo 3.393 3.895 3.849 -1.2% 24.5%
The fixed nop5 path remains faster than the non-optimized INT3 path on all
measured systems. The regression relative to the old CALL-based trampoline
comes from IRET being more expensive than SYSRET, most noticeably on older
Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
and AMD Milan improves because removing mmap_lock from the hot path more
than offsets the IRET cost.
Multi-threaded throughput scales nearly linearly with the number of CPUs, like
it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
arch/x86/include/asm/uprobes.h | 18 ++
arch/x86/kernel/uprobes.c | 262 ++++++++++--------
tools/lib/bpf/features.c | 8 +-
.../selftests/bpf/prog_tests/uprobe_syscall.c | 5 +-
tools/testing/selftests/bpf/prog_tests/usdt.c | 2 +-
5 files changed, 181 insertions(+), 114 deletions(-)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 362210c79998..a7cf5c92d95a 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -25,6 +25,24 @@ enum {
ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
};
+/*
+ * Trampoline page layout: identical 16-byte slots, each containing:
+ * lea -128(%rsp), %rsp (5B) skip red zone
+ * push %rcx (1B) save (syscall clobbers)
+ * push %r11 (2B) save (syscall clobbers)
+ * push %rax (1B) save (syscall uses for nr)
+ * mov $336, %eax (5B) uprobe syscall number
+ * syscall (2B)
+ * = 16B, no padding needed
+ *
+ * The handler identifies which probe fired from regs->ip (each
+ * slot is at a unique offset), looks up the probe address from a
+ * per-process table, and returns directly to probe_addr+5 via iret
+ * with all registers restored.
+ */
+#define UPROBE_TRAMP_SLOT_SIZE 16
+#define UPROBE_TRAMP_MAX_SLOTS (PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
+
struct uprobe_xol_ops;
struct arch_uprobe {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..7e1f14200bbb 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
struct uprobe_trampoline {
struct hlist_node node;
+ struct rcu_head rcu;
unsigned long vaddr;
+ unsigned long probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
};
-static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
+
+static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
{
- long delta = (long)(vaddr + 5 - vtramp);
+ long delta = (long)(dst - (src + JMP32_INSN_SIZE));
return delta >= INT_MIN && delta <= INT_MAX;
}
+static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
+{
+ return is_reachable_by_jmp(vtramp, vaddr) &&
+ is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
+}
+
static unsigned long find_nearest_trampoline(unsigned long vaddr)
{
struct vm_unmapped_area_info info = {
@@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
return tramp;
}
+static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
+{
+ int i;
+
+ for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
+ if (tramp->probe_addrs[i] == probe_addr)
+ return i;
+ if (tramp->probe_addrs[i] == 0) {
+ tramp->probe_addrs[i] = probe_addr;
+ return i;
+ }
+ }
+ return -ENOSPC;
+}
+
static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
{
struct uprobes_state *state = ¤t->mm->uprobes_state;
@@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
return NULL;
hlist_for_each_entry(tramp, &state->head_tramps, node) {
- if (is_reachable_by_call(tramp->vaddr, vaddr)) {
+ if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
*new = false;
return tramp;
}
@@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
return NULL;
*new = true;
- hlist_add_head(&tramp->node, &state->head_tramps);
+ hlist_add_head_rcu(&tramp->node, &state->head_tramps);
return tramp;
}
@@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
* because there's no easy way to make sure none of the threads
* is still inside the trampoline.
*/
- hlist_del(&tramp->node);
- kfree(tramp);
+ hlist_del_rcu(&tramp->node);
+ kfree_rcu(tramp, rcu);
}
void arch_uprobe_init_state(struct mm_struct *mm)
@@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
destroy_uprobe_trampoline(tramp);
}
-static bool __in_uprobe_trampoline(unsigned long ip)
+/*
+ * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
+ * resolve the slot index from @ip and return the probe address.
+ *
+ * @ip is expected to point right after the syscall instruction, i.e.,
+ * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
+ */
+static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
{
- struct vm_area_struct *vma = vma_lookup(current->mm, ip);
+ struct uprobes_state *state = ¤t->mm->uprobes_state;
+ struct uprobe_trampoline *tramp;
- return vma && vma_is_special_mapping(vma, &tramp_mapping);
-}
+ hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
+ /*
+ * ip points to after syscall, so it's on 16 byte boundary,
+ * which means that valid ip can point right after the page
+ * and should never be at zero offset within the page
+ */
+ if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
+ continue;
-static bool in_uprobe_trampoline(unsigned long ip)
-{
- struct mm_struct *mm = current->mm;
- bool found, retry = true;
- unsigned int seq;
+ if (probe_addr) {
+ /* we already validated ip is within expected range */
+ unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
+ unsigned long addr = tramp->probe_addrs[slot];
- rcu_read_lock();
- if (mmap_lock_speculate_try_begin(mm, &seq)) {
- found = __in_uprobe_trampoline(ip);
- retry = mmap_lock_speculate_retry(mm, seq);
- }
- rcu_read_unlock();
+ *probe_addr = addr;
+ if (addr == 0)
+ return false;
+ }
- if (retry) {
- mmap_read_lock(mm);
- found = __in_uprobe_trampoline(ip);
- mmap_read_unlock(mm);
+ return true;
}
- return found;
+ return false;
+}
+
+static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
+{
+ guard(rcu)();
+ return resolve_uprobe_addr(ip, probe_addr);
}
/*
- * See uprobe syscall trampoline; the call to the trampoline will push
- * the return address on the stack, the trampoline itself then pushes
- * cx, r11 and ax.
+ * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
+ * before doing the uprobe syscall. No return address is pushed — the
+ * probe site uses jmp, not call.
*/
struct uprobe_syscall_args {
unsigned long ax;
unsigned long r11;
unsigned long cx;
- unsigned long retaddr;
};
+#define UPROBE_TRAMP_REDZONE 128
+
SYSCALL_DEFINE0(uprobe)
{
struct pt_regs *regs = task_pt_regs(current);
struct uprobe_syscall_args args;
- unsigned long ip, sp, sret;
+ unsigned long probe_addr;
int err;
/* Allow execution only from uprobe trampolines. */
- if (!in_uprobe_trampoline(regs->ip))
- return -ENXIO;
+ if (!in_uprobe_trampoline(regs->ip, &probe_addr))
+ return -EPROTO;
err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
if (err)
goto sigill;
- ip = regs->ip;
-
/*
- * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
- * - adjust ip to the probe address, call saved next instruction address
- * - adjust sp to the probe's stack frame (check trampoline code)
+ * Restore the register state as it was at the probe site:
+ * - ax/r11/cx from the trampoline-saved copies on user stack
+ * - adjust ip to the probe address based on matching slot
+ * - adjust sp to skip red zone and pushed args
*/
regs->ax = args.ax;
regs->r11 = args.r11;
regs->cx = args.cx;
- regs->ip = args.retaddr - 5;
- regs->sp += sizeof(args);
+ regs->ip = probe_addr;
+ regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
regs->orig_ax = -1;
- sp = regs->sp;
-
- err = shstk_pop((u64 *)&sret);
- if (err == -EFAULT || (!err && sret != args.retaddr))
- goto sigill;
-
- handle_syscall_uprobe(regs, regs->ip);
+ handle_syscall_uprobe(regs, probe_addr);
/*
- * Some of the uprobe consumers has changed sp, we can do nothing,
- * just return via iret.
+ * Skip the jmp instruction at the probe site (5 bytes) unless
+ * a consumer redirected execution elsewhere.
*/
- if (regs->sp != sp) {
- /* skip the trampoline call */
- if (args.retaddr - 5 == regs->ip)
- regs->ip += 5;
- return regs->ax;
- }
+ if (regs->ip == probe_addr)
+ regs->ip = probe_addr + 5;
- regs->sp -= sizeof(args);
-
- /* for the case uprobe_consumer has changed ax/r11/cx */
- args.ax = regs->ax;
- args.r11 = regs->r11;
- args.cx = regs->cx;
-
- /* keep return address unless we are instructed otherwise */
- if (args.retaddr - 5 != regs->ip)
- args.retaddr = regs->ip;
-
- if (shstk_push(args.retaddr) == -EFAULT)
- goto sigill;
-
- regs->ip = ip;
-
- err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
- if (err)
- goto sigill;
-
- /* ensure sysret, see do_syscall_64() */
- regs->r11 = regs->flags;
- regs->cx = regs->ip;
- return 0;
+ /*
+ * Return via iret by returning regs->ax. This preserves all
+ * GP registers (including cx and r11) without needing any
+ * user-space cleanup code. The iret path is used because we
+ * don't set up cx/r11 for sysret.
+ */
+ return regs->ax;
sigill:
force_sig(SIGILL);
return -1;
}
+/*
+ * All uprobe trampoline slots are identical: skip the red zone,
+ * save the three registers that syscall clobbers, then invoke
+ * the uprobe syscall. The handler returns directly to the probe
+ * caller via iret. Execution never returns to the trampoline.
+ */
asm (
".pushsection .rodata\n"
- ".balign " __stringify(PAGE_SIZE) "\n"
- "uprobe_trampoline_entry:\n"
+ ".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
+ "uprobe_trampoline_slot:\n"
+ "lea -128(%rsp), %rsp\n"
"push %rcx\n"
"push %r11\n"
"push %rax\n"
- "mov $" __stringify(__NR_uprobe) ", %rax\n"
+ "mov $" __stringify(__NR_uprobe) ", %eax\n"
"syscall\n"
- "pop %rax\n"
- "pop %r11\n"
- "pop %rcx\n"
- "ret\n"
- "int3\n"
- ".balign " __stringify(PAGE_SIZE) "\n"
+ "uprobe_trampoline_slot_end:\n"
".popsection\n"
);
-extern u8 uprobe_trampoline_entry[];
+extern u8 uprobe_trampoline_slot[];
+extern u8 uprobe_trampoline_slot_end[];
static int __init arch_uprobes_init(void)
{
- tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
+ unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
+ struct page *page;
+ u8 *page_addr;
+ int i;
+
+ BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
+ WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+
+ page_addr = page_address(page);
+ for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
+ memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
+
+ tramp_mapping_pages[0] = page;
return 0;
}
@@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
enum {
EXPECT_SWBP,
- EXPECT_CALL,
+ EXPECT_JMP,
};
struct write_opcode_ctx {
@@ -917,14 +947,14 @@ struct write_opcode_ctx {
int expect;
};
-static int is_call_insn(uprobe_opcode_t *insn)
+static int is_jmp_insn(uprobe_opcode_t *insn)
{
- return *insn == CALL_INSN_OPCODE;
+ return *insn == JMP32_INSN_OPCODE;
}
/*
* Verification callback used by int3_update uprobe_write calls to make sure
- * the underlying instruction is as expected - either int3 or call.
+ * the underlying instruction is as expected - either int3 or jmp.
*/
static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
int nbytes, void *data)
@@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
if (is_swbp_insn(&old_opcode[0]))
return 1;
break;
- case EXPECT_CALL:
- if (is_call_insn(&old_opcode[0]))
+ case EXPECT_JMP:
+ if (is_jmp_insn(&old_opcode[0]))
return 1;
break;
}
@@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
* so we can skip this step for optimize == true.
*/
if (!optimize) {
- ctx.expect = EXPECT_CALL;
+ ctx.expect = EXPECT_JMP;
err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
true /* is_register */, false /* do_update_ref_ctr */,
&ctx);
@@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
}
static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
- unsigned long vaddr, unsigned long tramp)
+ unsigned long vaddr, unsigned long slot_vaddr)
{
- u8 call[5];
+ u8 jmp[5];
- __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
- (const void *) tramp, CALL_INSN_SIZE);
- return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+ __text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
+ (const void *) slot_vaddr, JMP32_INSN_SIZE);
+ return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
}
static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
@@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
struct __packed __arch_relative_insn {
u8 op;
s32 raddr;
- } *call = (struct __arch_relative_insn *) insn;
+ } *jmp = (struct __arch_relative_insn *) insn;
- if (!is_call_insn(insn))
+ if (!is_jmp_insn(&jmp->op))
return false;
- return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+
+ guard(rcu)();
+ /*
+ * resolve_uprobe_addr() expects IP pointing after syscall instruction
+ * (after the slot, basically), so adjust jump target address accordingly
+ */
+ return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
}
static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
@@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
{
struct uprobe_trampoline *tramp;
struct vm_area_struct *vma;
+ unsigned long slot_vaddr;
bool new = false;
- int err = 0;
+ int slot, err;
vma = find_vma(mm, vaddr);
if (!vma)
@@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
tramp = get_uprobe_trampoline(vaddr, &new);
if (!tramp)
return -EINVAL;
- err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
- if (WARN_ON_ONCE(err) && new)
+
+ slot = tramp_alloc_slot(tramp, vaddr);
+ if (slot < 0) {
+ if (new)
+ destroy_uprobe_trampoline(tramp);
+ return slot;
+ }
+
+ slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
+ err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
+ if (err && new)
destroy_uprobe_trampoline(tramp);
return err;
}
diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
index 4f19a0d79b0c..1b6c113357b2 100644
--- a/tools/lib/bpf/features.c
+++ b/tools/lib/bpf/features.c
@@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
static int probe_uprobe_syscall(int token_fd)
{
/*
- * If kernel supports uprobe() syscall, it will return -ENXIO when called
- * from the outside of a kernel-generated uprobe trampoline.
+ * If kernel supports uprobe() syscall, it will return -EPROTO when
+ * called from outside a kernel-generated uprobe trampoline.
+ * Older kernels with the red-zone-clobbering bug return -ENXIO;
+ * we only enable the nop5 optimization on fixed kernels.
*/
- return syscall(__NR_uprobe) < 0 && errno == ENXIO;
+ return syscall(__NR_uprobe) < 0 && errno == EPROTO;
}
#else
static int probe_uprobe_syscall(int token_fd)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 955a37751b52..0d5eb4cd1ddf 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
/* .. and check the trampoline is as expected. */
call = (struct __arch_relative_insn *) addr;
tramp = (void *) (call + 1) + call->raddr;
- ASSERT_EQ(call->op, 0xe8, "call");
+ tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
+ ASSERT_EQ(call->op, 0xe9, "jmp");
ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
return tramp;
@@ -762,7 +763,7 @@ static void test_uprobe_error(void)
long err = syscall(__NR_uprobe);
ASSERT_EQ(err, -1, "error");
- ASSERT_EQ(errno, ENXIO, "errno");
+ ASSERT_EQ(errno, EPROTO, "errno");
}
static void __test_uprobe_syscall(void)
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 69759b27794d..9d3744d4e936 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
ASSERT_EQ(*addr_2, 0x90, "nop");
/* call is on addr_2 + 1 address */
- ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
+ ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
ASSERT_EQ(skel->bss->executed, 4, "executed");
cleanup:
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering
From: Andrii Nakryiko @ 2026-05-09 0:30 UTC (permalink / raw)
To: bpf
Cc: linux-trace-kernel, jolsa, oleg, peterz, mingo, mhiramat,
Andrii Nakryiko
In-Reply-To: <20260509003146.976844-1-andrii@kernel.org>
The uprobe nop5 optimization used to replace a 5-byte NOP with a 5-byte
CALL to a trampoline. The CALL pushes a return address onto the stack at
[rsp-8], clobbering whatever was stored there.
On x86-64, the red zone is the 128 bytes below rsp that user code may use
for temporary storage without adjusting rsp. Compilers can place USDT
argument operands there, generating specs like "8@-8(%rbp)" when rbp ==
rsp. With the CALL-based optimization, the return address overwrites that
argument before the BPF-side USDT argument fetch runs.
Add two tests for this case. The uprobe_syscall subtest stores known values
at -8(%rsp), -16(%rsp), and -24(%rsp), executes an optimized nop5 uprobe,
and verifies the red-zone data is still intact. The USDT subtest triggers a
probe in a function where the compiler places three USDT operands in the
red zone and verifies that all 10 optimized invocations deliver the expected
argument values to BPF.
On an unfixed kernel, the first hit goes through the INT3 path and later
hits use the optimized CALL path, so the red-zone checks fail after
optimization.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 75 ++++++++++++++++++-
tools/testing/selftests/bpf/prog_tests/usdt.c | 46 ++++++++++++
tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++++
tools/testing/selftests/bpf/usdt_2.c | 13 ++++
4 files changed, 158 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 0d5eb4cd1ddf..6c651e4ff49a 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -357,6 +357,46 @@ __nocf_check __weak void usdt_test(void)
USDT(optimized_uprobe, usdt);
}
+/*
+ * Assembly-level red zone clobbering test. Stores known values in the
+ * red zone (below RSP), executes a nop5 (uprobe site), and checks that
+ * the values survived. Returns 0 if intact, 1 if clobbered.
+ *
+ * If the nop5 optimization uses CALL (which pushes a return address to
+ * [rsp-8]), the value at -8(%rsp) gets overwritten.
+ */
+__attribute__((aligned(16)))
+__nocf_check __weak __naked unsigned long uprobe_red_zone_test(void)
+{
+ asm volatile (
+ "movabs $0x1111111111111111, %%rax\n"
+ "movq %%rax, -8(%%rsp)\n"
+ "movabs $0x2222222222222222, %%rax\n"
+ "movq %%rax, -16(%%rsp)\n"
+ "movabs $0x3333333333333333, %%rax\n"
+ "movq %%rax, -24(%%rsp)\n"
+
+ ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n" /* nop5: uprobe site */
+
+ "movabs $0x1111111111111111, %%rax\n"
+ "cmpq %%rax, -8(%%rsp)\n"
+ "jne 1f\n"
+ "movabs $0x2222222222222222, %%rax\n"
+ "cmpq %%rax, -16(%%rsp)\n"
+ "jne 1f\n"
+ "movabs $0x3333333333333333, %%rax\n"
+ "cmpq %%rax, -24(%%rsp)\n"
+ "jne 1f\n"
+
+ "xorl %%eax, %%eax\n"
+ "retq\n"
+ "1:\n"
+ "movl $1, %%eax\n"
+ "retq\n"
+ ::: "rax", "memory"
+ );
+}
+
static int find_uprobes_trampoline(void *tramp_addr)
{
void *start, *end;
@@ -394,7 +434,7 @@ static void *find_nop5(void *fn)
{
int i;
- for (i = 0; i < 10; i++) {
+ for (i = 0; i < 128; i++) {
if (!memcmp(nop5, fn + i, 5))
return fn + i;
}
@@ -758,6 +798,37 @@ static void test_uprobe_race(void)
#define __NR_uprobe 336
#endif
+static void test_uprobe_red_zone(void)
+{
+ struct uprobe_syscall_executed *skel;
+ struct bpf_link *link;
+ void *nop5_addr;
+ size_t offset;
+ int i;
+
+ nop5_addr = find_nop5(uprobe_red_zone_test);
+ if (!ASSERT_NEQ(nop5_addr, NULL, "find_nop5"))
+ return;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ offset = get_uprobe_offset(nop5_addr);
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "attach_uprobe"))
+ goto cleanup;
+
+ for (i = 0; i < 10; i++)
+ ASSERT_EQ(uprobe_red_zone_test(), 0, "red_zone_intact");
+
+ bpf_link__destroy(link);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
static void test_uprobe_error(void)
{
long err = syscall(__NR_uprobe);
@@ -784,6 +855,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_usdt();
if (test__start_subtest("uprobe_race"))
test_uprobe_race();
+ if (test__start_subtest("uprobe_red_zone"))
+ test_uprobe_red_zone();
if (test__start_subtest("uprobe_error"))
test_uprobe_error();
if (test__start_subtest("uprobe_regs_equal"))
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 9d3744d4e936..5e607773d5cc 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -250,6 +250,7 @@ static void subtest_basic_usdt(bool optimized)
#ifdef __x86_64__
extern void usdt_1(void);
extern void usdt_2(void);
+extern void usdt_red_zone_trigger(void);
static unsigned char nop1[1] = { 0x90 };
static unsigned char nop1_nop5_combo[6] = { 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
@@ -335,6 +336,49 @@ static void subtest_optimized_attach(void)
cleanup:
test_usdt__destroy(skel);
}
+
+/*
+ * Test that USDT arguments survive nop5 optimization in a function where
+ * the compiler places operands in the red zone.
+ *
+ * Signal handlers are prone to having the compiler place USDT argument
+ * operands in the red zone (below rsp). When nop5 is optimized to a
+ * call instruction, the call pushes the return address to [rsp-8],
+ * potentially clobbering the argument.
+ */
+static void subtest_optimized_red_zone(void)
+{
+ struct test_usdt *skel;
+ int i;
+
+ skel = test_usdt__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ skel->bss->expected_arg[0] = 0xDEADBEEF;
+ skel->bss->expected_arg[1] = 0xCAFEBABE;
+ skel->bss->expected_arg[2] = 0xFEEDFACE;
+ skel->bss->expected_pid = getpid();
+
+ skel->links.usdt_check_arg = bpf_program__attach_usdt(
+ skel->progs.usdt_check_arg, 0, "/proc/self/exe",
+ "optimized_attach", "usdt_red_zone", NULL);
+ if (!ASSERT_OK_PTR(skel->links.usdt_check_arg, "attach_usdt_red_zone"))
+ goto cleanup;
+
+ for (i = 0; i < 10; i++)
+ usdt_red_zone_trigger();
+
+ ASSERT_EQ(skel->bss->arg_total, 10, "arg_total");
+ ASSERT_EQ(skel->bss->arg_bad, 0, "arg_bad");
+ ASSERT_EQ(skel->bss->arg_last[0], 0xDEADBEEF, "arg_last_1");
+ ASSERT_EQ(skel->bss->arg_last[1], 0xCAFEBABE, "arg_last_2");
+ ASSERT_EQ(skel->bss->arg_last[2], 0xFEEDFACE, "arg_last_3");
+
+cleanup:
+ test_usdt__destroy(skel);
+}
+
#endif
unsigned short test_usdt_100_semaphore SEC(".probes");
@@ -608,6 +652,8 @@ void test_usdt(void)
subtest_basic_usdt(true);
if (test__start_subtest("optimized_attach"))
subtest_optimized_attach();
+ if (test__start_subtest("optimized_red_zone"))
+ subtest_optimized_red_zone();
#endif
if (test__start_subtest("multispec"))
subtest_multispec_usdt();
diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
index f00cb52874e0..0ee78fb050a1 100644
--- a/tools/testing/selftests/bpf/progs/test_usdt.c
+++ b/tools/testing/selftests/bpf/progs/test_usdt.c
@@ -149,5 +149,30 @@ int usdt_executed(struct pt_regs *ctx)
executed++;
return 0;
}
+
+int arg_total;
+int arg_bad;
+long arg_last[3];
+long expected_arg[3];
+int expected_pid;
+
+SEC("usdt")
+int BPF_USDT(usdt_check_arg, long arg1, long arg2, long arg3)
+{
+ if (expected_pid != (bpf_get_current_pid_tgid() >> 32))
+ return 0;
+
+ __sync_fetch_and_add(&arg_total, 1);
+ arg_last[0] = arg1;
+ arg_last[1] = arg2;
+ arg_last[2] = arg3;
+
+ if (arg1 != expected_arg[0] ||
+ arg2 != expected_arg[1] ||
+ arg3 != expected_arg[2])
+ __sync_fetch_and_add(&arg_bad, 1);
+
+ return 0;
+}
#endif
char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
index 789883aaca4c..fc7e6d220a38 100644
--- a/tools/testing/selftests/bpf/usdt_2.c
+++ b/tools/testing/selftests/bpf/usdt_2.c
@@ -13,4 +13,17 @@ void usdt_2(void)
USDT(optimized_attach, usdt_2);
}
+static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
+static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
+static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
+
+void __attribute__((noinline)) usdt_red_zone_trigger(void)
+{
+ unsigned long a1 = usdt_red_zone_arg1;
+ unsigned long a2 = usdt_red_zone_arg2;
+ unsigned long a3 = usdt_red_zone_arg3;
+
+ USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
+}
+
#endif
--
2.53.0-Meta
^ permalink raw reply related
* Re: [PATCH v1 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Praveen Talari @ 2026-05-09 1:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, Greg Kroah-Hartman,
Jiri Slaby, Konrad Dybcio, linux-kernel, linux-trace-kernel,
linux-arm-msm, linux-serial, Mukesh Kumar Savaliya,
Aniket Randive, chandana.chiluveru, jyothi.seerapu
In-Reply-To: <20260508132543.4f100ae0@gandalf.local.home>
On 08-05-2026 22:55, Steven Rostedt wrote:
> On Wed, 06 May 2026 22:54:44 +0530
> Praveen Talari <praveen.talari@oss.qualcomm.com> wrote:
>
>> +TRACE_EVENT(geni_serial_tx_data,
>> + TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
>> + TP_ARGS(dev, buf, len),
>> +
>> + TP_STRUCT__entry(__string(name, dev_name(dev))
>> + __field(unsigned int, len)
>> + __dynamic_array(u8, data, len)
>> + ),
>> +
>> + TP_fast_assign(__assign_str(name);
>> + __entry->len = len;
>> + memcpy(__get_dynamic_array(data), buf, len);
>> + ),
>> +
>> + TP_printk("%s: tx_len=%u data=%s",
>> + __get_str(name), __entry->len,
>> + __print_hex(__get_dynamic_array(data), __entry->len))
>> +);
>> +
>> +TRACE_EVENT(geni_serial_rx_data,
>> + TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
>> + TP_ARGS(dev, buf, len),
>> +
>> + TP_STRUCT__entry(__string(name, dev_name(dev))
>> + __field(unsigned int, len)
>> + __dynamic_array(u8, data, len)
>> + ),
>> +
>> + TP_fast_assign(__assign_str(name);
>> + __entry->len = len;
>> + memcpy(__get_dynamic_array(data), buf, len);
>> + ),
>> +
>> + TP_printk("%s: rx_len=%u data=%s",
> Do you really need to say "tx_len" and "rx_len", could it just be "len" and
> have the name of the tracepoint show what it is?
Sure. I will review and update next patch.
>
> Each TRACE_EVENT() is really just a:
>
> DECLARE_EVENT_CLASS() followed by a DEFINE_EVENT()
>
> underneath.
>
> And each TRACE_EVENT() costs around 5K in size, where most of that is in
> the DECLARE_EVENT_CLASS() portion. Thus, you can save some memory by using
> DECLARE_EVENT_CLASS() and then define the above two events with
> DEFINE_EVENT().
Thank you for suggestion and will update in next patch.
Thanks,
Praveen Talari
>
> -- Steve
>
>
>> + __get_str(name), __entry->len,
>> + __print_hex(__get_dynamic_array(data), __entry->len))
>> +);
>> +
^ permalink raw reply
* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: Andrew Morton @ 2026-05-09 1:29 UTC (permalink / raw)
To: qiwu.chen
Cc: rostedt, mhiramat, hannes, david, mhocko, willy,
linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>
On Wed, 6 May 2026 16:36:52 +0800 "qiwu.chen" <qiwuchen55@gmail.com> wrote:
> Currently, reclaim_flags always contains RECLAIM_WB_ASYNC in lru_shrink
> tracepoints since commit 41ac1999c3e35 ("mm: vmscan: do not stall on
> writeback during memory compaction"), which is useless for debugging
> memory pressure issues. Other RECLAIM_WB_* flags are not used anywhere
> else, so they can be directly removed.
> This patch reworks the lru_shrink and write_folio tracepoints for better
> correlation and analysis:
> - traces each folio lru type instead of reclaim_flags.
> - traces each lru_shrink with reason.
> - remove the printing of the unnecessary PFN for mm_vmscan_write_folio.
Applying this to 7.1-rc1, my x86_64 allmodconfig blew up.
In file included from ./include/trace/define_trace.h:132,
from ./include/trace/events/vmscan.h:602,
from mm/vmscan.c:72:
./include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
./include/trace/events/vmscan.h:358:19: error: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Werror=format=]
358 | TP_printk("folio=%p lru=%s",
| ^~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
219 | trace_event_printf(iter, print); \
| ^~~~~
./include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
45 | PARAMS(print)); \
| ^~~~~~
./include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
./include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
358 | TP_printk("folio=%p lru=%s",
| ^~~~~~~~~
In file included from ./include/trace/trace_events.h:256:
./include/trace/events/vmscan.h:358:27: note: format string is defined here
358 | TP_printk("folio=%p lru=%s",
| ~^
| |
| void *
| %ld
./include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
./include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
354 | __entry->folio = folio;
| ^
./include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
427 | { assign; } \
| ^~~~~~
./include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
435 | PARAMS(assign), PARAMS(print)) \
| ^~~~~~
./include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
40 | DECLARE_EVENT_CLASS(name, \
| ^~~~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
44 | PARAMS(assign), \
| ^~~~~~
./include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
./include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
353 | TP_fast_assign(
| ^~~~~~~~~~~~~~
In file included from ./include/trace/define_trace.h:133:
./include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
./include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
354 | __entry->folio = folio;
| ^
./include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
51 | { assign; } \
| ^~~~~~
./include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
67 | PARAMS(assign), PARAMS(print)) \
| ^~~~~~
./include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
40 | DECLARE_EVENT_CLASS(name, \
| ^~~~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
44 | PARAMS(assign), \
| ^~~~~~
./include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
./include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
353 | TP_fast_assign(
| ^~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[3]: *** [scripts/Makefile.build:289: mm/vmscan.o] Error 1
make[2]: *** [scripts/Makefile.build:548: mm] Error 2
make[1]: *** [/usr/src/25/Makefile:2141: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2
^ permalink raw reply
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-09 2:07 UTC (permalink / raw)
To: Mark Brown
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
MukeshKumarSavaliyamukesh.savaliya, AniketRandiveaniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <af3spostNgoRU0Vv@sirena.co.uk>
On 08-05-2026 19:31, Mark Brown wrote:
> On Thu, May 07, 2026 at 11:03:39PM +0530, Praveen Talari wrote:
>> On 07-05-2026 13:43, Mark Brown wrote:
>>> By generic I mean this should not be driver specific at all.
>> I hope these changes are fine. Please let me know if you have any concerns
>> or feedback.
> The data tracepoints look plausible but I would expect them to be
> generated by the core, they'll be there for everything so I'd expect
Thank you for the clarification. I now understand your point clearly.
Could you also please review the changes made in spi.c ?
I would appreciate any feedback or suggestions you may have.
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 91dd831d2d3b..f0d3665412fe 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1658,6 +1658,11 @@ static int spi_transfer_one_message(struct
spi_controller *ctlr,
trace_spi_transfer_stop(msg, xfer);
+ if (spi_valid_txbuf(msg, xfer))
+ trace_spi_tx_data(msg->spi, xfer->tx_buf,
xfer->len);
+ if (spi_valid_rxbuf(msg, xfer))
+ trace_spi_rx_data(msg->spi, xfer->rx_buf,
xfer->len);
+
if (msg->status != -EINPROGRESS)
goto out;
Thanks,
Praveen Talari
> them to work for everything.
^ permalink raw reply related
* [PATCH] mm/damon: fix incorrect field name in damos_stat_after_apply_interval
From: Yu Qin @ 2026-05-09 3:49 UTC (permalink / raw)
To: sj, rostedt, mhiramat; +Cc: damon, linux-trace-kernel, Yu Qin
Fix the format string to use 'sz_applied' instead of 'sz_tried' to correctly
represent the applied size.
Signed-off-by: Yu Qin <qin.yuA@h3c.com>
---
include/trace/events/damon.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 24fc402ab..7e25f4469 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -41,7 +41,7 @@ TRACE_EVENT(damos_stat_after_apply_interval,
),
TP_printk("ctx_idx=%u scheme_idx=%u nr_tried=%lu sz_tried=%lu "
- "nr_applied=%lu sz_tried=%lu sz_ops_filter_passed=%lu "
+ "nr_applied=%lu sz_applied=%lu sz_ops_filter_passed=%lu "
"qt_exceeds=%lu nr_snapshots=%lu",
__entry->context_idx, __entry->scheme_idx,
__entry->nr_tried, __entry->sz_tried,
--
2.33.0
^ permalink raw reply related
* [PATCH v2 2/4] docs: fix repeated word 'that' across documentation
From: Adrien Reynard @ 2026-05-09 14:30 UTC (permalink / raw)
To: paulmck, corbet, gregkh, dhowells, mhiramat
Cc: frederic, neeraj.upadhyay, joelagnelf, josh, boqun, urezki,
rostedt, skhan, rafael, dakr, pc, rcu, linux-doc, driver-core,
netfs, linux-fsdevel, linux-trace-kernel, linux-kernel,
Adrien Reynard
In-Reply-To: <20260508163759.16231-1-reynard.adrien.08@gmail.com>
Remove duplicated word 'that' found in RCU/rcu.rst,
driver-api/driver-model/overview.rst,
filesystems/netfs_library.rst, trace/histogram-design.rst
and trace/histogram.rst.
Signed-off-by: Adrien Reynard <reynard.adrien.08@gmail.com>
---
Documentation/RCU/rcu.rst | 2 +-
Documentation/driver-api/driver-model/overview.rst | 2 +-
Documentation/filesystems/netfs_library.rst | 2 +-
Documentation/trace/histogram-design.rst | 2 +-
Documentation/trace/histogram.rst | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/Documentation/RCU/rcu.rst b/Documentation/RCU/rcu.rst
index bf6617b330a7..110f3d42cace 100644
--- a/Documentation/RCU/rcu.rst
+++ b/Documentation/RCU/rcu.rst
@@ -32,7 +32,7 @@ Frequently Asked Questions
Just as with spinlocks, RCU readers are not permitted to
block, switch to user-mode execution, or enter the idle loop.
Therefore, as soon as a CPU is seen passing through any of these
- three states, we know that that CPU has exited any previous RCU
+ three states, we know that the CPU has exited any previous RCU
read-side critical sections. So, if we remove an item from a
linked list, and then wait until all CPUs have switched context,
executed in user mode, or executed in the idle loop, we can
diff --git a/Documentation/driver-api/driver-model/overview.rst b/Documentation/driver-api/driver-model/overview.rst
index b3f447bf9f07..4360cd5200be 100644
--- a/Documentation/driver-api/driver-model/overview.rst
+++ b/Documentation/driver-api/driver-model/overview.rst
@@ -55,7 +55,7 @@ struct pci_dev now looks like this::
Note first that the struct device dev within the struct pci_dev is
statically allocated. This means only one allocation on device discovery.
-Note also that that struct device dev is not necessarily defined at the
+Note also that the struct device dev is not necessarily defined at the
front of the pci_dev structure. This is to make people think about what
they're doing when switching between the bus driver and the global driver,
and to discourage meaningless and incorrect casts between the two.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index ddd799df6ce3..715218e1b233 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -626,7 +626,7 @@ A number of members are available for access/use by the filesystem:
These are set by the filesystem or the cache in ->prepare_read() or
->prepare_write() for each subrequest to indicate the maximum number of
- bytes and, optionally, the maximum number of segments (if not 0) that that
+ bytes and, optionally, the maximum number of segments (if not 0) that the
subrequest can support.
* ``submit_extendable_to``
diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
index e92f56ebd0b5..c25587f411f2 100644
--- a/Documentation/trace/histogram-design.rst
+++ b/Documentation/trace/histogram-design.rst
@@ -738,7 +738,7 @@ creates its own variable, wakeup_lat, but nothing yet uses it::
Looking at the sched_waking 'hist_debug' output, in addition to the
normal key and value hist_fields, in the val fields section we see a
-field with the HIST_FIELD_FL_VAR flag, which indicates that that field
+field with the HIST_FIELD_FL_VAR flag, which indicates that the field
represents a variable. Note that in addition to the variable name,
contained in the var.name field, it includes the var.idx, which is the
index into the tracing_map_elt.vars[] array of the actual variable
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 340bcb5099e7..ca619499ce28 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -1700,7 +1700,7 @@ to that rule is that any variable used in an expression is essentially
'read-once' - once it's used by an expression in a subsequent event,
it's reset to its 'unset' state, which means it can't be used again
unless it's set again. This ensures not only that an event doesn't
-use an uninitialized variable in a calculation, but that that variable
+use an uninitialized variable in a calculation, but that the variable
is used only once and not for any unrelated subsequent match.
The basic syntax for saving a variable is to simply prefix a unique
--
2.54.0
^ permalink raw reply related
* Re: [PATCH] mm/damon: fix incorrect field name in damos_stat_after_apply_interval
From: SeongJae Park @ 2026-05-09 15:19 UTC (permalink / raw)
To: Yu Qin; +Cc: SeongJae Park, rostedt, mhiramat, damon, linux-trace-kernel
In-Reply-To: <20260509034951.3757618-1-qin.yuA@h3c.com>
On Sat, 9 May 2026 11:49:51 +0800 Yu Qin <qin.yuA@h3c.com> wrote:
> Fix the format string to use 'sz_applied' instead of 'sz_tried' to correctly
> represent the applied size.
Thank you for kindly fixing this, Yu!
However, same fix is already queued [1] for next merge window. So
unfortunately we may not able to merge this patch.
Looking forward to your next patch, though!
[1] https://lore.kernel.org/mm-commits/20260426200250.BE164C2BCAF@smtp.kernel.org/
Thanks,
SJ
[...]
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC] Private Memory Nodes - follow up
From: Gregory Price @ 2026-05-09 16:38 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, ying.huang,
apopple, axelrasmussen, yuanchu, weixugc, yury.norov, linux,
mhiramat, mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj,
baolin.wang, npache, ryan.roberts, dev.jain, baohua, lance.yang,
muchun.song, xu.xin16, chengming.zhou, jannh, linmiaohe,
nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel, harry.yoo,
cl, roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
zhengqi.arch, terry.bowman
In-Reply-To: <20260222084842.1824063-1-gourry@gourry.net>
Just wanting to follow up post-conference with a few major take-aways
since I will be a bit sparse during May / Early June (so want to not
forget, and garner a bit of input on the notes).
If you just want the tl;dr:
0) naming: private -> managed
1) remove global general "possible" and "online" node lists
2) add consistency with "normal" nodes, by opting them all in
to all the new things, and just making that the new normal.
e.g.: node_is_private_managed -> node_is_lru_eligible
3) Have __init add init time nodes to all the lists
Otherwise service/owner must add/enable services.
4) Make folio checks just way more explicit per service
e.g.: folio_is_private_managed -> folio_is_ksm_eligible
5) I still think w/o __GFP_PRIVATE this will still be too fragile,
but we're going to give it a try.
6) No callbacks in the MVP
7) MVP will be, essentially, Buddy + MBind support
Otherwise, more notes below.
~Gregory
<wall of text>
0) Naming is hard. Willy and Liam expressed concern over "private".
We briefly discussed "Managed"
This results in the following changes:
- if (folio_is_zone_device(folio))
+ if (folio_is_managed(folio))
and
+ if (node_is_managed(nid))
and
- N_MEMORY_PRIVATE
+ N_MEMORY_MANAGED
I'm less enthused the last one, but i'm ok with it.
1) There is a desire to fix possible / online node masks to avoid
bad patterns, and maybe to audit existing nodemask users.
there's one UAPI issue with this, and that that these masks
are exposed to userland by nature of existing node attributes
(N_MEMORY, N_CPU, N_POSSIBLE, etc).
I'm considering a name change from `possible` -> `init`, because
that's mostly how it is used (initialize some set of per-node
resources during __init, not at runtime). Externally, this set
would still be reported to uapi as possible.
2) There was concern about inconsistency towards nodes.
Along the lines of #1 - I'm thinking about actually adding explicit
service nodelists, which are populated at boot by __init, and by
hotplug if it's a general purpose node.
So we'd end up with things like:
for_each_ksm_node
for_each_lru_node
for_each_x_node
And we would retire such general defines like
for_each_node
for_each_online_node
For any "normal" node, it lands in all the lists.
For the buddy, we would have
for_buddy_node
For the default buddy-node list, and otherwise "managed" nodes would
still be removed from the standard fallback lists.
This means these nodes cannot be reached via nodemask arguments, and
can only be reached by `alloc_pages_node(nid, ...)` nid argument.
I *think* might resolve __GFP_PRIVATE.
But it's still dependent on system-wide for_each good behavior.
3) How do private nodes get into the lists in the new system?
For any private node, the registering driver (owner) and the managing
service are responsible for adding/removing the nodes from the list.
Example workflow:
0) CXL driver hotplug: add_memory_driver_managed(..., nid, owner)
a) owner=NULL means general purpose node
b) otherwise, reserve nid and (pgdat->owner = owner)
1) hotplug memory onto the node
a) if node is normal, add to all service lists
b) if node is "managed" (private), omit from all lists
2) CXL driver registers node with specific services, e.g.:
cram_register_node(..., nid, owner);
3) Service sets node enabled in appropriate node list, and starts
any appropriate services (kswapd, kcompactd, etc) for that node.
In some cases, nodes would have individual mappings onto services
(cram), in other cases the intent would be to have the memory
otherwise treated as general-purpose, but with special access
patterns (e.g. an LRU node not marked N_MEMORY).
4) There are still concerns about random hooks around the kernel.
My thought is to make this less "random", and more a change
in the way we think about folio operations / node operations
for ALL nodes.
ZONE_DEVICE has a bunch of implicit filtering due to not being
on the LRU - but the intent is to allow flexible LRU membership.
So what if we just made these checks much more explict overall
if (folio_is_ksm_eligible(folio)) /* can be merged */
if (folio_is_lru_eligible(folio)) /* managed by lru services */
if (folio_is_demotion_eligible(folio)) /* demotion target */
if (folio_is_mbind_eligible(folio)) /* can be an mbind target */
Rather than rathole over what the set of bits should be, i think it's
more important to determine what the actual operation here will be.
right now I have this defined as essentially:
folio_pgdat(folio)->private.ops.mask & NP_OPT_KSM
But if we generalize to all nodes / all features, it's essentially
a per-pgdat bitmask lookup:
bool folio_is_ksm_eligible(folio)) {
return test_bit(N_FEATURE_KSM, folio_pgdat(folio)->features);
}
With the bonus that all ZONE_DEVICE hooks can be sunk into these
checks, so there are many places in mm/ where this becomes essentially
a single-line change.
5) Lacking __GFP_PRIVATE, I have concern over fragility.
Previously, __GFP_PRIVATE created a "default opt-out" mechanism.
I *think* the above nodelist changes, specifically removing:
for_each_node()
for_each_online_node()
for_each_node_with_cpus()
The problem I foresee is with existing node_state masks, like
node_state((node), N_POSSIBLE)
node_state((node), N_CPU)
This might be tractable, but it may also simply be too fragile.
Right now only 3 or 4 locations use node_state() outside mm/, and
I'm tempted to try to sink these into mm/internal.h instead of
include/linux/nodemask.h. If that becomes unpalletable, then I will
lobby for __GFP_PRIVATE again (I may still anyway :P).
6) No callbacks by default, but nothing technically prevents it.
I was already in the process of killing this. I think mmu_notifier
does *most* of what the callbacks where doing anyway, so we can
probably collapse that.
7) David asked me to limit the MVP to Buddy + MBind support.
There's some odd interactions with pagecache, so that might evolve
too (may not be able to reliably fault a file directly onto a private
node, tbd - mempolicy does not apply to page cache faults, so it's
just unreliable).
</wall of text>
~Gregory
^ permalink raw reply
* Re: [PATCH v2 2/4] docs: fix repeated word 'that' across documentation
From: Jonathan Corbet @ 2026-05-09 21:12 UTC (permalink / raw)
To: Adrien Reynard, paulmck, gregkh, dhowells, mhiramat
Cc: frederic, neeraj.upadhyay, joelagnelf, josh, boqun, urezki,
rostedt, skhan, rafael, dakr, pc, rcu, linux-doc, driver-core,
netfs, linux-fsdevel, linux-trace-kernel, linux-kernel,
Adrien Reynard
In-Reply-To: <20260509143050.16458-1-reynard.adrien.08@gmail.com>
Adrien Reynard <reynard.adrien.08@gmail.com> writes:
> Remove duplicated word 'that' found in RCU/rcu.rst,
> driver-api/driver-model/overview.rst,
> filesystems/netfs_library.rst, trace/histogram-design.rst
> and trace/histogram.rst.
>
> Signed-off-by: Adrien Reynard <reynard.adrien.08@gmail.com>
> ---
> Documentation/RCU/rcu.rst | 2 +-
> Documentation/driver-api/driver-model/overview.rst | 2 +-
> Documentation/filesystems/netfs_library.rst | 2 +-
> Documentation/trace/histogram-design.rst | 2 +-
> Documentation/trace/histogram.rst | 2 +-
> 5 files changed, 5 insertions(+), 5 deletions(-)
When you send an updated version of a patch, please:
- Post it as its own thread, rather than as a response to a previous
version
- Include a note after the "---" line explaining the changes from the
previous version.
Thanks,
jon
^ permalink raw reply
* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: chenqiwu @ 2026-05-10 0:27 UTC (permalink / raw)
To: Andrew Morton
Cc: rostedt, mhiramat, hannes, david, mhocko, willy,
linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260508164743.ef0194a1223d457bd0e13494@linux-foundation.org>
On Fri, May 08, 2026 at 04:47:43PM -0700, Andrew Morton wrote:
> On Wed, 6 May 2026 16:36:52 +0800 "qiwu.chen" <qiwuchen55@gmail.com> wrote:
>
> > From: "qiwu.chen" <qiwuchen55@gmail.com>
> > Signed-off-by: qiwu.chen <qiwu.chen@transsion.com>
>
> Which should we use? If it's the transsion.com address (which I
> assumed) then this can be communicated by placing an explicit From:
> line at start-of-changelog.
>
gmail address is used for communication, transsion.com is used for SOB
which cannot communicated with msmtp normally.
^ permalink raw reply
* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: chenqiwu @ 2026-05-10 0:36 UTC (permalink / raw)
To: Andrew Morton
Cc: rostedt, mhiramat, hannes, david, mhocko, willy,
linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260508182932.ee1931dec680db80a8a8e660@linux-foundation.org>
On Fri, May 08, 2026 at 06:29:32PM -0700, Andrew Morton wrote:
>
> Applying this to 7.1-rc1, my x86_64 allmodconfig blew up.
>
>
>
> In file included from ./include/trace/define_trace.h:132,
> from ./include/trace/events/vmscan.h:602,
> from mm/vmscan.c:72:
> ./include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
> ./include/trace/events/vmscan.h:358:19: error: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Werror=format=]
> 358 | TP_printk("folio=%p lru=%s",
> | ^~~~~~~~~~~~~~~~~
> ./include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
> 219 | trace_event_printf(iter, print); \
> | ^~~~~
> ./include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
> 45 | PARAMS(print)); \
> | ^~~~~~
> ./include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
> 342 | TRACE_EVENT(mm_vmscan_write_folio,
> | ^~~~~~~~~~~
> ./include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
> 358 | TP_printk("folio=%p lru=%s",
> | ^~~~~~~~~
> In file included from ./include/trace/trace_events.h:256:
> ./include/trace/events/vmscan.h:358:27: note: format string is defined here
> 358 | TP_printk("folio=%p lru=%s",
> | ~^
> | |
> | void *
> | %ld
> ./include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
> ./include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
> 354 | __entry->folio = folio;
> | ^
> ./include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
> 427 | { assign; } \
> | ^~~~~~
> ./include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
> 435 | PARAMS(assign), PARAMS(print)) \
> | ^~~~~~
> ./include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
> 40 | DECLARE_EVENT_CLASS(name, \
> | ^~~~~~~~~~~~~~~~~~~
> ./include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
> 44 | PARAMS(assign), \
> | ^~~~~~
> ./include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
> 342 | TRACE_EVENT(mm_vmscan_write_folio,
> | ^~~~~~~~~~~
> ./include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
> 353 | TP_fast_assign(
> | ^~~~~~~~~~~~~~
> In file included from ./include/trace/define_trace.h:133:
> ./include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
> ./include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
> 354 | __entry->folio = folio;
> | ^
> ./include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
> 51 | { assign; } \
> | ^~~~~~
> ./include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
> 67 | PARAMS(assign), PARAMS(print)) \
> | ^~~~~~
> ./include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
> 40 | DECLARE_EVENT_CLASS(name, \
> | ^~~~~~~~~~~~~~~~~~~
> ./include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
> 44 | PARAMS(assign), \
> | ^~~~~~
> ./include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
> 342 | TRACE_EVENT(mm_vmscan_write_folio,
> | ^~~~~~~~~~~
> ./include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
> 353 | TP_fast_assign(
> | ^~~~~~~~~~~~~~
> cc1: all warnings being treated as errors
> make[3]: *** [scripts/Makefile.build:289: mm/vmscan.o] Error 1
> make[2]: *** [scripts/Makefile.build:548: mm] Error 2
> make[1]: *** [/usr/src/25/Makefile:2141: .] Error 2
> make: *** [Makefile:248: __sub-make] Error 2
>
Sorry, I ignored the rookie mistake which I built pass with the aarch64-linux-gnu toolchain.
Could you please apply the following minor fix change based on this patch?
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -346,7 +346,7 @@ TRACE_EVENT(mm_vmscan_write_folio,
TP_ARGS(folio),
TP_STRUCT__entry(
- __field(unsigned long, folio)
+ __field(struct folio *, folio)
__field(int, lru)
),
^ permalink raw reply
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Mark Brown @ 2026-05-10 12:37 UTC (permalink / raw)
To: Praveen Talari
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
MukeshKumarSavaliyamukesh.savaliya, AniketRandiveaniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <4d90b061-95ab-40d4-83d2-13425e992d4d@oss.qualcomm.com>
[-- Attachment #1: Type: text/plain, Size: 844 bytes --]
On Sat, May 09, 2026 at 07:37:26AM +0530, Praveen Talari wrote:
> Could you also please review the changes made in spi.c ?
> I would appreciate any feedback or suggestions you may have.
Please just sumbmit normal patches instead of sending partial patches in
reply to another thread unless something is really unclear.
> @@ -1658,6 +1658,11 @@ static int spi_transfer_one_message(struct
> spi_controller *ctlr,
>
> trace_spi_transfer_stop(msg, xfer);
>
> + if (spi_valid_txbuf(msg, xfer))
> + trace_spi_tx_data(msg->spi, xfer->tx_buf,
> xfer->len);
> + if (spi_valid_rxbuf(msg, xfer))
> + trace_spi_rx_data(msg->spi, xfer->rx_buf,
> xfer->len);
It feels like it'd be more helpful to log the transmit data before we do
the send.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH RFC v5 10/53] KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
From: Liam R. Howlett @ 2026-05-10 13:43 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgF9+Gr7UVEq-E2SQEb_XOQQMOXy9F_A2tA=DbNV_fJ0EQ@mail.gmail.com>
On 7 May 2026 12:56:11 GMT-04:00, Ackerley Tng <ackerleytng@google.com> wrote:
>"Liam R. Howlett" <liam@infradead.org> writes:
>
>> On 26/04/28 04:25PM, Ackerley Tng via B4 Relay wrote:
>>>
>>> [...snip...]
>>>
>>> +/*
>>> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
>>> + * by mas. Adjacent ranges with attributes identical to the new attributes
>>> + * will be merged. Also sets mas's bounds up for storing attributes.
>>> + *
>>> + * This maintains the invariant that ranges with the same attributes will
>>> + * always be merged.
>>> + */
>>> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
>>> + pgoff_t start, size_t nr_pages)
>>> +{
>>> + pgoff_t end = start + nr_pages;
>>> + pgoff_t last = end - 1;
>>> + void *entry;
>>> +
>>> + /* Try extending range. entry is NULL on overflow/wrap-around. */
>>> + mas_set_range(mas, end, end);
>>> + entry = mas_find(mas, end);
>
>Thank you for your reviews!
>
>>
>> Please read the documentation as I believe you have a bug here. What
>> happens if there is another range stored higher than end + 1?
>>
>
>The invariant in this maple tree is that contiguous ranges with the same
>attribute are stored as a single range.
>
>The goal of this first part is to get the entry at the index just after
>the requested range, and see what the attribute there is. If that
>attribute is what we're about to set, extend the requested range for
>storing to the end of that range.
>
>If there is another range higher than end + 1, with the invariant
>maintained, that attribute has to be different than the attribute stored
>at end. Hence, we only want to extend this requested range up till end.
>
mas_find() will look for an entry at the given address for the first search, and if it is not found it will continue to search upwards. Since you limit the search to end, it will work as you want and there isn't a bug as I was thinking in my sleep deprived state.
Since you are searching for exactly one address (end), it might serve you better to walk there. Maybe walking is a better API for what you are doing here?
>> Do you have testing of these functions somewhere?
>>
>
>GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4) tests setting
>attributes in ranges. If test_page is 2,
>
>1. [0, 4) starts off shared (4 is the number of pages in the guest_memfd)
>2. [2, 3) is converted to private
> => so the ranges should now be [0, 2), [2, 3), [3, 4)
>3. [2, 3) is converted back to shared
> => so the ranges should now be [0, 4)
>
>I verified this by inserting some trace_printk()s and inspecting manually.
>
Thanks. I find the exclusive ranges a bit odd to think about in the maple tree context, but this test case makes sense. This is especially odd to look at a single index entry, at least for me.
I generally have a set of test cases and append any bug reproduces to that list so they are unlikely to reoccur. My testing is certainly different from what you'll be doing, but this method has done well with the quality of code improving over time, and limited (if any) regressions.
I actually insist that any fix has a test before I accept them. There are two reasons for this: 1. Avoiding the regression. 2. People really understand the bug if they can create a reproducer.
I hope this helps.
>>> + if (entry && xa_to_value(entry) == attributes)
>>> + last = mas->last;
>>> +
>>> + if (start > 0) {
>>> + mas_set_range(mas, start - 1, start - 1);
>>> + entry = mas_find(mas, start - 1);
>>> + if (entry && xa_to_value(entry) == attributes)
>>> + start = mas->index;
>>> + }
>>> +
>>> + mas_set_range(mas, start, last);
>>> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>>> +}
>>> +
>>>
>>> [...snip...]
>>>
^ permalink raw reply
* Re: [PATCH 01/13] verification/rvgen: Switch LTL parser to Lark
From: Nam Cao @ 2026-05-10 18:18 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <41f74dec0fab9d49a9b77ec994b102ba73fe060d.camel@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> It looks very neat! I didn't go through it fully just yet, though.
> This one works fine but there's a nit: the ASTNode's id starts from 1, but
> apparently the new grammar consider RULE as a node too, this results in
> variables in the generated header file starting from val2 (rather than val1).
>
> Unless I missed something here, we should probably start from 0:
Yep, thanks!
> Also it doesn't gracefully handle an invalid syntax, but that's probably still a
> work in progress.
We could catch Lark's exception. I will look into it.
Nam
^ permalink raw reply
* Re: [PATCH 03/13] verification/rvgen: Implement state and transition parser based on Lark
From: Nam Cao @ 2026-05-10 18:21 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <dba16e4fe794c5aa237f3360fef5fc22ae32c60b.camel@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> Looks good, although I need to put more effort to understand the grammar.
> There's an issue parsing events though:
>
>> +class EventLabelParser:
>> + grammar = r'''
>> + events: event ("\\n" event)*
>> +
>> + event: name (";" guard)*
>> +
>> + guard: reset
>> + | rule
>> + | rule reset
>> + | reset rule
>
> I'm not sure if it could be solved better changing the grammar, but this doesn't
> work in case we have both a reset and a rule, e.g.:
>
> "event2;env1 == 0;reset(clk)"
>
> It apparently saves only one of them, the other would end up in args[2].
Thanks for pointing that out. The grammar is broken, it allows any
number of guards and does not require a ";" between reset and rule.
This grammar change fixes it:
diff --git a/tools/verification/rvgen/rvgen/automata.py b/tools/verification/rvgen/rvgen/automata.py
index cc42b8127fc0..d8c5e9028364 100644
--- a/tools/verification/rvgen/rvgen/automata.py
+++ b/tools/verification/rvgen/rvgen/automata.py
@@ -275,12 +275,12 @@ class EventLabelParser:
grammar = r'''
events: event ("\\n" event)*
- event: name (";" guard)*
+ event: name (";" guard)?
guard: reset
| rule
- | rule reset
- | reset rule
+ | rule ";" reset
+ | reset ";" rule
name: CNAME
@@ -312,6 +312,7 @@ class EventLabelParser:
return ConstraintCondition(*args)
def event(self, args):
+ assert(len(args) <= 2)
name = args[0]
rule, reset = None, None
if len(args) == 2:
^ permalink raw reply related
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-10 21:25 UTC (permalink / raw)
To: Andrii Nakryiko; +Cc: bpf, linux-trace-kernel, oleg, peterz, mingo, mhiramat
In-Reply-To: <20260509003146.976844-1-andrii@kernel.org>
On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> probe site with a CALL into a uprobe trampoline. CALL pushes a return
> address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> user code may keep temporary data without adjusting rsp.
>
> Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> also does not provide a return address. Replace the single trampoline
> entry with a page of 16-byte slots. Each optimized probe jumps to its
> assigned slot, the slot moves rsp below the red zone, saves the registers
> clobbered by syscall, and invokes the uprobe syscall:
>
> Probe site: jmp slot_N (5B, replaces nop5)
>
> Slot N: lea -128(%rsp), %rsp (5B) skip red zone
> push %rcx (1B) save (syscall clobbers)
> push %r11 (2B) save (syscall clobbers)
> push %rax (1B) save (syscall uses for nr)
> mov $336, %eax (5B) uprobe syscall number
> syscall (2B)
>
> All slots contain identical code at different offsets, so the trampoline
> page is generated once at boot and mapped read-execute into each process.
> The syscall handler identifies the slot from regs->ip, which points just
> after the syscall instruction, and uses a per-mm slot table to recover the
> original probe address.
>
> The uprobe syscall does not return to the trampoline slot. The handler
> restores the probe-site register state, runs the uprobe consumers, sets
> pt_regs to continue at probe_addr + 5 unless a consumer redirected
> execution, and returns directly through the IRET path. This preserves
> general purpose registers, including rcx and r11, without requiring any
> post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> shadow stack concerns.
>
> Protect the per-mm trampoline list with RCU and free trampoline metadata
> with kfree_rcu(). This lets the syscall path resolve trampoline slots
> without taking mmap_lock. The optimized-instruction detection path also
> walks the trampoline list under an RCU read-side lock. Since that path
> starts from the JMP target, it translates the slot start to the post-syscall
> IP expected by the shared resolver before checking the trampoline mapping.
>
> Each trampoline page provides 256 slots. Slots stay permanently assigned
> to their first probe address and are reused only when the same address is
> probed again. Reassigning detached slots is deliberately avoided because a
> thread can remain in a trampoline for an unbounded time due to ptrace,
> interrupts, or scheduling delays. If a reachable trampoline page runs out
> of slots, probes that cannot allocate a slot fall back to the slower INT3
> path.
>
> Require the entire trampoline page to be reachable by a rel32 JMP before
> reusing it for a probe. This keeps every slot in the page within the range
> that can be encoded at the probe site.
>
> Change the error code returned when the uprobe syscall is invoked outside
> a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> similar libraries distinguish fixed kernels from kernels with the
> red-zone-clobbering implementation and enable nop5 optimization only on
> fixed kernels.
>
> Performance (usdt single-thread, M/s):
>
> usdt-nop usdt-nop5-base usdt-nop5-fix nop5-change iret%
> Skylake 3.149 6.422 4.865 -24.3% 39.1%
> Milan 2.910 3.443 3.820 +11.0% 24.3%
> Sapphire Rapids 1.896 4.023 3.693 -8.2% 24.9%
> Bergamo 3.393 3.895 3.849 -1.2% 24.5%
>
> The fixed nop5 path remains faster than the non-optimized INT3 path on all
> measured systems. The regression relative to the old CALL-based trampoline
> comes from IRET being more expensive than SYSRET, most noticeably on older
> Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> and AMD Milan improves because removing mmap_lock from the hot path more
> than offsets the IRET cost.
>
> Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
hi,
thanks a lot for the fix
FWIW we discussed also an option to have 10-bytes nop and do:
[rsp+0x80, call trampoline]
we would not need the slots re-use logic, but not sure what other
surprises there are with 10-bytes nop
I tried that change [1], it seems to work, but it has other
difficulties, like I think the unoptimized path needs to do:
[rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
instead of patching back the 10-byte nop, because some thread
could be inside the nop area already.
jirka
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574
>
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> arch/x86/include/asm/uprobes.h | 18 ++
> arch/x86/kernel/uprobes.c | 262 ++++++++++--------
> tools/lib/bpf/features.c | 8 +-
> .../selftests/bpf/prog_tests/uprobe_syscall.c | 5 +-
> tools/testing/selftests/bpf/prog_tests/usdt.c | 2 +-
> 5 files changed, 181 insertions(+), 114 deletions(-)
>
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 362210c79998..a7cf5c92d95a 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -25,6 +25,24 @@ enum {
> ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
> };
>
> +/*
> + * Trampoline page layout: identical 16-byte slots, each containing:
> + * lea -128(%rsp), %rsp (5B) skip red zone
> + * push %rcx (1B) save (syscall clobbers)
> + * push %r11 (2B) save (syscall clobbers)
> + * push %rax (1B) save (syscall uses for nr)
> + * mov $336, %eax (5B) uprobe syscall number
> + * syscall (2B)
> + * = 16B, no padding needed
> + *
> + * The handler identifies which probe fired from regs->ip (each
> + * slot is at a unique offset), looks up the probe address from a
> + * per-process table, and returns directly to probe_addr+5 via iret
> + * with all registers restored.
> + */
> +#define UPROBE_TRAMP_SLOT_SIZE 16
> +#define UPROBE_TRAMP_MAX_SLOTS (PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
> +
> struct uprobe_xol_ops;
>
> struct arch_uprobe {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..7e1f14200bbb 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
>
> struct uprobe_trampoline {
> struct hlist_node node;
> + struct rcu_head rcu;
> unsigned long vaddr;
> + unsigned long probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
> };
>
> -static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> +
> +static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
> {
> - long delta = (long)(vaddr + 5 - vtramp);
> + long delta = (long)(dst - (src + JMP32_INSN_SIZE));
>
> return delta >= INT_MIN && delta <= INT_MAX;
> }
>
> +static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
> +{
> + return is_reachable_by_jmp(vtramp, vaddr) &&
> + is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
> +}
> +
> static unsigned long find_nearest_trampoline(unsigned long vaddr)
> {
> struct vm_unmapped_area_info info = {
> @@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> return tramp;
> }
>
> +static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
> +{
> + int i;
> +
> + for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
> + if (tramp->probe_addrs[i] == probe_addr)
> + return i;
> + if (tramp->probe_addrs[i] == 0) {
> + tramp->probe_addrs[i] = probe_addr;
> + return i;
> + }
> + }
> + return -ENOSPC;
> +}
> +
> static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> {
> struct uprobes_state *state = ¤t->mm->uprobes_state;
> @@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> return NULL;
>
> hlist_for_each_entry(tramp, &state->head_tramps, node) {
> - if (is_reachable_by_call(tramp->vaddr, vaddr)) {
> + if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
> *new = false;
> return tramp;
> }
> @@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> return NULL;
>
> *new = true;
> - hlist_add_head(&tramp->node, &state->head_tramps);
> + hlist_add_head_rcu(&tramp->node, &state->head_tramps);
> return tramp;
> }
>
> @@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> * because there's no easy way to make sure none of the threads
> * is still inside the trampoline.
> */
> - hlist_del(&tramp->node);
> - kfree(tramp);
> + hlist_del_rcu(&tramp->node);
> + kfree_rcu(tramp, rcu);
> }
>
> void arch_uprobe_init_state(struct mm_struct *mm)
> @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> destroy_uprobe_trampoline(tramp);
> }
>
> -static bool __in_uprobe_trampoline(unsigned long ip)
> +/*
> + * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
> + * resolve the slot index from @ip and return the probe address.
> + *
> + * @ip is expected to point right after the syscall instruction, i.e.,
> + * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
> + */
> +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> {
> - struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> + struct uprobes_state *state = ¤t->mm->uprobes_state;
> + struct uprobe_trampoline *tramp;
>
> - return vma && vma_is_special_mapping(vma, &tramp_mapping);
> -}
> + hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
> + /*
> + * ip points to after syscall, so it's on 16 byte boundary,
> + * which means that valid ip can point right after the page
> + * and should never be at zero offset within the page
> + */
> + if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
> + continue;
>
> -static bool in_uprobe_trampoline(unsigned long ip)
> -{
> - struct mm_struct *mm = current->mm;
> - bool found, retry = true;
> - unsigned int seq;
> + if (probe_addr) {
> + /* we already validated ip is within expected range */
> + unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
> + unsigned long addr = tramp->probe_addrs[slot];
>
> - rcu_read_lock();
> - if (mmap_lock_speculate_try_begin(mm, &seq)) {
> - found = __in_uprobe_trampoline(ip);
> - retry = mmap_lock_speculate_retry(mm, seq);
> - }
> - rcu_read_unlock();
> + *probe_addr = addr;
> + if (addr == 0)
> + return false;
> + }
>
> - if (retry) {
> - mmap_read_lock(mm);
> - found = __in_uprobe_trampoline(ip);
> - mmap_read_unlock(mm);
> + return true;
> }
> - return found;
> + return false;
> +}
> +
> +static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
> +{
> + guard(rcu)();
> + return resolve_uprobe_addr(ip, probe_addr);
> }
>
> /*
> - * See uprobe syscall trampoline; the call to the trampoline will push
> - * the return address on the stack, the trampoline itself then pushes
> - * cx, r11 and ax.
> + * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
> + * before doing the uprobe syscall. No return address is pushed — the
> + * probe site uses jmp, not call.
> */
> struct uprobe_syscall_args {
> unsigned long ax;
> unsigned long r11;
> unsigned long cx;
> - unsigned long retaddr;
> };
>
> +#define UPROBE_TRAMP_REDZONE 128
> +
> SYSCALL_DEFINE0(uprobe)
> {
> struct pt_regs *regs = task_pt_regs(current);
> struct uprobe_syscall_args args;
> - unsigned long ip, sp, sret;
> + unsigned long probe_addr;
> int err;
>
> /* Allow execution only from uprobe trampolines. */
> - if (!in_uprobe_trampoline(regs->ip))
> - return -ENXIO;
> + if (!in_uprobe_trampoline(regs->ip, &probe_addr))
> + return -EPROTO;
>
> err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
> if (err)
> goto sigill;
>
> - ip = regs->ip;
> -
> /*
> - * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> - * - adjust ip to the probe address, call saved next instruction address
> - * - adjust sp to the probe's stack frame (check trampoline code)
> + * Restore the register state as it was at the probe site:
> + * - ax/r11/cx from the trampoline-saved copies on user stack
> + * - adjust ip to the probe address based on matching slot
> + * - adjust sp to skip red zone and pushed args
> */
> regs->ax = args.ax;
> regs->r11 = args.r11;
> regs->cx = args.cx;
> - regs->ip = args.retaddr - 5;
> - regs->sp += sizeof(args);
> + regs->ip = probe_addr;
> + regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
> regs->orig_ax = -1;
>
> - sp = regs->sp;
> -
> - err = shstk_pop((u64 *)&sret);
> - if (err == -EFAULT || (!err && sret != args.retaddr))
> - goto sigill;
> -
> - handle_syscall_uprobe(regs, regs->ip);
> + handle_syscall_uprobe(regs, probe_addr);
>
> /*
> - * Some of the uprobe consumers has changed sp, we can do nothing,
> - * just return via iret.
> + * Skip the jmp instruction at the probe site (5 bytes) unless
> + * a consumer redirected execution elsewhere.
> */
> - if (regs->sp != sp) {
> - /* skip the trampoline call */
> - if (args.retaddr - 5 == regs->ip)
> - regs->ip += 5;
> - return regs->ax;
> - }
> + if (regs->ip == probe_addr)
> + regs->ip = probe_addr + 5;
>
> - regs->sp -= sizeof(args);
> -
> - /* for the case uprobe_consumer has changed ax/r11/cx */
> - args.ax = regs->ax;
> - args.r11 = regs->r11;
> - args.cx = regs->cx;
> -
> - /* keep return address unless we are instructed otherwise */
> - if (args.retaddr - 5 != regs->ip)
> - args.retaddr = regs->ip;
> -
> - if (shstk_push(args.retaddr) == -EFAULT)
> - goto sigill;
> -
> - regs->ip = ip;
> -
> - err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> - if (err)
> - goto sigill;
> -
> - /* ensure sysret, see do_syscall_64() */
> - regs->r11 = regs->flags;
> - regs->cx = regs->ip;
> - return 0;
> + /*
> + * Return via iret by returning regs->ax. This preserves all
> + * GP registers (including cx and r11) without needing any
> + * user-space cleanup code. The iret path is used because we
> + * don't set up cx/r11 for sysret.
> + */
> + return regs->ax;
>
> sigill:
> force_sig(SIGILL);
> return -1;
> }
>
> +/*
> + * All uprobe trampoline slots are identical: skip the red zone,
> + * save the three registers that syscall clobbers, then invoke
> + * the uprobe syscall. The handler returns directly to the probe
> + * caller via iret. Execution never returns to the trampoline.
> + */
> asm (
> ".pushsection .rodata\n"
> - ".balign " __stringify(PAGE_SIZE) "\n"
> - "uprobe_trampoline_entry:\n"
> + ".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
> + "uprobe_trampoline_slot:\n"
> + "lea -128(%rsp), %rsp\n"
> "push %rcx\n"
> "push %r11\n"
> "push %rax\n"
> - "mov $" __stringify(__NR_uprobe) ", %rax\n"
> + "mov $" __stringify(__NR_uprobe) ", %eax\n"
> "syscall\n"
> - "pop %rax\n"
> - "pop %r11\n"
> - "pop %rcx\n"
> - "ret\n"
> - "int3\n"
> - ".balign " __stringify(PAGE_SIZE) "\n"
> + "uprobe_trampoline_slot_end:\n"
> ".popsection\n"
> );
>
> -extern u8 uprobe_trampoline_entry[];
> +extern u8 uprobe_trampoline_slot[];
> +extern u8 uprobe_trampoline_slot_end[];
>
> static int __init arch_uprobes_init(void)
> {
> - tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> + unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
> + struct page *page;
> + u8 *page_addr;
> + int i;
> +
> + BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
> + WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
> +
> + page = alloc_page(GFP_KERNEL);
> + if (!page)
> + return -ENOMEM;
> +
> + page_addr = page_address(page);
> + for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
> + memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
> +
> + tramp_mapping_pages[0] = page;
> return 0;
> }
>
> @@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
>
> enum {
> EXPECT_SWBP,
> - EXPECT_CALL,
> + EXPECT_JMP,
> };
>
> struct write_opcode_ctx {
> @@ -917,14 +947,14 @@ struct write_opcode_ctx {
> int expect;
> };
>
> -static int is_call_insn(uprobe_opcode_t *insn)
> +static int is_jmp_insn(uprobe_opcode_t *insn)
> {
> - return *insn == CALL_INSN_OPCODE;
> + return *insn == JMP32_INSN_OPCODE;
> }
>
> /*
> * Verification callback used by int3_update uprobe_write calls to make sure
> - * the underlying instruction is as expected - either int3 or call.
> + * the underlying instruction is as expected - either int3 or jmp.
> */
> static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
> int nbytes, void *data)
> @@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
> if (is_swbp_insn(&old_opcode[0]))
> return 1;
> break;
> - case EXPECT_CALL:
> - if (is_call_insn(&old_opcode[0]))
> + case EXPECT_JMP:
> + if (is_jmp_insn(&old_opcode[0]))
> return 1;
> break;
> }
> @@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> * so we can skip this step for optimize == true.
> */
> if (!optimize) {
> - ctx.expect = EXPECT_CALL;
> + ctx.expect = EXPECT_JMP;
> err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
> true /* is_register */, false /* do_update_ref_ctr */,
> &ctx);
> @@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> }
>
> static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> - unsigned long vaddr, unsigned long tramp)
> + unsigned long vaddr, unsigned long slot_vaddr)
> {
> - u8 call[5];
> + u8 jmp[5];
>
> - __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> - (const void *) tramp, CALL_INSN_SIZE);
> - return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> + __text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
> + (const void *) slot_vaddr, JMP32_INSN_SIZE);
> + return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
> }
>
> static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> @@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> struct __packed __arch_relative_insn {
> u8 op;
> s32 raddr;
> - } *call = (struct __arch_relative_insn *) insn;
> + } *jmp = (struct __arch_relative_insn *) insn;
>
> - if (!is_call_insn(insn))
> + if (!is_jmp_insn(&jmp->op))
> return false;
> - return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> +
> + guard(rcu)();
> + /*
> + * resolve_uprobe_addr() expects IP pointing after syscall instruction
> + * (after the slot, basically), so adjust jump target address accordingly
> + */
> + return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
> }
>
> static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
> @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> {
> struct uprobe_trampoline *tramp;
> struct vm_area_struct *vma;
> + unsigned long slot_vaddr;
> bool new = false;
> - int err = 0;
> + int slot, err;
>
> vma = find_vma(mm, vaddr);
> if (!vma)
> @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> tramp = get_uprobe_trampoline(vaddr, &new);
> if (!tramp)
> return -EINVAL;
> - err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> - if (WARN_ON_ONCE(err) && new)
> +
> + slot = tramp_alloc_slot(tramp, vaddr);
> + if (slot < 0) {
> + if (new)
> + destroy_uprobe_trampoline(tramp);
> + return slot;
> + }
> +
> + slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
> + err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
> + if (err && new)
> destroy_uprobe_trampoline(tramp);
> return err;
> }
> diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> index 4f19a0d79b0c..1b6c113357b2 100644
> --- a/tools/lib/bpf/features.c
> +++ b/tools/lib/bpf/features.c
> @@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
> static int probe_uprobe_syscall(int token_fd)
> {
> /*
> - * If kernel supports uprobe() syscall, it will return -ENXIO when called
> - * from the outside of a kernel-generated uprobe trampoline.
> + * If kernel supports uprobe() syscall, it will return -EPROTO when
> + * called from outside a kernel-generated uprobe trampoline.
> + * Older kernels with the red-zone-clobbering bug return -ENXIO;
> + * we only enable the nop5 optimization on fixed kernels.
> */
> - return syscall(__NR_uprobe) < 0 && errno == ENXIO;
> + return syscall(__NR_uprobe) < 0 && errno == EPROTO;
> }
> #else
> static int probe_uprobe_syscall(int token_fd)
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 955a37751b52..0d5eb4cd1ddf 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
> /* .. and check the trampoline is as expected. */
> call = (struct __arch_relative_insn *) addr;
> tramp = (void *) (call + 1) + call->raddr;
> - ASSERT_EQ(call->op, 0xe8, "call");
> + tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
> + ASSERT_EQ(call->op, 0xe9, "jmp");
> ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
>
> return tramp;
> @@ -762,7 +763,7 @@ static void test_uprobe_error(void)
> long err = syscall(__NR_uprobe);
>
> ASSERT_EQ(err, -1, "error");
> - ASSERT_EQ(errno, ENXIO, "errno");
> + ASSERT_EQ(errno, EPROTO, "errno");
> }
>
> static void __test_uprobe_syscall(void)
> diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> index 69759b27794d..9d3744d4e936 100644
> --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> @@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
> ASSERT_EQ(*addr_2, 0x90, "nop");
>
> /* call is on addr_2 + 1 address */
> - ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
> + ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
> ASSERT_EQ(skel->bss->executed, 4, "executed");
>
> cleanup:
> --
> 2.53.0-Meta
>
^ permalink raw reply
* [PATCH] uprobes: Use flexible array for xol_area bitmap
From: Rosen Penev @ 2026-05-10 21:41 UTC (permalink / raw)
To: linux-trace-kernel
Cc: Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, open list:UPROBES,
open list:PERFORMANCE EVENTS SUBSYSTEM
The XOL slot bitmap has the same lifetime as struct xol_area, but it
is currently allocated separately. That adds another allocation
failure path and a matching cleanup branch without buying any extra
flexibility.
Store the bitmap as a flexible array member and allocate it together
with the xol_area using kzalloc_flex(). The bitmap remains
zero-initialized, while the allocation and error handling become
simpler.
Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
kernel/events/uprobes.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..9ef74c2ad390 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -108,7 +108,6 @@ static LIST_HEAD(delayed_uprobe_list);
*/
struct xol_area {
wait_queue_head_t wq; /* if all slots are busy */
- unsigned long *bitmap; /* 0 = free slot */
struct page *page;
/*
@@ -117,6 +116,7 @@ struct xol_area {
* the vma go away, and we must handle that reasonably gracefully.
*/
unsigned long vaddr; /* Page(s) of instruction slots */
+ unsigned long bitmap[]; /* 0 = free slot */
};
static void uprobe_warn(struct task_struct *t, const char *msg)
@@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
struct xol_area *area;
void *insns;
- area = kzalloc_obj(*area);
+ area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
if (unlikely(!area))
goto out;
- area->bitmap = kcalloc(BITS_TO_LONGS(UINSNS_PER_PAGE), sizeof(long),
- GFP_KERNEL);
- if (!area->bitmap)
- goto free_area;
-
area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
if (!area->page)
- goto free_bitmap;
+ goto free_area;
area->vaddr = vaddr;
init_waitqueue_head(&area->wq);
@@ -1779,8 +1774,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
return area;
__free_page(area->page);
- free_bitmap:
- kfree(area->bitmap);
free_area:
kfree(area);
out:
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-11 2:50 UTC (permalink / raw)
To: Mark Brown
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
MukeshKumarSavaliyamukesh.savaliya, AniketRandiveaniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <agB8AgF3qVqDw60Z@sirena.co.uk>
HI Mark,
On 10-05-2026 18:07, Mark Brown wrote:
> On Sat, May 09, 2026 at 07:37:26AM +0530, Praveen Talari wrote:
>
>> Could you also please review the changes made in spi.c ?
>> I would appreciate any feedback or suggestions you may have.
> Please just sumbmit normal patches instead of sending partial patches in
> reply to another thread unless something is really unclear.
Sure, I will send the patches next. I just wanted to confirm a few points
>
>> @@ -1658,6 +1658,11 @@ static int spi_transfer_one_message(struct
>> spi_controller *ctlr,
>>
>> trace_spi_transfer_stop(msg, xfer);
>>
>> + if (spi_valid_txbuf(msg, xfer))
>> + trace_spi_tx_data(msg->spi, xfer->tx_buf,
>> xfer->len);
>> + if (spi_valid_rxbuf(msg, xfer))
>> + trace_spi_rx_data(msg->spi, xfer->rx_buf,
>> xfer->len);
> It feels like it'd be more helpful to log the transmit data before we do
> the send.
Sure. will take care in next patch-set.
Thanks,
Praveen Talari
^ permalink raw reply
* Re: [PATCH] uprobes: Use flexible array for xol_area bitmap
From: Masami Hiramatsu @ 2026-05-11 4:33 UTC (permalink / raw)
To: Rosen Penev
Cc: linux-trace-kernel, Masami Hiramatsu, Oleg Nesterov,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, James Clark, open list:UPROBES,
open list:PERFORMANCE EVENTS SUBSYSTEM
In-Reply-To: <20260510214118.41926-1-rosenp@gmail.com>
On Sun, 10 May 2026 14:41:18 -0700
Rosen Penev <rosenp@gmail.com> wrote:
> The XOL slot bitmap has the same lifetime as struct xol_area, but it
> is currently allocated separately. That adds another allocation
> failure path and a matching cleanup branch without buying any extra
> flexibility.
>
> Store the bitmap as a flexible array member and allocate it together
> with the xol_area using kzalloc_flex(). The bitmap remains
> zero-initialized, while the allocation and error handling become
> simpler.
>
You also have to update uprobe_clear_state(), because area->bitmap
is no longer allocated separately.
Thank you,
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
> kernel/events/uprobes.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4084e926e284..9ef74c2ad390 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -108,7 +108,6 @@ static LIST_HEAD(delayed_uprobe_list);
> */
> struct xol_area {
> wait_queue_head_t wq; /* if all slots are busy */
> - unsigned long *bitmap; /* 0 = free slot */
>
> struct page *page;
> /*
> @@ -117,6 +116,7 @@ struct xol_area {
> * the vma go away, and we must handle that reasonably gracefully.
> */
> unsigned long vaddr; /* Page(s) of instruction slots */
> + unsigned long bitmap[]; /* 0 = free slot */
> };
>
> static void uprobe_warn(struct task_struct *t, const char *msg)
> @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> struct xol_area *area;
> void *insns;
>
> - area = kzalloc_obj(*area);
> + area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> if (unlikely(!area))
> goto out;
>
> - area->bitmap = kcalloc(BITS_TO_LONGS(UINSNS_PER_PAGE), sizeof(long),
> - GFP_KERNEL);
> - if (!area->bitmap)
> - goto free_area;
> -
> area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> if (!area->page)
> - goto free_bitmap;
> + goto free_area;
>
> area->vaddr = vaddr;
> init_waitqueue_head(&area->wq);
> @@ -1779,8 +1774,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> return area;
>
> __free_page(area->page);
> - free_bitmap:
> - kfree(area->bitmap);
> free_area:
> kfree(area);
> out:
> --
> 2.54.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [RFC v7 0/7] ext4: fast commit: snapshot inode state for FC log
From: Li Chen @ 2026-05-11 8:42 UTC (permalink / raw)
To: Zhang Yi, Theodore Ts'o, Andreas Dilger
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4,
linux-trace-kernel, linux-kernel
Hi,
(This RFC v7 series is rebased onto linux-next master as of 2026-05-09,
commit e98d21c170b0 ("Add linux-next specific files for 20260508").)
Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
masks the issue, and that sleeping in ext4_fc_track_inode() while holding
i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
i_data_sem while the inode is in FC_COMMITTING.
Zhang Yi suggested two possible directions to address the root cause:
1. "Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.
Change ext4_fc_commit() to lookup logical to physical mapping using extent
status tree. This would get rid of the need to call ext4_fc_track_inode()
before acquiring i_data_sem. To do that we would need to ensure that
modified extents from the extent status tree are not evicted from memory."
2. "Alternatively, recording the mapped range of tracking might also be
feasible."
This series implements a hybrid way: it implements approach 2 by snapshotting inode image
and mapped ranges at commit time, and consuming only snapshots during log
writing.
Approach 2 still needs a mapping source while building the snapshot
(logical-to-physical and unwritten/hole semantics). Calling ext4_map_blocks()
there would take i_data_sem and can block inside the
jbd2_journal_lock_updates() window, which risks deadlocks or unbounded stalls.
So the snapshot path uses approach 1's extent status lookups as a best-effort
mapping source to avoid ext4_map_blocks().
I did not fully implement approach 1 (making extent status lookups
authoritative by preventing reclaim of needed entries) because that would need
additional pinning/integration under memory pressure and a larger correctness
surface. Instead, the extent status tree is treated as a cache and the
snapshot path falls back to full commit on cache misses or unstable mappings
(e.g. delayed allocation).
Lock inversion / deadlock model (before):
CPU0 (metadata update) CPU1 (fast commit)
-------------------- -----------------
... hold i_data_sem (A) mutex_lock(s_fc_lock) (B)
ext4_fc_track_inode() ext4_fc_write_inode_data()
mutex_lock(s_fc_lock) (B) ext4_map_blocks()
wait FC_COMMITTING (sleep) down_read(i_data_sem) (A)
This creates i_data_sem (A) -> s_fc_lock (B) on update paths, and
s_fc_lock (B) -> i_data_sem (A) on commit paths. Once CPU0 sleeps while
holding (A), CPU1 can block on (A) while holding (B), completing the ABBA
cycle.
New model (this series):
CPU0 (metadata update) CPU1 (fast commit)
-------------------- -----------------
... maybe hold i_data_sem (A) jbd2_journal_lock_updates()
ext4_fc_track_*() snapshot inode + ranges (no map_blocks)
mutex_lock(s_fc_lock) (B) jbd2_journal_unlock_updates()
if FC_COMMITTING: set FC_REQUEUE s_fc_lock (B)
no sleep write FC log from snapshots only
cleanup: clear COMMITTING, requeue if set
The commit path no longer takes i_data_sem while holding s_fc_lock, and
tracking no longer sleeps waiting for FC_COMMITTING. If an inode is updated
during a fast commit, EXT4_STATE_FC_REQUEUE records that fact and the inode
is moved to FC_Q_STAGING for the next commit.
The only remaining FC_COMMITTING waiter is ext4_fc_del(), which drops
s_fc_lock before sleeping.
This series snapshots the on-disk inode and tracked data ranges while journal
updates are locked and existing handles are drained. The log writing phase then
serializes only snapshots, so it no longer needs to call ext4_map_blocks() and
take i_data_sem under s_fc_lock. This is done in two steps: patch 1 drops
ext4_map_blocks() from log writing by introducing commit-time snapshots, and
patch 5 drops ext4_map_blocks() from the snapshot path by using the extent
status cache. The snapshot also records whether a mapped extent is unwritten,
so the ADD_RANGE records (and replay) preserve unwritten semantics.
Snapshotting runs under jbd2_journal_lock_updates(). Since a cache miss in
ext4_get_inode_loc() can start synchronous inode table I/O and stall handle
starts for milliseconds, patch 1 uses ext4_get_inode_loc_noio() and falls back
to full commit if the inode table block is not present or not uptodate.
ext4_fc_track_inode() also stops waiting for FC_COMMITTING. Updates during an
ongoing fast commit are marked with EXT4_STATE_FC_REQUEUE and are replayed in
the next fast commit, while ext4_fc_del() waits for FC_COMMITTING so an inode
cannot be removed while the commit thread is still using it.
The extent status tree is a cache, not an authoritative source, so the snapshot
path falls back to full commit on cache misses or unstable mappings (e.g.
delayed allocation). This includes cases where extent status entries are not
present (or have been reclaimed) under memory pressure. The snapshot path does
not try to rebuild mappings by calling ext4_map_blocks(); instead it simply
marks the transaction fast commit ineligible.
To keep the updates-locked window bounded, the snapshot path caps the number of
snapshotted inodes and ranges per fast commit (currently 1024 inodes and 2048
ranges) and falls back to full commit when the cap is exceeded. The series also
handles the journal inode i_data_sem lockdep false positive via subclassing;
journal inode mapping may still take i_data_sem even when data inode mapping is
avoided.
Patch 6 adds the ext4_fc_lock_updates tracepoint to quantify the updates-locked
window and snapshot fallback reasons. Patch 7 extends
/proc/fs/ext4/<sb_id>/fc_info with best-effort snapshot counters. If the /proc
interface is undesirable, I can drop patch 7 and keep the tracepoint only, or
drop even both.
Testing and measurement were done on a QEMU/KVM guest with virtio-pmem + dax
(ext4 -O fast_commit, mounted dax,noatime). The workload does python3 500x
{4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)}.
Over 3 cold boots, ext4_fc_lock_updates reported locked_ns p50 2.88-2.92 us,
p99 <= 6.71 us, and max <= 102.71 us, with snap_err always 0. Under stress-ng
memory pressure (stress-ng --vm 4 --vm-bytes 75% --timeout 60s), locked_ns p50
2.94 us, p99 <= 4.97 us, and max <= 20.07 us. The fc_info snapshot failure
counters stayed at 0.
These hold times are in the low microseconds range, and the caps keep the
worst case bounded.
Comments and guidance are very welcome. Please let me know if there are any
concerns about correctness, corner cases, or better approaches.
RFC v6 -> RFC v7:
- Rebase onto linux-next master as of 2026-05-09, commit e98d21c170b0
("Add linux-next specific files for 20260508").
- Address Sashiko review feedback for RFC v6. [2]
- Fix the reported snapshot range arithmetic issue near EXT_MAX_BLOCKS to
avoid cur_lblk / range wraparound in the snapshot walk.
- Report successfully snapshotted inode counts in ext4_fc_lock_updates when
snapshotting stops early, as reported by Sashiko.
- Use READ_ONCE() + div64_u64() for the fc_info lock_updates average, as
reported by Sashiko.
RFC v5 -> RFC v6:
- Rebase onto linux-next master as of 2026-04-08.
- Address tracepoint review feedback by relying on enum auto-increment for
snap_err values and by switching the guarded ext4_fc_lock_updates call site
to trace_call__ext4_fc_lock_updates() to avoid the double static_branch. [1]
- Keep lock window accounting unconditional for fc_info while using the guarded
direct tracepoint call.
- Fix the inode debug print format exposed by the rebase.
RFC v4 -> RFC v5:
- Patch 6: Make ext4_fc_lock_updates snap_err human readable via
TRACE_DEFINE_ENUM() + __print_symbolic(), using a single TRACE_SNAP_ERR
mapping while keeping the enum values stable for tooling.
RFC v3 -> RFC v4:
- Replace lockdep_assert movement with removing the wait in
ext4_fc_track_inode() and using EXT4_STATE_FC_REQUEUE to capture updates
during an ongoing fast commit.
- Replace dropping s_fc_lock around log writing with commit-time snapshots of
inode image and mapped ranges (recording the mapped range of tracking as
suggested by Zhang Yi) so log writing consumes only snapshots.
- Avoid inode table I/O under jbd2_journal_lock_updates() via
ext4_get_inode_loc_noio() and fallback to full commit on cache misses.
- Use the extent status cache for snapshot mappings and fall back to full
commit on cache misses or unstable mappings (e.g. delayed allocation).
- Add tracepoint and /proc snapshot stats to quantify the updates-locked window
and snapshot fallback reasons.
RFC v2 -> RFC v3:
- rebase on top of
https://lore.kernel.org/linux-ext4/20251223131342.287864-1-me@linux.beauty/T/#u
RFC v1 -> RFC v2:
- patch 1: move comments to correct place
- patch 2: add it to patchset.
- add missing RFC prefix
RFC v1: https://lore.kernel.org/linux-ext4/20251222032655.87056-1-me@linux.beauty/T/#u
RFC v2: https://lore.kernel.org/linux-ext4/20251222151906.24607-1-me@linux.beauty/T/#t
RFC v3: https://lore.kernel.org/linux-ext4/20251224032943.134063-1-me@linux.beauty/
RFC v4: https://lore.kernel.org/all/20260120112538.132774-1-me@linux.beauty/
RFC v5: https://lore.kernel.org/all/20260317084624.457185-1-me@linux.beauty/t/#u
RFC v6: https://lore.kernel.org/all/20260408112020.716706-1-me@linux.beauty/
[1]: https://lore.kernel.org/all/acZJl8QUYEq8voqQ@BLRRASHENOY1.amd.com/T/#u
[2]: https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty
Thanks,
Li Chen (7):
ext4: fast commit: snapshot inode state before writing log
ext4: lockdep: handle i_data_sem subclassing for special inodes
ext4: fast commit: avoid waiting for FC_COMMITTING
ext4: fast commit: avoid self-deadlock in inode snapshotting
ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in
snapshots
ext4: fast commit: add lock_updates tracepoint
ext4: fast commit: export snapshot stats in fc_info
fs/ext4/ext4.h | 73 +++-
fs/ext4/fast_commit.c | 716 +++++++++++++++++++++++++++++-------
fs/ext4/inode.c | 51 +++
fs/ext4/super.c | 9 +
include/trace/events/ext4.h | 61 +++
5 files changed, 776 insertions(+), 134 deletions(-)
--
2.53.0
^ permalink raw reply
* [RFC v7 1/7] ext4: fast commit: snapshot inode state before writing log
From: Li Chen @ 2026-05-11 8:42 UTC (permalink / raw)
To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
linux-kernel
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel
In-Reply-To: <20260511084304.1559557-1-me@linux.beauty>
Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.
Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.
Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.
Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v7:
- Drop the stale i_fc_wait initialization after rebasing onto the new
linux-next base.
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Fix the inode debug print format after rebasing.
fs/ext4/ext4.h | 22 ++-
fs/ext4/fast_commit.c | 331 +++++++++++++++++++++++++++++++++++-------
fs/ext4/inode.c | 51 +++++++
3 files changed, 352 insertions(+), 52 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..e01d00dbc077 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1023,6 +1023,7 @@ enum {
I_DATA_SEM_EA
};
+struct ext4_fc_inode_snap;
/*
* fourth extended file system inode data in memory
@@ -1079,6 +1080,22 @@ struct ext4_inode_info {
/* End of lblk range that needs to be committed in this fast commit */
ext4_lblk_t i_fc_lblk_len;
+ /*
+ * Commit-time fast commit snapshots.
+ *
+ * i_fc_snap is installed and freed under sbi->s_fc_lock. The fast
+ * commit log writing path reads the snapshot under sbi->s_fc_lock while
+ * serializing fast commit TLVs.
+ *
+ * The snapshot lifetime is bounded by EXT4_STATE_FC_COMMITTING and the
+ * corresponding cleanup / eviction paths.
+ *
+ * i_fc_snap points to per-inode snapshot data for fast commit:
+ * - a raw inode snapshot for EXT4_FC_TAG_INODE
+ * - data range records for EXT4_FC_TAG_{ADD,DEL}_RANGE
+ */
+ struct ext4_fc_inode_snap *i_fc_snap;
+
spinlock_t i_raw_lock; /* protects updates to the raw inode */
/*
@@ -3080,8 +3097,9 @@ extern int ext4_file_getattr(struct mnt_idmap *, const struct path *,
struct kstat *, u32, unsigned int);
extern void ext4_dirty_inode(struct inode *, int);
extern int ext4_change_inode_journal_flag(struct inode *, int);
-extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
-extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
struct ext4_iloc *iloc);
extern int ext4_inode_attach_jinode(struct inode *inode);
extern int ext4_can_truncate(struct inode *inode);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index b3c22636251d..cd4eac4e7dcb 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -56,21 +56,23 @@
* deleted while it is being flushed.
* [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA"
* state.
- * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures that
- * all the exsiting handles finish and no new handles can start.
- * [4] Mark all the fast commit eligible inodes as undergoing fast commit
- * by setting "EXT4_STATE_FC_COMMITTING" state.
- * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This allows
+ * [3] Lock the journal by calling jbd2_journal_lock_updates(). This ensures
+ * that all the existing handles finish and no new handles can start.
+ * [4] Mark all the fast commit eligible inodes as undergoing fast commit by
+ * setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
+ * needed for log writing.
+ * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
* starting of new handles. If new handles try to start an update on
* any of the inodes that are being committed, ext4_fc_track_inode()
* will block until those inodes have finished the fast commit.
* [6] Commit all the directory entry updates in the fast commit space.
- * [7] Commit all the changed inodes in the fast commit space and clear
- * "EXT4_STATE_FC_COMMITTING" for these inodes.
+ * [7] Commit all the changed inodes in the fast commit space.
* [8] Write tail tag (this tag ensures the atomicity, please read the following
* section for more details).
+ * [9] Clear "EXT4_STATE_FC_COMMITTING" and wake up waiters in
+ * ext4_fc_cleanup().
*
- * All the inode updates must be enclosed within jbd2_jounrnal_start()
+ * All the inode updates must be enclosed within jbd2_journal_start()
* and jbd2_journal_stop() similar to JBD2 journaling.
*
* Fast Commit Ineligibility
@@ -200,6 +202,8 @@ static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
unlock_buffer(bh);
}
+static void ext4_fc_free_inode_snap(struct inode *inode);
+
static inline void ext4_fc_reset_inode(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -216,6 +220,7 @@ void ext4_fc_init_inode(struct inode *inode)
ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
INIT_LIST_HEAD(&ei->i_fc_list);
INIT_LIST_HEAD(&ei->i_fc_dilist);
+ ei->i_fc_snap = NULL;
}
static bool ext4_fc_disabled(struct super_block *sb)
@@ -246,6 +251,7 @@ void ext4_fc_del(struct inode *inode)
alloc_ctx = ext4_fc_lock(inode->i_sb);
if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) {
+ ext4_fc_free_inode_snap(inode);
ext4_fc_unlock(inode->i_sb, alloc_ctx);
return;
}
@@ -287,6 +293,7 @@ void ext4_fc_del(struct inode *inode)
}
finish_wait(wq, &wait.wq_entry);
}
+ ext4_fc_free_inode_snap(inode);
list_del_init(&ei->i_fc_list);
/*
@@ -829,6 +836,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
return true;
}
+struct ext4_fc_range {
+ struct list_head list;
+ u16 tag;
+ ext4_lblk_t lblk;
+ ext4_lblk_t len;
+ ext4_fsblk_t pblk;
+ bool unwritten;
+};
+
+struct ext4_fc_inode_snap {
+ struct list_head data_list;
+ unsigned int inode_len;
+ u8 inode_buf[];
+};
+
/*
* Writes inode in the fast commit space under TLV with tag @tag.
* Returns 0 on success, error on failure.
@@ -836,21 +858,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
{
struct ext4_inode_info *ei = EXT4_I(inode);
- int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
- int ret;
- struct ext4_iloc iloc;
+ struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
struct ext4_fc_inode fc_inode;
struct ext4_fc_tl tl;
u8 *dst;
+ u8 *src;
+ int inode_len;
+ int ret;
- ret = ext4_get_inode_loc(inode, &iloc);
- if (ret)
- return ret;
+ if (!snap)
+ return -ECANCELED;
- if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
- inode_len = EXT4_INODE_SIZE(inode->i_sb);
- else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
- inode_len += ei->i_extra_isize;
+ src = snap->inode_buf;
+ inode_len = snap->inode_len;
+ if (!src || inode_len == 0)
+ return -ECANCELED;
fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -866,10 +888,9 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
dst += EXT4_FC_TAG_BASE_LEN;
memcpy(dst, &fc_inode, sizeof(fc_inode));
dst += sizeof(fc_inode);
- memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len);
+ memcpy(dst, src, inode_len);
ret = 0;
err:
- brelse(iloc.bh);
return ret;
}
@@ -879,12 +900,74 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
*/
static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
{
- ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_map_blocks map;
+ struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
struct ext4_fc_add_range fc_ext;
struct ext4_fc_del_range lrange;
struct ext4_extent *ex;
+ struct ext4_fc_range *range;
+
+ if (!snap)
+ return -ECANCELED;
+
+ list_for_each_entry(range, &snap->data_list, list) {
+ if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
+ lrange.fc_ino = cpu_to_le32(inode->i_ino);
+ lrange.fc_lblk = cpu_to_le32(range->lblk);
+ lrange.fc_len = cpu_to_le32(range->len);
+ if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+ sizeof(lrange), (u8 *)&lrange, crc))
+ return -ENOSPC;
+ continue;
+ }
+
+ fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+ ex = (struct ext4_extent *)&fc_ext.fc_ex;
+ ex->ee_block = cpu_to_le32(range->lblk);
+ ex->ee_len = cpu_to_le16(range->len);
+ ext4_ext_store_pblock(ex, range->pblk);
+ if (range->unwritten)
+ ext4_ext_mark_unwritten(ex);
+ else
+ ext4_ext_mark_initialized(ex);
+
+ if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+ sizeof(fc_ext), (u8 *)&fc_ext, crc))
+ return -ENOSPC;
+ }
+
+ return 0;
+}
+
+static void ext4_fc_free_ranges(struct list_head *head)
+{
+ struct ext4_fc_range *range, *range_n;
+
+ list_for_each_entry_safe(range, range_n, head, list) {
+ list_del(&range->list);
+ kfree(range);
+ }
+}
+
+static void ext4_fc_free_inode_snap(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+
+ if (!snap)
+ return;
+
+ ext4_fc_free_ranges(&snap->data_list);
+ kfree(snap);
+ ei->i_fc_snap = NULL;
+}
+
+static int ext4_fc_snapshot_inode_data(struct inode *inode,
+ struct list_head *ranges)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+ struct ext4_map_blocks map;
int ret;
spin_lock(&ei->i_fc_lock);
@@ -892,18 +975,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
spin_unlock(&ei->i_fc_lock);
return 0;
}
- old_blk_size = ei->i_fc_lblk_start;
- new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+ start_lblk = ei->i_fc_lblk_start;
+ end_lblk = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
ei->i_fc_lblk_len = 0;
spin_unlock(&ei->i_fc_lock);
- cur_lblk_off = old_blk_size;
- ext4_debug("will try writing %d to %d for inode %llu\n",
- cur_lblk_off, new_blk_size, inode->i_ino);
+ cur_lblk = start_lblk;
+ ext4_debug("snapshot data ranges %u-%u for inode %llu\n",
+ start_lblk, end_lblk,
+ (unsigned long long)inode->i_ino);
+
+ while (cur_lblk <= end_lblk) {
+ struct ext4_fc_range *range;
- while (cur_lblk_off <= new_blk_size) {
- map.m_lblk = cur_lblk_off;
- map.m_len = new_blk_size - cur_lblk_off + 1;
+ map.m_lblk = cur_lblk;
+ map.m_len = end_lblk - cur_lblk + 1;
ret = ext4_map_blocks(NULL, inode, &map,
EXT4_GET_BLOCKS_IO_SUBMIT |
EXT4_EX_NOCACHE);
@@ -911,17 +997,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
return -ECANCELED;
if (map.m_len == 0) {
- cur_lblk_off++;
+ cur_lblk++;
continue;
}
+ range = kmalloc(sizeof(*range), GFP_NOFS);
+ if (!range)
+ return -ENOMEM;
+
+ range->lblk = map.m_lblk;
+ range->len = map.m_len;
+ range->pblk = 0;
+ range->unwritten = false;
+
if (ret == 0) {
- lrange.fc_ino = cpu_to_le32(inode->i_ino);
- lrange.fc_lblk = cpu_to_le32(map.m_lblk);
- lrange.fc_len = cpu_to_le32(map.m_len);
- if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
- sizeof(lrange), (u8 *)&lrange, crc))
- return -ENOSPC;
+ range->tag = EXT4_FC_TAG_DEL_RANGE;
} else {
unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
@@ -929,26 +1019,67 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
/* Limit the number of blocks in one extent */
map.m_len = min(max, map.m_len);
- fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
- ex = (struct ext4_extent *)&fc_ext.fc_ex;
- ex->ee_block = cpu_to_le32(map.m_lblk);
- ex->ee_len = cpu_to_le16(map.m_len);
- ext4_ext_store_pblock(ex, map.m_pblk);
- if (map.m_flags & EXT4_MAP_UNWRITTEN)
- ext4_ext_mark_unwritten(ex);
- else
- ext4_ext_mark_initialized(ex);
- if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
- sizeof(fc_ext), (u8 *)&fc_ext, crc))
- return -ENOSPC;
+ range->tag = EXT4_FC_TAG_ADD_RANGE;
+ range->len = map.m_len;
+ range->pblk = map.m_pblk;
+ range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
}
- cur_lblk_off += map.m_len;
+ INIT_LIST_HEAD(&range->list);
+ list_add_tail(&range->list, ranges);
+
+ cur_lblk += map.m_len;
}
return 0;
}
+static int ext4_fc_snapshot_inode(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_fc_inode_snap *snap;
+ int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+ struct ext4_iloc iloc;
+ LIST_HEAD(ranges);
+ int ret;
+ int alloc_ctx;
+
+ ret = ext4_get_inode_loc_noio(inode, &iloc);
+ if (ret)
+ return ret;
+
+ if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
+ inode_len = EXT4_INODE_SIZE(inode->i_sb);
+ else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+ inode_len += ei->i_extra_isize;
+
+ snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
+ if (!snap) {
+ brelse(iloc.bh);
+ return -ENOMEM;
+ }
+ INIT_LIST_HEAD(&snap->data_list);
+ snap->inode_len = inode_len;
+
+ memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
+ brelse(iloc.bh);
+
+ ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+ if (ret) {
+ kfree(snap);
+ ext4_fc_free_ranges(&ranges);
+ return ret;
+ }
+
+ alloc_ctx = ext4_fc_lock(inode->i_sb);
+ ext4_fc_free_inode_snap(inode);
+ ei->i_fc_snap = snap;
+ list_splice_tail_init(&ranges, &snap->data_list);
+ ext4_fc_unlock(inode->i_sb, alloc_ctx);
+
+ return 0;
+}
+
/* Flushes data of all the inodes in the commit queue. */
static int ext4_fc_flush_data(journal_t *journal)
@@ -999,6 +1130,11 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
*/
if (list_empty(&fc_dentry->fcd_dilist))
continue;
+ /*
+ * For EXT4_FC_TAG_CREAT, fcd_dilist is linked on the created
+ * inode's i_fc_dilist list (kept singular), so we can recover the
+ * inode through it.
+ */
ei = list_first_entry(&fc_dentry->fcd_dilist,
struct ext4_inode_info, i_fc_dilist);
inode = &ei->vfs_inode;
@@ -1023,6 +1159,88 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
return 0;
}
+static int ext4_fc_snapshot_inodes(journal_t *journal)
+{
+ struct super_block *sb = journal->j_private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_inode_info *iter;
+ struct ext4_fc_dentry_update *fc_dentry;
+ struct inode **inodes;
+ unsigned int nr_inodes = 0;
+ unsigned int i = 0;
+ int ret = 0;
+ int alloc_ctx;
+
+ alloc_ctx = ext4_fc_lock(sb);
+ list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+ nr_inodes++;
+
+ list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+ struct ext4_inode_info *ei;
+
+ if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+ continue;
+ if (list_empty(&fc_dentry->fcd_dilist))
+ continue;
+
+ /* See the comment in ext4_fc_commit_dentry_updates(). */
+ ei = list_first_entry(&fc_dentry->fcd_dilist,
+ struct ext4_inode_info, i_fc_dilist);
+ if (!list_empty(&ei->i_fc_list))
+ continue;
+
+ nr_inodes++;
+ }
+ ext4_fc_unlock(sb, alloc_ctx);
+
+ if (!nr_inodes)
+ return 0;
+
+ inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+ if (!inodes)
+ return -ENOMEM;
+
+ alloc_ctx = ext4_fc_lock(sb);
+ list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+ inodes[i] = igrab(&iter->vfs_inode);
+ if (inodes[i])
+ i++;
+ }
+
+ list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+ struct ext4_inode_info *ei;
+
+ if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+ continue;
+ if (list_empty(&fc_dentry->fcd_dilist))
+ continue;
+
+ /* See the comment in ext4_fc_commit_dentry_updates(). */
+ ei = list_first_entry(&fc_dentry->fcd_dilist,
+ struct ext4_inode_info, i_fc_dilist);
+ if (!list_empty(&ei->i_fc_list))
+ continue;
+
+ inodes[i] = igrab(&ei->vfs_inode);
+ if (inodes[i])
+ i++;
+ }
+ ext4_fc_unlock(sb, alloc_ctx);
+
+ for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+ ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+ if (ret)
+ break;
+ }
+
+ for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+ if (inodes[nr_inodes])
+ iput(inodes[nr_inodes]);
+ }
+ kvfree(inodes);
+ return ret;
+}
+
static int ext4_fc_perform_commit(journal_t *journal)
{
struct super_block *sb = journal->j_private;
@@ -1095,7 +1313,11 @@ static int ext4_fc_perform_commit(journal_t *journal)
EXT4_STATE_FC_COMMITTING);
}
ext4_fc_unlock(sb, alloc_ctx);
+
+ ret = ext4_fc_snapshot_inodes(journal);
jbd2_journal_unlock_updates(journal);
+ if (ret)
+ return ret;
/*
* Step 5: If file system device is different from journal device,
@@ -1292,6 +1514,7 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
struct ext4_inode_info,
i_fc_list);
list_del_init(&ei->i_fc_list);
+ ext4_fc_free_inode_snap(&ei->vfs_inode);
ext4_clear_inode_state(&ei->vfs_inode,
EXT4_STATE_FC_COMMITTING);
if (tid_geq(tid, ei->i_sync_tid)) {
@@ -1327,6 +1550,14 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
struct ext4_fc_dentry_update,
fcd_list);
list_del_init(&fc_dentry->fcd_list);
+ if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
+ !list_empty(&fc_dentry->fcd_dilist)) {
+ /* See the comment in ext4_fc_commit_dentry_updates(). */
+ ei = list_first_entry(&fc_dentry->fcd_dilist,
+ struct ext4_inode_info,
+ i_fc_dilist);
+ ext4_fc_free_inode_snap(&ei->vfs_inode);
+ }
list_del_init(&fc_dentry->fcd_dilist);
release_dentry_name_snapshot(&fc_dentry->fcd_name);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..4678612f82e8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5025,6 +5025,57 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
return ret;
}
+/*
+ * ext4_get_inode_loc_noio() is a best-effort variant of ext4_get_inode_loc().
+ * It looks up the inode table block in the buffer cache and returns -EAGAIN if
+ * the block is not present or not uptodate, without starting any I/O.
+ */
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc)
+{
+ struct super_block *sb = inode->i_sb;
+ struct ext4_group_desc *gdp;
+ struct buffer_head *bh;
+ ext4_fsblk_t block;
+ int inodes_per_block, inode_offset;
+ unsigned long ino = inode->i_ino;
+
+ iloc->bh = NULL;
+ if (ino < EXT4_ROOT_INO ||
+ ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+ return -EFSCORRUPTED;
+
+ iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+ gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
+ if (!gdp)
+ return -EIO;
+
+ /* Figure out the offset within the block group inode table. */
+ inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+ inode_offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb));
+ iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+
+ block = ext4_inode_table(sb, gdp);
+ if (block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) ||
+ block >= ext4_blocks_count(EXT4_SB(sb)->s_es)) {
+ ext4_error(sb,
+ "Invalid inode table block %llu in block_group %u",
+ block, iloc->block_group);
+ return -EFSCORRUPTED;
+ }
+ block += inode_offset / inodes_per_block;
+
+ bh = sb_find_get_block(sb, block);
+ if (!bh)
+ return -EAGAIN;
+ if (!ext4_buffer_uptodate(bh)) {
+ brelse(bh);
+ return -EAGAIN;
+ }
+
+ iloc->bh = bh;
+ return 0;
+}
+
int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
struct ext4_iloc *iloc)
--
2.53.0
^ permalink raw reply related
* [RFC v7 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-05-11 8:42 UTC (permalink / raw)
To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
linux-kernel
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel
In-Reply-To: <20260511084304.1559557-1-me@linux.beauty>
Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.
lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.
Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
without changing the subclassing logic.
fs/ext4/ext4.h | 4 +++-
fs/ext4/super.c | 8 ++++++++
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e01d00dbc077..05c8f67625b4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1015,12 +1015,14 @@ do { \
* than the first
* I_DATA_SEM_QUOTA - Used for quota inodes only
* I_DATA_SEM_EA - Used for ea_inodes only
+ * I_DATA_SEM_JOURNAL - Used for journal inode only
*/
enum {
I_DATA_SEM_NORMAL = 0,
I_DATA_SEM_OTHER,
I_DATA_SEM_QUOTA,
- I_DATA_SEM_EA
+ I_DATA_SEM_EA,
+ I_DATA_SEM_JOURNAL
};
struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3c869f0001c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1431,6 +1431,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ext4_fc_init_inode(&ei->vfs_inode);
spin_lock_init(&ei->i_fc_lock);
mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+ lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
return &ei->vfs_inode;
}
@@ -5910,6 +5913,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
return ERR_PTR(-EFSCORRUPTED);
}
+#ifdef CONFIG_LOCKDEP
+ lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+ I_DATA_SEM_JOURNAL);
+#endif
+
ext4_debug("Journal inode found at %p: %lld bytes\n",
journal_inode, journal_inode->i_size);
return journal_inode;
--
2.53.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox