* [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-22 4:05 UTC
To: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
Alexander Viro, Christian Brauner, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
Christian König
Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
Philipp Stanner, linux-block, linux-kernel, cgroups,
linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
dri-devel, linux-perf-users, linux-trace-kernel, kexec,
live-patching, linux-modules, linux-crypto, linux-pm, rcu,
sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
The list_for_each*_safe() helpers are used when the loop body may
remove the current entry. Their API exposes the temporary cursor at
every call site, even though most users only need it for the iterator
implementation and never reference it in the loop body.
Add *_mutable() variants for list and hlist iteration. The new helpers
support both forms: callers may keep passing an explicit temporary cursor
when they need to inspect or reset it, or omit it and let the helper use
a unique internal cursor.
This makes call sites that only mutate the list through the current entry
less noisy, while keeping the existing *_safe() helpers available for
compatibility.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
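The cursor handling described in the commit message can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel implementation: `struct list_head` and the list helpers are re-declared locally, `SKETCH_UNIQUE()` is a hypothetical stand-in for `__UNIQUE_ID()` built on the GNU `__COUNTER__` macro, and all `sketch_*` and `demo_*` names are invented for illustration. It assumes a GNU-compatible compiler (`__typeof__`, `__COUNTER__`).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular, doubly linked struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;	/* poison: a stale walk would fault */
}

/* Hypothetical stand-in for __UNIQUE_ID(), built on the GNU __COUNTER__ macro. */
#define SKETCH_PASTE_(a, b) a##b
#define SKETCH_PASTE(a, b) SKETCH_PASTE_(a, b)
#define SKETCH_UNIQUE(prefix) SKETCH_PASTE(prefix, __COUNTER__)

/* Same shape as __list_for_each_mutable_internal(): tmp is scoped to the loop. */
#define sketch_for_each_mutable_internal(pos, tmp, head)		\
	for (__typeof__(pos) tmp = (pos = (head)->next)->next;		\
	     pos != (head);						\
	     pos = tmp, tmp = pos->next)

/* The unique name is generated before substitution, so every use of tmp matches. */
#define sketch_for_each_mutable(pos, head)				\
	sketch_for_each_mutable_internal(pos, SKETCH_UNIQUE(sketch_next_), head)

/* Unlink every node during the walk; returns how many were removed. */
static int unlink_all(struct list_head *head)
{
	struct list_head *pos;
	int removed = 0;

	sketch_for_each_mutable(pos, head) {
		list_del(pos);	/* pos->next is NULL now; the saved cursor is not */
		removed++;
	}
	return removed;
}

/* Build a list of n nodes (0 <= n <= 8), drain it, and report the count. */
int demo_unlink(int n)
{
	struct list_head head, nodes[8];
	int removed;

	list_init(&head);
	for (int i = 0; i < n; i++)
		list_add_tail(&nodes[i], &head);
	removed = unlink_all(&head);
	assert(head.next == &head && head.prev == &head);
	return removed;
}
```

Because the temporary is declared in the for-init clause, the caller in `unlink_all()` declares only `pos`. Calling `demo_unlink(3)` from a small `main()` should return 3 and leave the head pointing at itself.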
include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
1 file changed, 231 insertions(+), 38 deletions(-)
diff --git a/include/linux/list.h b/include/linux/list.h
index 09d979976b3b..1081def7cea9 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -7,6 +7,7 @@
#include <linux/stddef.h>
#include <linux/poison.h>
#include <linux/const.h>
+#include <linux/args.h>
#include <asm/barrier.h>
@@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
#define list_for_each_prev(pos, head) \
for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
-/**
- * list_for_each_safe - iterate over a list safe against removal of list entry
- * @pos: the &struct list_head to use as a loop cursor.
- * @n: another &struct list_head to use as temporary storage
- * @head: the head for your list.
+/*
+ * list_for_each_safe is an old interface, use list_for_each_mutable instead.
*/
#define list_for_each_safe(pos, n, head) \
for (pos = (head)->next, n = pos->next; \
!list_is_head(pos, (head)); \
pos = n, n = pos->next)
+#define __list_for_each_mutable_internal(pos, tmp, head) \
+ for (typeof(pos) tmp = (pos = (head)->next)->next; \
+ !list_is_head(pos, (head)); \
+ pos = tmp, tmp = pos->next)
+
+#define __list_for_each_mutable1(pos, head) \
+ __list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
+
+#define __list_for_each_mutable2(pos, next, head) \
+ list_for_each_safe(pos, next, head)
+
/**
- * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
+ * list_for_each_mutable - iterate over a list safe against entry removal
* @pos: the &struct list_head to use as a loop cursor.
- * @n: another &struct list_head to use as temporary storage
- * @head: the head for your list.
+ * @...: either (head) or (next, head)
+ *
+ * next: another &struct list_head to use as optional temporary storage.
+ * The temporary cursor is internal unless explicitly supplied by
+ * the caller.
+ * head: the head for your list.
+ */
+#define list_for_each_mutable(pos, ...) \
+ CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__)) \
+ (pos, __VA_ARGS__)
+
+/*
+ * list_for_each_prev_safe is an old interface, use list_for_each_prev_mutable instead.
*/
#define list_for_each_prev_safe(pos, n, head) \
for (pos = (head)->prev, n = pos->prev; \
!list_is_head(pos, (head)); \
pos = n, n = pos->prev)
+#define __list_for_each_prev_mutable_internal(pos, tmp, head) \
+ for (typeof(pos) tmp = (pos = (head)->prev)->prev; \
+ !list_is_head(pos, (head)); \
+ pos = tmp, tmp = pos->prev)
+
+#define __list_for_each_prev_mutable1(pos, head) \
+ __list_for_each_prev_mutable_internal(pos, __UNIQUE_ID(prev), head)
+
+#define __list_for_each_prev_mutable2(pos, prev, head) \
+ list_for_each_prev_safe(pos, prev, head)
+
+/**
+ * list_for_each_prev_mutable - iterate over a list backwards safe against entry removal
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @...: either (head) or (prev, head)
+ *
+ * prev: another &struct list_head to use as optional temporary storage.
+ * The temporary cursor is internal unless explicitly supplied by
+ * the caller.
+ * head: the head for your list.
+ */
+#define list_for_each_prev_mutable(pos, ...) \
+ CONCATENATE(__list_for_each_prev_mutable, COUNT_ARGS(__VA_ARGS__)) \
+ (pos, __VA_ARGS__)
+
/**
* list_count_nodes - count nodes in the list
* @head: the head for your list.
@@ -895,12 +940,8 @@ static inline size_t list_count_nodes(struct list_head *head)
for (; !list_entry_is_head(pos, head, member); \
pos = list_prev_entry(pos, member))
-/**
- * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
- * @pos: the type * to use as a loop cursor.
- * @n: another type * to use as temporary storage
- * @head: the head for your list.
- * @member: the name of the list_head within the struct.
+/*
+ * list_for_each_entry_safe is an old interface, use list_for_each_entry_mutable instead.
*/
#define list_for_each_entry_safe(pos, n, head, member) \
for (pos = list_first_entry(head, typeof(*pos), member), \
@@ -908,15 +949,36 @@ static inline size_t list_count_nodes(struct list_head *head)
!list_entry_is_head(pos, head, member); \
pos = n, n = list_next_entry(n, member))
+#define __list_for_each_entry_mutable_internal(pos, tmp, head, member) \
+ for (typeof(pos) tmp = list_next_entry(pos = \
+ list_first_entry(head, typeof(*pos), member), member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = tmp, tmp = list_next_entry(tmp, member))
+
+#define __list_for_each_entry_mutable2(pos, head, member) \
+ __list_for_each_entry_mutable_internal(pos, __UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_mutable3(pos, next, head, member) \
+ list_for_each_entry_safe(pos, next, head, member)
+
/**
- * list_for_each_entry_safe_continue - continue list iteration safe against removal
+ * list_for_each_entry_mutable - iterate over a list safe against entry removal
* @pos: the type * to use as a loop cursor.
- * @n: another type * to use as temporary storage
- * @head: the head for your list.
- * @member: the name of the list_head within the struct.
+ * @...: either (head, member) or (next, head, member)
*
- * Iterate over list of given type, continuing after current point,
- * safe against removal of list entry.
+ * next: another type * to use as optional temporary storage. The
+ * temporary cursor is internal unless explicitly supplied by the
+ * caller.
+ * head: the head for your list.
+ * member: the name of the list_head within the struct.
+ */
+#define list_for_each_entry_mutable(pos, ...) \
+ CONCATENATE(__list_for_each_entry_mutable, COUNT_ARGS(__VA_ARGS__)) \
+ (pos, __VA_ARGS__)
+
+/*
+ * list_for_each_entry_safe_continue is an old interface,
+ * use list_for_each_entry_mutable_continue instead.
*/
#define list_for_each_entry_safe_continue(pos, n, head, member) \
for (pos = list_next_entry(pos, member), \
@@ -924,30 +986,79 @@ static inline size_t list_count_nodes(struct list_head *head)
!list_entry_is_head(pos, head, member); \
pos = n, n = list_next_entry(n, member))
+#define __list_for_each_entry_mutable_continue_internal(pos, tmp, head, member) \
+ for (typeof(pos) tmp = list_next_entry(pos = \
+ list_next_entry(pos, member), member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = tmp, tmp = list_next_entry(tmp, member))
+
+#define __list_for_each_entry_mutable_continue2(pos, head, member) \
+ __list_for_each_entry_mutable_continue_internal(pos, \
+ __UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_mutable_continue3(pos, next, head, member) \
+ list_for_each_entry_safe_continue(pos, next, head, member)
+
/**
- * list_for_each_entry_safe_from - iterate over list from current point safe against removal
+ * list_for_each_entry_mutable_continue - continue list iteration safe against removal
* @pos: the type * to use as a loop cursor.
- * @n: another type * to use as temporary storage
- * @head: the head for your list.
- * @member: the name of the list_head within the struct.
+ * @...: either (head, member) or (next, head, member)
*
- * Iterate over list of given type from current point, safe against
- * removal of list entry.
+ * next: another type * to use as optional temporary storage. The
+ * temporary cursor is internal unless explicitly supplied by the
+ * caller.
+ * head: the head for your list.
+ * member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, continuing after current point,
+ * safe against removal of list entry.
+ */
+#define list_for_each_entry_mutable_continue(pos, ...) \
+ CONCATENATE(__list_for_each_entry_mutable_continue, \
+ COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
+/*
+ * list_for_each_entry_safe_from is an old interface,
+ * use list_for_each_entry_mutable_from instead.
*/
#define list_for_each_entry_safe_from(pos, n, head, member) \
for (n = list_next_entry(pos, member); \
!list_entry_is_head(pos, head, member); \
pos = n, n = list_next_entry(n, member))
+#define __list_for_each_entry_mutable_from_internal(pos, tmp, head, member) \
+ for (typeof(pos) tmp = list_next_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = tmp, tmp = list_next_entry(tmp, member))
+
+#define __list_for_each_entry_mutable_from2(pos, head, member) \
+ __list_for_each_entry_mutable_from_internal(pos, \
+ __UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_mutable_from3(pos, next, head, member) \
+ list_for_each_entry_safe_from(pos, next, head, member)
+
/**
- * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
+ * list_for_each_entry_mutable_from - iterate over list from current point safe against removal
* @pos: the type * to use as a loop cursor.
- * @n: another type * to use as temporary storage
- * @head: the head for your list.
- * @member: the name of the list_head within the struct.
+ * @...: either (head, member) or (next, head, member)
*
- * Iterate backwards over list of given type, safe against removal
- * of list entry.
+ * next: another type * to use as optional temporary storage. The
+ * temporary cursor is internal unless explicitly supplied by the
+ * caller.
+ * head: the head for your list.
+ * member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type from current point, safe against
+ * removal of list entry.
+ */
+#define list_for_each_entry_mutable_from(pos, ...) \
+ CONCATENATE(__list_for_each_entry_mutable_from, \
+ COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
+/*
+ * list_for_each_entry_safe_reverse is an old interface,
+ * use list_for_each_entry_mutable_reverse instead.
*/
#define list_for_each_entry_safe_reverse(pos, n, head, member) \
for (pos = list_last_entry(head, typeof(*pos), member), \
@@ -955,6 +1066,37 @@ static inline size_t list_count_nodes(struct list_head *head)
!list_entry_is_head(pos, head, member); \
pos = n, n = list_prev_entry(n, member))
+#define __list_for_each_entry_mutable_reverse_internal(pos, tmp, head, member) \
+ for (typeof(pos) tmp = list_prev_entry(pos = \
+ list_last_entry(head, typeof(*pos), member), member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = tmp, tmp = list_prev_entry(tmp, member))
+
+#define __list_for_each_entry_mutable_reverse2(pos, head, member) \
+ __list_for_each_entry_mutable_reverse_internal(pos, \
+ __UNIQUE_ID(prev), head, member)
+
+#define __list_for_each_entry_mutable_reverse3(pos, prev, head, member) \
+ list_for_each_entry_safe_reverse(pos, prev, head, member)
+
+/**
+ * list_for_each_entry_mutable_reverse - iterate backwards over list safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @...: either (head, member) or (prev, head, member)
+ *
+ * prev: another type * to use as optional temporary storage. The
+ * temporary cursor is internal unless explicitly supplied by the
+ * caller.
+ * head: the head for your list.
+ * member: the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, safe against removal
+ * of list entry.
+ */
+#define list_for_each_entry_mutable_reverse(pos, ...) \
+ CONCATENATE(__list_for_each_entry_mutable_reverse, \
+ COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
/**
* list_safe_reset_next - reset a stale list_for_each_entry_safe loop
* @pos: the loop cursor used in the list_for_each_entry_safe loop
@@ -1189,6 +1331,31 @@ static inline void hlist_splice_init(struct hlist_head *from,
for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
pos = n)
+#define __hlist_for_each_mutable_internal(pos, tmp, head) \
+ for (typeof(pos) tmp = (pos = (head)->first) ? pos->next : NULL; \
+ pos; \
+ pos = tmp, tmp = pos ? pos->next : NULL)
+
+#define __hlist_for_each_mutable1(pos, head) \
+ __hlist_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
+
+#define __hlist_for_each_mutable2(pos, next, head) \
+ hlist_for_each_safe(pos, next, head)
+
+/**
+ * hlist_for_each_mutable - iterate over a hlist safe against entry removal
+ * @pos: the &struct hlist_node to use as a loop cursor.
+ * @...: either (head) or (next, head)
+ *
+ * next: another &struct hlist_node to use as optional temporary storage.
+ * The temporary cursor is internal unless explicitly supplied by
+ * the caller.
+ * head: the head for your hlist.
+ */
+#define hlist_for_each_mutable(pos, ...) \
+ CONCATENATE(__hlist_for_each_mutable, COUNT_ARGS(__VA_ARGS__)) \
+ (pos, __VA_ARGS__)
+
#define hlist_entry_safe(ptr, type, member) \
({ typeof(ptr) ____ptr = (ptr); \
____ptr ? hlist_entry(____ptr, type, member) : NULL; \
@@ -1224,18 +1391,44 @@ static inline void hlist_splice_init(struct hlist_head *from,
for (; pos; \
pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
-/**
- * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
- * @pos: the type * to use as a loop cursor.
- * @n: a &struct hlist_node to use as temporary storage
- * @head: the head for your list.
- * @member: the name of the hlist_node within the struct.
+/*
+ * hlist_for_each_entry_safe is an old interface, use hlist_for_each_entry_mutable instead.
*/
#define hlist_for_each_entry_safe(pos, n, head, member) \
for (pos = hlist_entry_safe((head)->first, typeof(*pos), member);\
pos && ({ n = pos->member.next; 1; }); \
pos = hlist_entry_safe(n, typeof(*pos), member))
+#define __hlist_for_each_entry_mutable_internal(pos, tmp, head, member) \
+ for (struct hlist_node *tmp = (pos = \
+ hlist_entry_safe((head)->first, typeof(*pos), member)) ? \
+ pos->member.next : NULL; \
+ pos; \
+ pos = hlist_entry_safe((tmp), typeof(*pos), member), \
+ tmp = pos ? pos->member.next : NULL)
+
+#define __hlist_for_each_entry_mutable2(pos, head, member) \
+ __hlist_for_each_entry_mutable_internal(pos, \
+ __UNIQUE_ID(next), head, member)
+
+#define __hlist_for_each_entry_mutable3(pos, next, head, member) \
+ hlist_for_each_entry_safe(pos, next, head, member)
+
+/**
+ * hlist_for_each_entry_mutable - iterate over hlist safe against entry removal
+ * @pos: the type * to use as a loop cursor.
+ * @...: either (head, member) or (next, head, member)
+ *
+ * next: a &struct hlist_node to use as optional temporary storage. The
+ * temporary cursor is internal unless explicitly supplied by the
+ * caller.
+ * head: the head for your hlist.
+ * member: the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_mutable(pos, ...) \
+ CONCATENATE(__hlist_for_each_entry_mutable, \
+ COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
/**
* hlist_count_nodes - count nodes in the hlist
* @head: the head for your hlist.
--
2.43.0
* [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-22 4:05 UTC
To: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
Alexander Viro, Christian Brauner, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
Christian König
Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
Philipp Stanner, linux-block, linux-kernel, cgroups,
linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
dri-devel, linux-perf-users, linux-trace-kernel, kexec,
live-patching, linux-modules, linux-crypto, linux-pm, rcu,
sched-ext, linux-mm, virtualization, damon, llvm, chengkaitao
From: chengkaitao <chengkaitao@kylinos.cn>
The list_for_each*_safe() helpers are used when the loop body may remove
the current entry. Their current interface, however, forces every caller
to define a temporary cursor outside the macro and pass it in, even when
the caller never uses that cursor directly. For most call sites this
extra cursor is just boilerplate required by the macro implementation.
This is awkward because the saved next pointer is an internal detail of
the iteration. Callers that only remove or move the current entry do not
need to spell it out.
The _safe() suffix has also caused confusion. Christian König pointed
out that the name is easy to read as a thread-safe variant, especially
for beginners, even though it only means that the iterator keeps enough
state to tolerate removal of the current entry. He suggested _mutable()
as a clearer description of what the loop permits.
Add *_mutable() iterator variants for list, hlist and llist. The new
helpers are variadic and support both forms. In the common case, the
caller omits the temporary cursor and the macro creates a unique internal
cursor with typeof(pos) and __UNIQUE_ID(). If a loop really needs an
explicit temporary cursor, the caller can still pass it and the helper
keeps the existing *_safe() behaviour.
For example, a call site may use the shorter form:
list_for_each_entry_mutable(pos, head, member)
or keep the explicit temporary cursor form:
list_for_each_entry_mutable(pos, tmp, head, member)
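Both call forms can be modelled in userspace to see how the argument count selects the variant. The sketch below is an illustration under stated assumptions, not the code from this series: `SK_COUNT()`, `SK_CAT()` and `SK_UNIQUE()` are hypothetical stand-ins for `COUNT_ARGS()`, `CONCATENATE()` and `__UNIQUE_ID()`, and `struct item`, the `sk_*` macros and the `demo_*` functions are invented names. It assumes a GNU-compatible compiler (`__typeof__`, `__COUNTER__`).

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct item { int value; struct list_head link; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h; h->prev->next = n; h->prev = n;
}
static void list_del(struct list_head *e)
{
	e->prev->next = e->next; e->next->prev = e->prev;
}

/* Hypothetical stand-ins for COUNT_ARGS(), CONCATENATE() and __UNIQUE_ID(). */
#define SK_NTH(_1, _2, _3, n, ...) n
#define SK_COUNT(...) SK_NTH(__VA_ARGS__, 3, 2, 1, 0)
#define SK_CAT_(a, b) a##b
#define SK_CAT(a, b) SK_CAT_(a, b)
#define SK_UNIQUE(prefix) SK_CAT(prefix, __COUNTER__)

#define sk_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define sk_first(head, type, member) sk_entry((head)->next, type, member)
#define sk_next(pos, member) \
	sk_entry((pos)->member.next, __typeof__(*(pos)), member)
#define sk_is_head(pos, head, member) (&(pos)->member == (head))

/* Three arguments after pos: caller-visible cursor, as in the *_safe() form. */
#define sk_each_entry_mutable_3(pos, n, head, member)			\
	for (pos = sk_first(head, __typeof__(*pos), member),		\
	     n = sk_next(pos, member);					\
	     !sk_is_head(pos, head, member);				\
	     pos = n, n = sk_next(n, member))

/* Two arguments after pos: the cursor is declared inside the for statement. */
#define sk_each_entry_mutable_internal(pos, tmp, head, member)		\
	for (__typeof__(pos) tmp = sk_next(pos =			\
	     sk_first(head, __typeof__(*pos), member), member);	\
	     !sk_is_head(pos, head, member);				\
	     pos = tmp, tmp = sk_next(tmp, member))
#define sk_each_entry_mutable_2(pos, head, member)			\
	sk_each_entry_mutable_internal(pos, SK_UNIQUE(sk_tmp_), head, member)

/* Dispatch on the number of arguments that follow pos. */
#define sk_for_each_entry_mutable(pos, ...)				\
	SK_CAT(sk_each_entry_mutable_, SK_COUNT(__VA_ARGS__))(pos, __VA_ARGS__)

/* Short form: drop odd values, return the sum of the values that remain. */
static int drop_odd_short(struct list_head *head)
{
	struct item *pos;
	int sum = 0;

	sk_for_each_entry_mutable(pos, head, link) {
		if (pos->value & 1)
			list_del(&pos->link);
		else
			sum += pos->value;
	}
	return sum;
}

/* Explicit form: the body inspects the saved cursor to spot the last entry. */
static int last_value_explicit(struct list_head *head)
{
	struct item *pos, *tmp;
	int last = -1;

	sk_for_each_entry_mutable(pos, tmp, head, link) {
		if (sk_is_head(tmp, head, link))
			last = pos->value;
		list_del(&pos->link);
	}
	return last;
}

/* Build the values 1..n (0 <= n <= 8) and run one of the two walks. */
int demo_walk(int n, int explicit_cursor)
{
	struct list_head head;
	struct item items[8];

	list_init(&head);
	for (int i = 0; i < n; i++) {
		items[i].value = i + 1;
		list_add_tail(&items[i].link, &head);
	}
	return explicit_cursor ? last_value_explicit(&head)
			       : drop_odd_short(&head);
}
```

With the values 1..5, `demo_walk(5, 0)` should return 6 (the remaining entries 2 and 4) and `demo_walk(5, 1)` should return 5. The explicit form is only needed when the loop body reads or resets the cursor, as `last_value_explicit()` does here.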
The existing *_safe() helpers remain available for compatibility. This
series only converts users in mm, block, kernel, init and io_uring. If
this approach looks acceptable, the remaining users can be converted in
follow-up series.
Changes in v3 (Christian König, Andy Shevchenko):
- Convert safe list walks to mutable iterators
Changes in v2 (Muchun Song, Andy Shevchenko):
- Drop the list_for_each_entry_mutable*() helpers from v1 and make the
cursor change directly in the existing list_for_each_entry*() helpers.
- Open-code special list walks that rely on updating the loop cursor in
the body, preserving their existing traversal semantics.
Link to v2:
https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/
Link to v1:
https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
Kaitao Cheng (7):
list: Add mutable iterator variants
llist: Add mutable iterator variants
mm: Use mutable list iterators
block: Use mutable list iterators
kernel: Use mutable list iterators
initramfs: Use mutable list iterator
io_uring: Use mutable list iterators
block/bfq-iosched.c | 17 +-
block/blk-cgroup.c | 12 +-
block/blk-flush.c | 4 +-
block/blk-iocost.c | 18 +-
block/blk-mq.c | 8 +-
block/blk-throttle.c | 4 +-
block/kyber-iosched.c | 4 +-
block/partitions/ldm.c | 8 +-
block/sed-opal.c | 4 +-
include/linux/list.h | 269 ++++++++++++++++++++++++----
include/linux/llist.h | 81 +++++++--
init/initramfs.c | 5 +-
io_uring/cancel.c | 6 +-
io_uring/poll.c | 3 +-
io_uring/rw.c | 4 +-
io_uring/timeout.c | 8 +-
io_uring/uring_cmd.c | 3 +-
kernel/audit_tree.c | 4 +-
kernel/audit_watch.c | 16 +-
kernel/auditfilter.c | 4 +-
kernel/auditsc.c | 4 +-
kernel/bpf/arena.c | 10 +-
kernel/bpf/arraymap.c | 8 +-
kernel/bpf/bpf_local_storage.c | 3 +-
kernel/bpf/bpf_lru_list.c | 25 ++-
kernel/bpf/btf.c | 18 +-
kernel/bpf/cgroup.c | 7 +-
kernel/bpf/cpumap.c | 4 +-
kernel/bpf/devmap.c | 10 +-
kernel/bpf/helpers.c | 8 +-
kernel/bpf/local_storage.c | 4 +-
kernel/bpf/memalloc.c | 16 +-
kernel/bpf/offload.c | 8 +-
kernel/bpf/states.c | 4 +-
kernel/bpf/stream.c | 4 +-
kernel/bpf/verifier.c | 6 +-
kernel/cgroup/cgroup-v1.c | 4 +-
kernel/cgroup/cgroup.c | 54 +++---
kernel/cgroup/dmem.c | 12 +-
kernel/cgroup/rdma.c | 8 +-
kernel/events/core.c | 44 +++--
kernel/events/uprobes.c | 12 +-
kernel/exit.c | 8 +-
kernel/fail_function.c | 4 +-
kernel/gcov/clang.c | 4 +-
kernel/irq_work.c | 4 +-
kernel/kexec_core.c | 4 +-
kernel/kprobes.c | 16 +-
kernel/livepatch/core.c | 4 +-
kernel/livepatch/core.h | 4 +-
kernel/liveupdate/kho_block.c | 4 +-
kernel/liveupdate/luo_flb.c | 4 +-
kernel/locking/rwsem.c | 2 +-
kernel/locking/test-ww_mutex.c | 2 +-
kernel/module/main.c | 11 +-
kernel/padata.c | 4 +-
kernel/power/snapshot.c | 8 +-
kernel/power/wakelock.c | 4 +-
kernel/printk/printk.c | 11 +-
kernel/ptrace.c | 4 +-
kernel/rcu/rcutorture.c | 3 +-
kernel/rcu/tasks.h | 9 +-
kernel/rcu/tree.c | 6 +-
kernel/resource.c | 4 +-
kernel/sched/core.c | 4 +-
kernel/sched/ext.c | 22 +--
kernel/sched/fair.c | 28 +--
kernel/sched/topology.c | 4 +-
kernel/sched/wait.c | 4 +-
kernel/seccomp.c | 4 +-
kernel/signal.c | 11 +-
kernel/smp.c | 4 +-
kernel/taskstats.c | 8 +-
kernel/time/clockevents.c | 6 +-
kernel/time/clocksource.c | 4 +-
kernel/time/posix-cpu-timers.c | 4 +-
kernel/time/posix-timers.c | 3 +-
kernel/torture.c | 3 +-
kernel/trace/bpf_trace.c | 4 +-
kernel/trace/ftrace.c | 49 +++--
kernel/trace/ring_buffer.c | 25 ++-
kernel/trace/trace.c | 12 +-
kernel/trace/trace_dynevent.c | 6 +-
kernel/trace/trace_dynevent.h | 5 +-
kernel/trace/trace_events.c | 35 ++--
kernel/trace/trace_events_filter.c | 4 +-
kernel/trace/trace_events_hist.c | 8 +-
kernel/trace/trace_events_trigger.c | 17 +-
kernel/trace/trace_events_user.c | 16 +-
kernel/trace/trace_stat.c | 4 +-
kernel/user-return-notifier.c | 3 +-
kernel/workqueue.c | 16 +-
mm/backing-dev.c | 8 +-
mm/balloon.c | 8 +-
mm/cma.c | 4 +-
mm/compaction.c | 4 +-
mm/damon/core.c | 4 +-
mm/damon/sysfs-schemes.c | 4 +-
mm/dmapool.c | 4 +-
mm/huge_memory.c | 8 +-
mm/hugetlb.c | 56 +++---
mm/hugetlb_vmemmap.c | 16 +-
mm/khugepaged.c | 14 +-
mm/kmemleak.c | 7 +-
mm/ksm.c | 25 +--
mm/list_lru.c | 4 +-
mm/memcontrol-v1.c | 8 +-
mm/memory-failure.c | 12 +-
mm/memory-tiers.c | 4 +-
mm/migrate.c | 23 ++-
mm/mmu_notifier.c | 9 +-
mm/page_alloc.c | 8 +-
mm/page_reporting.c | 2 +-
mm/percpu.c | 11 +-
mm/pgtable-generic.c | 4 +-
mm/rmap.c | 10 +-
mm/shmem.c | 9 +-
mm/slab_common.c | 14 +-
mm/slub.c | 33 ++--
mm/swapfile.c | 4 +-
mm/userfaultfd.c | 12 +-
mm/vmalloc.c | 24 +--
mm/vmscan.c | 7 +-
mm/zsmalloc.c | 4 +-
124 files changed, 875 insertions(+), 681 deletions(-)
--
2.43.0
* Re: [PATCH v2 3/7] rust: doctest: add LocalModule fallback for #[vtable] ThisModule
From: Alvin Sun @ 2026-06-22 2:52 UTC
To: Andreas Hindborg, Gary Guo, Miguel Ojeda, Boqun Feng,
Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block
In-Reply-To: <87fr2kf3m2.fsf@t14s.mail-host-address-is-not-set>
On 6/18/26 20:13, Andreas Hindborg wrote:
> Alvin Sun <alvin.sun@linux.dev> writes:
>
>> Add a `LocalModule` struct with a null-pointer `ModuleMetadata` impl
>> in the doctest harness, so that `crate::LocalModule` (auto-inserted
>> by `#[vtable]`) resolves correctly when there is no `module!` macro.
>>
>> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
>
> Does this need to be ordered before the vtable auto insert in the patch series?
Yes, you're right — this patch would be better placed before the vtable
auto-insert patch to avoid a temporary state where the doctest harness
doesn't provide the LocalModule fallback that #[vtable] expects.
Thanks for the suggestions from you and Gary. v3 has been sent to the
list:
https://lore.kernel.org/rust-for-linux/20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev
Best regards,
Alvin Sun
* [PATCH v3 4/6] rust: drm: set fops.owner from driver module pointer
From: Alvin Sun @ 2026-06-22 2:44 UTC
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
In-Reply-To: <20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev>
Change `create_fops()` to accept an owner module pointer instead of
hardcoding `null_mut()`, ensuring the kernel correctly tracks the
module owning the DRM device's file operations.
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
rust/kernel/drm/device.rs | 3 ++-
rust/kernel/drm/gem/mod.rs | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/rust/kernel/drm/device.rs b/rust/kernel/drm/device.rs
index 403fc35353c74..221667d843c41 100644
--- a/rust/kernel/drm/device.rs
+++ b/rust/kernel/drm/device.rs
@@ -111,7 +111,8 @@ impl<T: drm::Driver> Device<T> {
fops: &Self::GEM_FOPS,
};
- const GEM_FOPS: bindings::file_operations = drm::gem::create_fops();
+ const GEM_FOPS: bindings::file_operations =
+ drm::gem::create_fops(crate::this_module::<T::OwnerModule>().as_ptr());
/// Create a new `drm::Device` for a `drm::Driver`.
pub fn new(dev: &device::Device, data: impl PinInit<T::Data, Error>) -> Result<ARef<Self>> {
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
index 01b5bd47a3332..9a203efc59116 100644
--- a/rust/kernel/drm/gem/mod.rs
+++ b/rust/kernel/drm/gem/mod.rs
@@ -357,10 +357,10 @@ impl<T: DriverObject> AllocImpl for Object<T> {
};
}
-pub(super) const fn create_fops() -> bindings::file_operations {
+pub(super) const fn create_fops(owner: *mut bindings::module) -> bindings::file_operations {
let mut fops: bindings::file_operations = pin_init::zeroed();
- fops.owner = core::ptr::null_mut();
+ fops.owner = owner;
fops.open = Some(bindings::drm_open);
fops.release = Some(bindings::drm_release);
fops.unlocked_ioctl = Some(bindings::drm_ioctl);
--
2.43.0
* [PATCH v3 5/6] rust: miscdevice: set fops.owner from driver module pointer
From: Alvin Sun @ 2026-06-22 2:44 UTC
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
In-Reply-To: <20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev>
Set the miscdevice fops owner field from the driver module pointer
via the `this_module::<T::OwnerModule>()` helper, instead of
defaulting to null.
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
rust/kernel/miscdevice.rs | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs
index 83ce50def5ac9..04fe4697ff564 100644
--- a/rust/kernel/miscdevice.rs
+++ b/rust/kernel/miscdevice.rs
@@ -26,10 +26,11 @@
mm::virt::VmaNew,
prelude::*,
seq_file::SeqFile,
+ this_module,
types::{
ForeignOwnable,
Opaque, //
- },
+ }, //
};
use core::marker::PhantomData;
@@ -430,6 +431,7 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
} else {
None
},
+ owner: this_module::<T::OwnerModule>().as_ptr(),
..pin_init::zeroed()
};
--
2.43.0
* [PATCH v3 6/6] rust: configfs: use `LocalModule` for `THIS_MODULE`
From: Alvin Sun @ 2026-06-22 2:45 UTC
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
In-Reply-To: <20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev>
Replace the `THIS_MODULE` static reference in the `configfs_attrs!`
macro with `this_module::<LocalModule>()`, and update
rnull to import `LocalModule` instead of `THIS_MODULE`, consistent
with the move of `THIS_MODULE` into the `ModuleMetadata` trait.
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
drivers/block/rnull/configfs.rs | 6 ++----
rust/kernel/configfs.rs | 8 +++++---
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/block/rnull/configfs.rs b/drivers/block/rnull/configfs.rs
index c10a55fc58948..b2547ad1e5ddd 100644
--- a/drivers/block/rnull/configfs.rs
+++ b/drivers/block/rnull/configfs.rs
@@ -1,9 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
-use super::{
- NullBlkDevice,
- THIS_MODULE, //
-};
+use super::NullBlkDevice;
+use crate::LocalModule;
use kernel::{
block::mq::gen_disk::{
GenDisk,
diff --git a/rust/kernel/configfs.rs b/rust/kernel/configfs.rs
index 2339c6467325d..8bd32627386e0 100644
--- a/rust/kernel/configfs.rs
+++ b/rust/kernel/configfs.rs
@@ -875,7 +875,7 @@ fn as_ptr(&self) -> *const bindings::config_item_type {
/// configfs::Subsystem<Configuration>,
/// Configuration
/// >::new_with_child_ctor::<N,Child>(
-/// &THIS_MODULE,
+/// ::kernel::this_module::<LocalModule>(),
/// &CONFIGURATION_ATTRS
/// );
///
@@ -1021,7 +1021,8 @@ macro_rules! configfs_attrs {
static [< $data:upper _TPE >] : $crate::configfs::ItemType<$container, $data> =
$crate::configfs::ItemType::<$container, $data>::new::<N>(
- &THIS_MODULE, &[<$ data:upper _ATTRS >]
+ $crate::this_module::<LocalModule>(),
+ &[<$ data:upper _ATTRS >]
);
)?
@@ -1030,7 +1031,8 @@ macro_rules! configfs_attrs {
$crate::configfs::ItemType<$container, $data> =
$crate::configfs::ItemType::<$container, $data>::
new_with_child_ctor::<N, $child>(
- &THIS_MODULE, &[<$ data:upper _ATTRS >]
+ $crate::this_module::<LocalModule>(),
+ &[<$ data:upper _ATTRS >]
);
)?
--
2.43.0
* [PATCH v3 2/6] rust: doctest: add LocalModule fallback for #[vtable] ThisModule
From: Alvin Sun @ 2026-06-22 2:44 UTC
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
In-Reply-To: <20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev>
Add a `LocalModule` struct with a null-pointer `ModuleMetadata` impl
in the doctest harness, so that `crate::LocalModule` (auto-inserted
by `#[vtable]`) resolves correctly when there is no `module!` macro.
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
scripts/rustdoc_test_gen.rs | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/scripts/rustdoc_test_gen.rs b/scripts/rustdoc_test_gen.rs
index ee76e96b41eea..198af4e446c8c 100644
--- a/scripts/rustdoc_test_gen.rs
+++ b/scripts/rustdoc_test_gen.rs
@@ -239,6 +239,22 @@ macro_rules! assert_eq {{
const __LOG_PREFIX: &[u8] = b"rust_doctests_kernel\0";
+/// Dummy module type for doctest context.
+struct LocalModule;
+
+use kernel::{{
+ str::CStr,
+ ModuleMetadata,
+ ThisModule, //
+}};
+use core::ptr::null_mut;
+
+impl ModuleMetadata for LocalModule {{
+ const NAME: &'static CStr = c"rust_doctests_kernel";
+ // SAFETY: `try_module_get`/`module_put` handle null module pointers gracefully.
+ const THIS_MODULE: ThisModule = unsafe {{ ThisModule::from_ptr(null_mut()) }};
+}}
+
{rust_tests}
"#
)
--
2.43.0
* [PATCH v3 3/6] rust: macros: auto-insert OwnerModule in #[vtable]
From: Alvin Sun @ 2026-06-22 2:44 UTC (permalink / raw)
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
In-Reply-To: <20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev>
Auto-add `type OwnerModule: ::kernel::ModuleMetadata;` as a required
associated type on the trait side if not already defined, and
auto-insert `type OwnerModule = crate::LocalModule;` on the impl side
if not explicitly provided, eliminating the need to manually declare
and implement `OwnerModule` in every vtable trait and impl.
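As a rough illustration of the effect, the standalone sketch below hand-writes what the trait and impl sides could look like after expansion. It is not the macro output itself: `ModuleMetadata` is a reduced stand-in that models only `NAME`, `LocalModule` is a plain struct instead of the alias emitted by `module!`, and `FileOps`, `MyDriver` and `owner_name` are hypothetical names used only for illustration.

```rust
// Userspace sketch with reduced stand-ins for the kernel items.

/// Stand-in for `kernel::ModuleMetadata` (only `NAME` is modelled).
trait ModuleMetadata {
    const NAME: &'static str;
}

/// Stand-in for `crate::LocalModule`.
struct LocalModule;

impl ModuleMetadata for LocalModule {
    const NAME: &'static str = "example_module";
}

/// Hypothetical vtable trait, written as it might look once the
/// required associated type has been added on the trait side.
trait FileOps {
    type OwnerModule: ModuleMetadata;

    fn open(&self) -> i32;
}

struct MyDriver;

/// Hypothetical impl, written as it might look once the default
/// `OwnerModule` has been inserted on the impl side.
impl FileOps for MyDriver {
    type OwnerModule = LocalModule;

    fn open(&self) -> i32 {
        0
    }
}

/// Generic code can reach the owning module through the associated type.
fn owner_name<T: FileOps>() -> &'static str {
    <T::OwnerModule as ModuleMetadata>::NAME
}

fn main() {
    println!("owner = {}", owner_name::<MyDriver>());
}
```

An impl that declares its own `type OwnerModule = ...;` would simply replace the line shown in the `impl` block, which is the override path the macro keeps open.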
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Suggested-by: Gary Guo <gary@garyguo.net>
Link: https://lore.kernel.org/all/DIMMWHUOLPSH.13JFRHDKDQJGO@garyguo.net
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
rust/macros/lib.rs | 6 ++++++
rust/macros/vtable.rs | 41 ++++++++++++++++++++++++++++++++++++-----
2 files changed, 42 insertions(+), 5 deletions(-)
diff --git a/rust/macros/lib.rs b/rust/macros/lib.rs
index 2cfd59e0f9e7c..bc7ded353c5ca 100644
--- a/rust/macros/lib.rs
+++ b/rust/macros/lib.rs
@@ -176,6 +176,12 @@ pub fn module(input: TokenStream) -> TokenStream {
///
/// This macro should not be used when all functions are required.
///
+/// Additionally, this macro automatically handles the `OwnerModule`
+/// associated type: on the trait side, `type OwnerModule: ModuleMetadata;`
+/// is added as a required associated type if not already defined; on the
+/// impl side, `type OwnerModule = LocalModule;` is automatically inserted
+/// if not explicitly defined.
+///
/// # Examples
///
/// ```
diff --git a/rust/macros/vtable.rs b/rust/macros/vtable.rs
index c6510b0c4ea1d..be9a5ed8abe5e 100644
--- a/rust/macros/vtable.rs
+++ b/rust/macros/vtable.rs
@@ -30,6 +30,22 @@ fn handle_trait(mut item: ItemTrait) -> Result<ItemTrait> {
const USE_VTABLE_ATTR: ();
});
+ // Add `type OwnerModule: ModuleMetadata` as a required associated type if
+ // the trait does not already define it.
+ if !item
+ .items
+ .iter()
+ .any(|i| matches!(i, TraitItem::Type(t) if t.ident == "OwnerModule"))
+ {
+ gen_items.push(parse_quote! {
+ /// The module implementing this vtable trait.
+ ///
+ /// Automatically set to `crate::LocalModule` by the `#[vtable]`
+ /// impl macro.
+ type OwnerModule: ::kernel::ModuleMetadata;
+ });
+ }
+
for item in &item.items {
if let TraitItem::Fn(fn_item) = item {
let name = &fn_item.sig.ident;
@@ -57,12 +73,18 @@ fn handle_trait(mut item: ItemTrait) -> Result<ItemTrait> {
fn handle_impl(mut item: ItemImpl) -> Result<ItemImpl> {
let mut gen_items = Vec::new();
- let mut defined_consts = HashSet::new();
+ let mut defined_items = HashSet::new();
- // Iterate over all user-defined constants to gather any possible explicit overrides.
+ // Iterate over all user-defined items to gather any possible explicit overrides.
for item in &item.items {
- if let ImplItem::Const(const_item) = item {
- defined_consts.insert(const_item.ident.clone());
+ match item {
+ ImplItem::Const(const_item) => {
+ defined_items.insert(const_item.ident.clone());
+ }
+ ImplItem::Type(type_item) => {
+ defined_items.insert(type_item.ident.clone());
+ }
+ _ => {}
}
}
@@ -70,6 +92,15 @@ fn handle_impl(mut item: ItemImpl) -> Result<ItemImpl> {
const USE_VTABLE_ATTR: () = ();
});
+ // Auto-insert `type OwnerModule = crate::LocalModule` if not explicitly defined.
+ // `crate::LocalModule` resolves to the real module type (via `module!`) or a
+ // dummy fallback in non-module contexts (e.g., doctests).
+ if !defined_items.contains(&parse_quote!(OwnerModule)) {
+ gen_items.push(parse_quote! {
+ type OwnerModule = crate::LocalModule;
+ });
+ }
+
for item in &item.items {
if let ImplItem::Fn(fn_item) = item {
let name = &fn_item.sig.ident;
@@ -78,7 +109,7 @@ fn handle_impl(mut item: ItemImpl) -> Result<ItemImpl> {
name.span(),
);
// Skip if it's declared already -- this allows user override.
- if defined_consts.contains(&gen_const_name) {
+ if defined_items.contains(&gen_const_name) {
continue;
}
let cfg_attrs = crate::helpers::gather_cfg_attrs(&fn_item.attrs);
--
2.43.0
* [PATCH v3 1/6] rust: module: add `THIS_MODULE` const to `ModuleMetadata` trait
From: Alvin Sun @ 2026-06-22 2:44 UTC (permalink / raw)
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
In-Reply-To: <20260622-fix-fops-owner-v3-0-49d45cb37032@linux.dev>
Since `const_refs_to_static` is stable as of the MSRV bump, a
`ThisModule` pointer can now be used in const contexts.
Add a `THIS_MODULE` const to the `ModuleMetadata` trait so that modules
can provide their `ThisModule` pointer in const contexts such as static
`file_operations`.
Move the `THIS_MODULE` static from the `module!` macro into the
`ModuleMetadata` impl, add a `this_module()` helper, and update `__init`
to use it.
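A minimal userspace sketch of the pattern, assuming a simplified `ThisModule` that wraps a plain `*mut u8` instead of the kernel's `bindings::module` pointer. `Example` and `OWNER` are hypothetical names, and the null pointer mirrors the `#[cfg(not(MODULE))]` case only.

```rust
use core::ptr::null_mut;

/// Simplified stand-in for `kernel::ThisModule`.
struct ThisModule(*mut u8);

// SAFETY: the sketch never dereferences or writes through the pointer.
unsafe impl Sync for ThisModule {}

impl ThisModule {
    /// Same shape as `ThisModule::from_ptr`, reduced for the sketch.
    const unsafe fn from_ptr(ptr: *mut u8) -> Self {
        ThisModule(ptr)
    }

    fn as_ptr(&self) -> *mut u8 {
        self.0
    }
}

/// Reduced stand-in for `kernel::ModuleMetadata`.
trait ModuleMetadata {
    const NAME: &'static str;
    const THIS_MODULE: ThisModule;
}

/// Returns a `'static` reference to the associated const of `M`.
const fn this_module<M: ModuleMetadata>() -> &'static ThisModule {
    &M::THIS_MODULE
}

/// Hypothetical module type.
struct Example;

impl ModuleMetadata for Example {
    const NAME: &'static str = "example";
    // SAFETY: a null pointer is the built-in (non-module) case.
    const THIS_MODULE: ThisModule = unsafe { ThisModule::from_ptr(null_mut()) };
}

/// The helper is usable in a static initializer, which is the point of
/// making `THIS_MODULE` a const.
static OWNER: &ThisModule = this_module::<Example>();

fn main() {
    println!("{}: owner is null = {}", Example::NAME, OWNER.as_ptr().is_null());
}
```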
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
rust/kernel/lib.rs | 8 ++++++++
rust/macros/module.rs | 34 +++++++++++++++++-----------------
2 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index b72b2fbe046d6..50f5a7b5f028e 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -184,6 +184,14 @@ fn init(module: &'static ThisModule) -> impl pin_init::PinInit<Self, error::Erro
pub trait ModuleMetadata {
/// The name of the module as specified in the `module!` macro.
const NAME: &'static crate::str::CStr;
+
+ /// The module's `THIS_MODULE` pointer.
+ const THIS_MODULE: ThisModule;
+}
+
+/// Returns a reference to the `THIS_MODULE` of the given module type.
+pub const fn this_module<M: ModuleMetadata>() -> &'static ThisModule {
+ &M::THIS_MODULE
}
/// Equivalent to `THIS_MODULE` in the C API.
diff --git a/rust/macros/module.rs b/rust/macros/module.rs
index 06c18e2075083..b9fdee2f2af47 100644
--- a/rust/macros/module.rs
+++ b/rust/macros/module.rs
@@ -497,28 +497,28 @@ pub(crate) fn module(info: ModuleInfo) -> Result<TokenStream> {
/// Used by the printing macros, e.g. [`info!`].
const __LOG_PREFIX: &[u8] = #name_cstr.to_bytes_with_nul();
- // SAFETY: `__this_module` is constructed by the kernel at load time and will not be
- // freed until the module is unloaded.
- #[cfg(MODULE)]
- static THIS_MODULE: ::kernel::ThisModule = unsafe {
- extern "C" {
- static __this_module: ::kernel::types::Opaque<::kernel::bindings::module>;
- };
-
- ::kernel::ThisModule::from_ptr(__this_module.get())
- };
-
- #[cfg(not(MODULE))]
- static THIS_MODULE: ::kernel::ThisModule = unsafe {
- ::kernel::ThisModule::from_ptr(::core::ptr::null_mut())
- };
-
/// The `LocalModule` type is the type of the module created by `module!`,
/// `module_pci_driver!`, `module_platform_driver!`, etc.
type LocalModule = #type_;
impl ::kernel::ModuleMetadata for #type_ {
const NAME: &'static ::kernel::str::CStr = #name_cstr;
+
+ #[cfg(MODULE)]
+ const THIS_MODULE: ::kernel::ThisModule = {
+ extern "C" {
+ static __this_module: ::kernel::types::Opaque<::kernel::bindings::module>;
+ }
+
+ // SAFETY: `__this_module` is constructed by the kernel at load time
+ // and lives until the module is unloaded.
+ unsafe { ::kernel::ThisModule::from_ptr(__this_module.get()) }
+ };
+
+ #[cfg(not(MODULE))]
+ const THIS_MODULE: ::kernel::ThisModule = unsafe {
+ ::kernel::ThisModule::from_ptr(::core::ptr::null_mut())
+ };
}
// Double nested modules, since then nobody can access the public items inside.
@@ -616,7 +616,7 @@ pub extern "C" fn #ident_exit() {
/// This function must only be called once.
unsafe fn __init() -> ::kernel::ffi::c_int {
let initer = <super::super::LocalModule as ::kernel::InPlaceModule>::init(
- &super::super::THIS_MODULE
+ ::kernel::this_module::<super::super::LocalModule>()
);
// SAFETY: No data race, since `__MOD` can only be accessed by this module
// and there only `__init` and `__exit` access it. These functions are only
--
2.43.0
* [PATCH v3 0/6] Fix missing fops.owner in Rust DRM/misc abstractions
From: Alvin Sun @ 2026-06-22 2:44 UTC (permalink / raw)
To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, Alvin Sun
During tyr debugfs development, a kernel NULL pointer dereference was
encountered after `rmmod tyr` while gnome-shell still held /dev/card1 open:
```
[158827.868132] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[158827.868918] Mem abort info:
[158827.869177] ESR = 0x0000000086000004
[158827.869519] EC = 0x21: IABT (current EL), IL = 32 bits
[158827.870000] SET = 0, FnV = 0
[158827.870281] EA = 0, S1PTW = 0
[158827.870571] FSC = 0x04: level 0 translation fault
[158827.871043] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000108dec000
[158827.871623] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[158827.872242] Internal error: Oops: 0000000086000004 [#1] SMP
[158827.872246] Modules linked in: tyr sunrpc snd_soc_simple_card rk805_pwrkey snd_soc_simple_card_utils rtw88_8822bu display_connector rtw88_usb rtw88_8822b snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec
rtw88_core]
[158827.872337] CPU: 4 UID: 1000 PID: 11276 Comm: gnome-s:disk$0 Tainted: G N 7.1.0-rc1+ #331 PREEMPT
[158827.880534] Tainted: [N]=TEST
[158827.880535] Hardware name: FriendlyElec NanoPi R6C/NanoPi R6C, BIOS v1.1 04/09/2025
[158827.880538] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[158827.880542] pc : 0x0
[158827.880547] lr : _RNvNtCs257m05FHVbX_3tyr2vm8pt_unmap+0x8c/0x12c [tyr]
[158827.880578] sp : ffff800083c236b0
[158827.880579] x29: ffff800083c236d0 x28: ffff00013f8a0000 x27: 0000000000000000
[158827.880585] x26: 000000000000007c x25: ffff000108e6ed80 x24: 0000000000401000
[158827.880590] x23: 0000000000000000 x22: 0000000040000000 x21: 0000000000001000
[158827.880595] x20: ffff00010f778138 x19: 0000000000400000 x18: 00000000ffffffff
[158827.880600] x17: 000000040044ffff x16: 045000f2b5503510 x15: 0720072007200720
[158827.880606] x14: 0720072007200720 x13: 0000000000401000 x12: 0000000000400000
[158827.880611] x11: ffff800083c239d0 x10: ffff000141e4fd88 x9 : 0000000000000000
[158827.880615] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000400000
[158827.880620] x5 : ffff00013f8a0000 x4 : 0000000000000000 x3 : 0000000000000001
[158827.880625] x2 : 0000000000001000 x1 : 0000000000400000 x0 : ffff00010f778138
[158827.880630] Call trace:
[158827.880632] 0x0 (P)
[158827.880635] _RNvXs6_NtCs257m05FHVbX_3tyr2vmNtB5_9GpuVmDataNtNtNtCsgmSOfgXi5CZ_6kernel3drm5gpuvm11DriverGpuVm13sm_step_unmap+0x3c/0x120 [tyr]
[158827.891166] _RNvMs4_NtNtNtCsgmSOfgXi5CZ_6kernel3drm5gpuvm6sm_opsINtB7_5GpuVmNtNtCs257m05FHVbX_3tyr2vm9GpuVmDataE13sm_step_unmapB13_+0x18/0x34 [tyr]
[158827.891187] op_unmap_cb+0x78/0xb0
[158827.891196] __drm_gpuvm_sm_unmap+0x18c/0x1b4
[158827.891204] drm_gpuvm_sm_unmap+0x38/0x4c
[158827.891209] _RNvMs5_NtCs257m05FHVbX_3tyr2vmNtB5_2Vm7exec_op+0x1cc/0x254 [tyr]
[158827.894085] _RNvMs5_NtCs257m05FHVbX_3tyr2vmNtB5_2Vm11unmap_range+0x124/0x188 [tyr]
[158827.894105] _RINvNtCs5hGKnPbRUFW_4core3ptr13drop_in_placeNtNtCs257m05FHVbX_3tyr3gem8KernelBoEBK_+0x44/0xd8 [tyr]
[158827.894125] _RINvNtCs5hGKnPbRUFW_4core3ptr13drop_in_placeINtNtNtCsgmSOfgXi5CZ_6kernel5alloc4kvec3VecNtNtCs257m05FHVbX_3tyr2fw7SectionNtNtBL_9allocator7KmallocEEB1r_+0x3c/0x100 [tyr]
[158827.894147] _RINvNtCs5hGKnPbRUFW_4core3ptr13drop_in_placeINtNtNtCsgmSOfgXi5CZ_6kernel4sync3arc3ArcNtNtCs257m05FHVbX_3tyr2fw8FirmwareEEB1p_+0x94/0x190 [tyr]
[158827.894167] _RNvMs4_NtNtCsgmSOfgXi5CZ_6kernel3drm6deviceINtB5_6DeviceNtNtCs257m05FHVbX_3tyr6driver12TyrDrmDriverE7releaseBW_+0x30/0x98 [tyr]
[158827.899550] drm_dev_put.part.0+0x88/0xc0
[158827.899557] drm_minor_release+0x18/0x28
[158827.899562] drm_release+0x144/0x170
[158827.899567] __fput+0xe4/0x30c
[158827.899573] ____fput+0x14/0x20
[158827.899579] task_work_run+0x7c/0xe8
[158827.899586] do_exit+0x2a8/0xac4
[158827.899590] do_group_exit+0x34/0x90
[158827.899594] get_signal+0xaac/0xabc
[158827.899599] arch_do_signal_or_restart+0x90/0x3e8
[158827.899606] exit_to_user_mode_loop+0x140/0x1d0
[158827.899613] el0_svc+0x2f4/0x2f8
[158827.899620] el0t_64_sync_handler+0xa0/0xe4
[158827.899627] el0t_64_sync+0x198/0x19c
[158827.899632] ---[ end trace 0000000000000000 ]---
```
The root cause: `fops.owner` was `NULL` in Rust DRM drivers, so the kernel
never blocked module unloading while file descriptors were open. This leads to
a use-after-free when `drm_release` (or another fops callback) runs in freed
module code.
The series moves `THIS_MODULE` into the `ModuleMetadata` trait as a const,
threads it through `#[vtable]` to set `fops.owner` in DRM/miscdevice, and
updates configfs and rnull to use `this_module::<LocalModule>()`.
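The following toy model, which is an illustration and not kernel code, shows why a null owner matters. It assumes the usual behaviour that taking a reference on a null module pointer succeeds without pinning anything, and that unloading is refused while the reference count is nonzero. `Module`, `try_unload` and `unload_succeeds_with_open_file` are hypothetical names.

```rust
/// Toy stand-in for `struct module` with a reference count.
struct Module {
    refcount: u32,
    loaded: bool,
}

/// Models `try_module_get()`: a null (`None`) owner succeeds but pins nothing.
fn try_module_get(owner: Option<&mut Module>) -> bool {
    match owner {
        None => true,
        Some(m) => {
            if !m.loaded {
                return false;
            }
            m.refcount += 1;
            true
        }
    }
}

/// Models an unload request: refused while references are held.
fn try_unload(m: &mut Module) -> bool {
    if m.refcount != 0 {
        return false;
    }
    m.loaded = false;
    true
}

/// Opens a file (taking a reference on the fops owner), then attempts an
/// unload. Returns whether the unload went ahead with the file still open.
fn unload_succeeds_with_open_file(owner_is_set: bool) -> bool {
    let mut module = Module { refcount: 0, loaded: true };
    let owner = if owner_is_set { Some(&mut module) } else { None };
    assert!(try_module_get(owner));
    try_unload(&mut module)
}

fn main() {
    println!("owner NULL: unload allowed = {}", unload_succeeds_with_open_file(false));
    println!("owner set:  unload allowed = {}", unload_succeeds_with_open_file(true));
}
```

With the owner unset the unload proceeds while the file is open, which corresponds to the crash above. With the owner set the unload is refused until the file is released.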
Assisted-by: opencode:glm-5.2
Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
---
Changes in v3:
- Renamed vtable associated type `ThisModule` to `OwnerModule`
- Added `this_module()` helper for ergonomic `THIS_MODULE` access
- Refined vtable macro implementation: one-liner detection and single `defined_items` set
- Reordered commits to place doctest fallback before vtable auto-insert
- Link to v2: https://lore.kernel.org/r/20260521-fix-fops-owner-v2-0-fd99079c5a04@linux.dev
Changes in v2:
- Merged old `static THIS_MODULE` and v1's `MODULE_PTR` into a single
`ModuleMetadata::THIS_MODULE` const
- `#[vtable]` macro now auto-inserts `type ThisModule`, removing all per-driver
manual patches from v1
- Added configfs & rnull usage site updates and doctest `LocalModule` fallback
- Link to v1: https://lore.kernel.org/r/20260519-fix-fops-owner-v1-0-2ded9830da14@linux.dev
---
Alvin Sun (6):
rust: module: add `THIS_MODULE` const to `ModuleMetadata` trait
rust: doctest: add LocalModule fallback for #[vtable] ThisModule
rust: macros: auto-insert OwnerModule in #[vtable]
rust: drm: set fops.owner from driver module pointer
rust: miscdevice: set fops.owner from driver module pointer
rust: configfs: use `LocalModule` for `THIS_MODULE`
drivers/block/rnull/configfs.rs | 6 ++----
rust/kernel/configfs.rs | 8 +++++---
rust/kernel/drm/device.rs | 3 ++-
rust/kernel/drm/gem/mod.rs | 4 ++--
rust/kernel/lib.rs | 8 ++++++++
rust/kernel/miscdevice.rs | 4 +++-
rust/macros/lib.rs | 6 ++++++
rust/macros/module.rs | 34 +++++++++++++++++-----------------
rust/macros/vtable.rs | 41 ++++++++++++++++++++++++++++++++++++-----
scripts/rustdoc_test_gen.rs | 16 ++++++++++++++++
10 files changed, 97 insertions(+), 33 deletions(-)
---
base-commit: b7e5ac83cb16f7ffd11dc23736f84276602100ed
change-id: 20260519-fix-fops-owner-e3a77bb27c6c
prerequisite-change-id: 20260519-miscdev-use-format-9ab7e83b1c11:v3
prerequisite-patch-id: 405b334ff0d48ad350014f05a2321bdbaa025400
prerequisite-patch-id: 604b631c81d5423f4ebb2e12ba2d22e9ce371bfc
prerequisite-patch-id: cb550d94cefe01920e0d3ced2b2bcbecd76f3907
prerequisite-patch-id: 3bc830839742591460cb86d9472c04f4686dc600
prerequisite-patch-id: 571058244bc8c7088638d2e3225713011246c7e9
prerequisite-patch-id: 347c5a3c6dbef9832bfce8419fc23e6e08ba477f
prerequisite-patch-id: 3e202d988b56b88446f7535e90d3f00cf5f15701
Best regards,
--
Alvin Sun <alvin.sun@linux.dev>
* Re: [PATCH] nbd: don't warn when reclassifying a busy socket lock
From: Hillf Danton @ 2026-06-22 1:43 UTC (permalink / raw)
To: Deepanshu Kartikey
Cc: edumazet, linux-block, nbd, linux-kernel,
syzbot+6b85d1e39a5b8ed9a954
In-Reply-To: <20260621235255.66015-1-kartikey406@gmail.com>
On Mon, 22 Jun 2026 05:22:55 +0530 Deepanshu Kartikey wrote:
> nbd_reclassify_socket() warns via WARN_ON_ONCE() if the socket lock is
> held at the point of reclassification. That assertion was copied from
> nvme-tcp, where the socket is created internally by the kernel
> (sock_create_kern()) and is never visible to user space, so the lock
> is guaranteed to be free.
>
> NBD is different: the socket is looked up from a user-supplied fd in
> nbd_get_socket(), and user space retains that fd. A concurrent syscall
> on the same socket (or softirq processing taking bh_lock_sock() on a
> connected TCP socket) can legitimately hold the lock at the instant
> NBD reclassifies it. sock_allow_reclassification() then returns false
> and the WARN_ON_ONCE() fires, which turns into a crash under
> panic_on_warn. This is reachable by simply racing NBD_CMD_CONNECT
> against socket activity on the same fd, as reported by syzbot.
>
Given the syzbot report, if you are right (as I suspect), then Eric delivered
another half-baked croissant; feel free to cut it off instead to make
room for a correct fix.
> Hitting a held lock here is expected for an externally owned socket and
> is not a kernel bug, so skip reclassification silently instead of
> warning. Reclassification is a lockdep-only annotation, so skipping it
> in the rare racing case is harmless.
>
> Reported-by: syzbot+6b85d1e39a5b8ed9a954@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=6b85d1e39a5b8ed9a954
> Fixes: d532cddb6c60 ("nbd: Reclassify sockets to avoid lockdep circular dependency")
> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
> ---
> drivers/block/nbd.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 3a585a0c882a..8f10762e90ef 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1246,7 +1246,7 @@ static void nbd_reclassify_socket(struct socket *sock)
> {
> struct sock *sk = sock->sk;
>
> - if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
> + if (!sock_allow_reclassification(sk))
> return;
>
> switch (sk->sk_family) {
> --
> 2.43.0
* [PATCH] nbd: don't warn when reclassifying a busy socket lock
From: Deepanshu Kartikey @ 2026-06-21 23:52 UTC (permalink / raw)
To: josef, axboe, edumazet
Cc: linux-block, nbd, linux-kernel, Deepanshu Kartikey,
syzbot+6b85d1e39a5b8ed9a954
nbd_reclassify_socket() warns via WARN_ON_ONCE() if the socket lock is
held at the point of reclassification. That assertion was copied from
nvme-tcp, where the socket is created internally by the kernel
(sock_create_kern()) and is never visible to user space, so the lock
is guaranteed to be free.
NBD is different: the socket is looked up from a user-supplied fd in
nbd_get_socket(), and user space retains that fd. A concurrent syscall
on the same socket (or softirq processing taking bh_lock_sock() on a
connected TCP socket) can legitimately hold the lock at the instant
NBD reclassifies it. sock_allow_reclassification() then returns false
and the WARN_ON_ONCE() fires, which turns into a crash under
panic_on_warn. This is reachable by simply racing NBD_CMD_CONNECT
against socket activity on the same fd, as reported by syzbot.
Hitting a held lock here is expected for an externally owned socket and
is not a kernel bug, so skip reclassification silently instead of
warning. Reclassification is a lockdep-only annotation, so skipping it
in the rare racing case is harmless.
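The behaviour change can be modelled in a few lines. This is a userspace sketch only: `SockLock` and its two fields are an assumed simplification of the lock state that `sock_allow_reclassification()` inspects, and `Outcome` and `reclassify` are hypothetical names.

```rust
/// Assumed simplification of the socket lock state that the check inspects.
struct SockLock {
    /// Set while a syscall owns the socket lock.
    owned_by_user: bool,
    /// Set while the spinlock is held, for example from softirq context.
    slock_held: bool,
}

/// Reclassification is allowed only when nobody holds the lock.
fn allow_reclassification(lock: &SockLock) -> bool {
    !lock.owned_by_user && !lock.slock_held
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Reclassified,
    Skipped,
}

/// After the patch: a busy lock is skipped silently, with no warning.
fn reclassify(lock: &SockLock) -> Outcome {
    if !allow_reclassification(lock) {
        return Outcome::Skipped;
    }
    Outcome::Reclassified
}

fn main() {
    let idle = SockLock { owned_by_user: false, slock_held: false };
    let busy = SockLock { owned_by_user: true, slock_held: false };
    println!("idle: {:?}, busy: {:?}", reclassify(&idle), reclassify(&busy));
}
```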
Reported-by: syzbot+6b85d1e39a5b8ed9a954@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6b85d1e39a5b8ed9a954
Fixes: d532cddb6c60 ("nbd: Reclassify sockets to avoid lockdep circular dependency")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
---
drivers/block/nbd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 3a585a0c882a..8f10762e90ef 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1246,7 +1246,7 @@ static void nbd_reclassify_socket(struct socket *sock)
{
struct sock *sk = sock->sk;
- if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
+ if (!sock_allow_reclassification(sk))
return;
switch (sk->sk_family) {
--
2.43.0
* Re: [PATCH] block, bfq: protect async queue reset with blkcg locks
From: Tao Cui @ 2026-06-21 18:33 UTC (permalink / raw)
To: Cen Zhang, Yu Kuai, Tejun Heo, Josef Bacik, Jens Axboe,
Arianna Avanzini, Paolo Valente
Cc: cui.tao, linux-block, cgroups, linux-kernel, baijiaju1990
In-Reply-To: <20260621135930.2657810-1-zzzccc427@gmail.com>
Nice catch. The race is real, and the fix lines up with how the rest
of the blkcg code already protects blkg_list walks — the new nesting
(blkcg_mutex -> queue_lock -> bfqd->lock) is the same order
blkg_free_workfn() and bfq_pd_offline() use, so no inversion.
Reviewed-by: Tao Cui <cuitao@kylinos.cn>
On 2026/6/21 21:59, Cen Zhang wrote:
> Writing 0 to BFQ's low_latency attribute ends weight raising for active,
> idle and async queues. The async cgroup path walks q->blkg_list, converts
> each blkg to BFQ policy data and then reads bfqg->async_bfqq and
> bfqg->async_idle_bfqq.
>
> That walk was protected only by bfqd->lock. blkcg release work is
> serialized by q->blkcg_mutex and q->queue_lock instead, and
> blkg_free_workfn() can call BFQ's pd_free_fn before it removes
> blkg->q_node from q->blkg_list. A low_latency reset can therefore still
> find the blkg on the queue list after the BFQ policy data has been freed.
>
> The buggy scenario involves two paths, with each column showing the order
> within that path:
>
> BFQ low_latency reset: blkcg blkg release work:
> 1. bfq_low_latency_store() 1. blkg_free_workfn() takes
> calls bfq_end_wr(). q->blkcg_mutex.
> 2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the
> q->blkg_list. final bfq_group reference.
> 3. blkg_to_bfqg() returns 3. blkg->q_node remains on
> the stale policy data. q->blkg_list until list_del_init().
> 4. bfq_end_wr_async_queues()
> reads async queue fields.
>
> Fix this by taking q->blkcg_mutex and q->queue_lock around the
> q->blkg_list walk, then taking bfqd->lock before touching BFQ async
> queues. The mutex serializes against policy-data free and queue_lock
> stabilizes the list. Move the async reset out of bfq_end_wr()'s existing
> bfqd->lock critical section so the lock order matches blkcg policy
> callbacks.
* [PATCH] block, bfq: protect async queue reset with blkcg locks
From: Cen Zhang @ 2026-06-21 13:59 UTC (permalink / raw)
To: Yu Kuai, Tejun Heo, Josef Bacik, Jens Axboe, Arianna Avanzini,
Paolo Valente
Cc: linux-block, cgroups, linux-kernel, baijiaju1990, zzzccc427
Writing 0 to BFQ's low_latency attribute ends weight raising for active,
idle and async queues. The async cgroup path walks q->blkg_list, converts
each blkg to BFQ policy data and then reads bfqg->async_bfqq and
bfqg->async_idle_bfqq.
That walk was protected only by bfqd->lock. blkcg release work is
serialized by q->blkcg_mutex and q->queue_lock instead, and
blkg_free_workfn() can call BFQ's pd_free_fn before it removes
blkg->q_node from q->blkg_list. A low_latency reset can therefore still
find the blkg on the queue list after the BFQ policy data has been freed.
The buggy scenario involves two paths, with each column showing the order
within that path:
BFQ low_latency reset:             blkcg blkg release work:
1. bfq_low_latency_store()         1. blkg_free_workfn() takes
   calls bfq_end_wr().                q->blkcg_mutex.
2. bfq_end_wr_async() walks        2. BFQ pd_free_fn drops the
   q->blkg_list.                      final bfq_group reference.
3. blkg_to_bfqg() returns          3. blkg->q_node remains on
   the stale policy data.             q->blkg_list until list_del_init().
4. bfq_end_wr_async_queues()
   reads async queue fields.
Fix this by taking q->blkcg_mutex and q->queue_lock around the
q->blkg_list walk, then taking bfqd->lock before touching BFQ async
queues. The mutex serializes against policy-data free and queue_lock
stabilizes the list. Move the async reset out of bfq_end_wr()'s existing
bfqd->lock critical section so the lock order matches blkcg policy
callbacks.
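The lock nesting described above can be sketched in userspace as follows. This is an illustration under assumptions, not the patch: the three kernel locks are modelled with `std::sync::Mutex`, and `Group`, `wr_coeff`, `make_queue` and `all_reset` are hypothetical names.

```rust
use std::sync::Mutex;

/// Hypothetical per-cgroup state; `wr_coeff > 1` means weight raising is on.
struct Group {
    wr_coeff: u32,
}

/// Toy request queue holding the three locks in their nesting order.
struct RequestQueue {
    /// Serializes against policy-data free.
    blkcg_mutex: Mutex<()>,
    /// Stabilizes the list of groups it guards.
    queue_lock: Mutex<Vec<Group>>,
    /// Protects the scheduler's own queue state.
    bfqd_lock: Mutex<()>,
}

fn make_queue(groups: usize) -> RequestQueue {
    RequestQueue {
        blkcg_mutex: Mutex::new(()),
        queue_lock: Mutex::new((0..groups).map(|_| Group { wr_coeff: 30 }).collect()),
        bfqd_lock: Mutex::new(()),
    }
}

/// Walks the group list with all three locks held, taken in the order
/// blkcg_mutex -> queue_lock -> bfqd_lock. Returns the number of groups reset.
fn end_wr_async(q: &RequestQueue) -> usize {
    let _blkcg = q.blkcg_mutex.lock().unwrap();
    let mut blkg_list = q.queue_lock.lock().unwrap();
    let _bfqd = q.bfqd_lock.lock().unwrap();

    let mut reset = 0;
    for group in blkg_list.iter_mut() {
        group.wr_coeff = 1;
        reset += 1;
    }
    reset
    // The guards are dropped in reverse order of acquisition here.
}

fn all_reset(q: &RequestQueue) -> bool {
    q.queue_lock.lock().unwrap().iter().all(|g| g.wr_coeff == 1)
}

fn main() {
    let q = make_queue(3);
    println!("groups reset: {}", end_wr_async(&q));
    println!("all reset: {}", all_reset(&q));
}
```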
Validation reproduced this kernel report:
BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340
Call Trace:
<TASK>
dump_stack_lvl+0x66/0xa0
print_report+0xce/0x630
? bfq_end_wr_async_queues+0x246/0x340
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0x20d/0x410
? bfq_end_wr_async_queues+0x246/0x340
kasan_report+0xe0/0x110
? bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async+0xba/0x180
bfq_low_latency_store+0x4e5/0x690
? 0xffffffffc02150da
? __pfx_bfq_low_latency_store+0x10/0x10
? __pfx_bfq_low_latency_store+0x10/0x10
elv_attr_store+0xc4/0x110
kernfs_fop_write_iter+0x2f5/0x4a0
vfs_write+0x604/0x11f0
? __pfx_locks_remove_posix+0x10/0x10
? __pfx_vfs_write+0x10/0x10
ksys_write+0xf9/0x1d0
? __pfx_ksys_write+0x10/0x10
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task 544:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
bfq_pd_alloc+0xc0/0x1b0
blkg_alloc+0x346/0x960
blkg_create+0x8c2/0x10d0
bio_associate_blkg_from_css+0x9f3/0xfa0
bio_associate_blkg+0xd9/0x200
bio_init+0x303/0x640
__blkdev_direct_IO_simple+0x56b/0x8a0
blkdev_direct_IO+0x8e7/0x2580
blkdev_read_iter+0x205/0x400
vfs_read+0x7b0/0xda0
ksys_read+0xf9/0x1d0
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 465:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x5f/0x80
kfree+0x307/0x580
blkg_free_workfn+0xef/0x460
process_one_work+0x8d0/0x1870
worker_thread+0x575/0xf80
kthread+0x2e7/0x3c0
ret_from_fork+0x576/0x810
ret_from_fork_asm+0x1a/0x30
Fixes: 44e44a1b329e ("block, bfq: improve responsiveness")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
block/bfq-cgroup.c | 13 ++++++++++++-
block/bfq-iosched.c | 3 ++-
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 0bd0332b3d78..d8fdace464b4 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -936,14 +936,23 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
void bfq_end_wr_async(struct bfq_data *bfqd)
{
+ struct request_queue *q = bfqd->queue;
struct blkcg_gq *blkg;
- list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
+ mutex_lock(&q->blkcg_mutex);
+ spin_lock_irq(&q->queue_lock);
+ spin_lock(&bfqd->lock);
+
+ list_for_each_entry(blkg, &q->blkg_list, q_node) {
struct bfq_group *bfqg = blkg_to_bfqg(blkg);
bfq_end_wr_async_queues(bfqd, bfqg);
}
bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+
+ spin_unlock(&bfqd->lock);
+ spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
}
static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v)
@@ -1416,7 +1425,9 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {}
void bfq_end_wr_async(struct bfq_data *bfqd)
{
+ spin_lock_irq(&bfqd->lock);
bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+ spin_unlock_irq(&bfqd->lock);
}
struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 141c602d5e85..eec9be62061b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2653,9 +2653,10 @@ static void bfq_end_wr(struct bfq_data *bfqd)
}
list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
bfq_bfqq_end_wr(bfqq);
- bfq_end_wr_async(bfqd);
spin_unlock_irq(&bfqd->lock);
+
+ bfq_end_wr_async(bfqd);
}
static sector_t bfq_io_struct_pos(void *io_struct, bool request)
--
2.43.0
* [PATCH] blk-iolatency: flush enable work after policy deactivation
From: Cen Zhang @ 2026-06-21 13:59 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: cgroups, linux-block, linux-kernel, baijiaju1990, zzzccc427
A blk-iolatency rq-qos teardown can free struct blk_iolatency while a
freshly queued enable_work callback still references it. The failure
arises as follows.
blkcg_iolatency_exit() flushes enable_work before deactivating the
iolatency policy. However, blkcg_deactivate_policy() calls
iolatency_pd_offline() for online policy data, and iolatency_pd_offline()
clears min_lat_nsec through iolatency_set_min_lat_nsec(). If this clears
the last nonzero latency target, enable_cnt reaches zero and schedules
enable_work again after the flush has already returned.
The buggy scenario involves two paths, with each column showing the order
within that path:
blkcg_iolatency_exit() path:          system_wq worker path:
1. Flush old enable_work.             1. enable_work is idle.
2. Deactivate the policy.             2. no worker owns it.
3. Offline queues new enable_work.    3. work item becomes pending.
4. Free blkiolat.                     4. worker later runs the item.
5. Owner storage is gone.             5. worker dereferences blkiolat.
Flush enable_work again after blkcg_deactivate_policy() returns and before
freeing blkiolat. Policy offline callbacks have completed at that point,
so the second drain covers the late queueing path without changing the
normal enable/disable accounting rules.
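The ordering problem can be reproduced with a small single-threaded model. This is a sketch under assumptions, not the kernel code: `Teardown` and its fields are hypothetical, and `deactivate_policy()` here always queues the work item, which corresponds to the case where the offline callback clears the last nonzero latency target.

```rust
/// Toy model of the teardown state.
struct Teardown {
    /// The work item is queued but has not run yet.
    pending: bool,
    /// The owner structure has been freed.
    freed: bool,
    /// The work item ran after the owner was freed.
    ran_after_free: bool,
}

impl Teardown {
    /// Models `flush_work()`: any pending item runs now, while the owner is valid.
    fn flush_work(&mut self) {
        self.pending = false;
    }

    /// Models policy deactivation whose offline callback queues the work again.
    fn deactivate_policy(&mut self) {
        self.pending = true;
    }

    /// Models freeing the owner structure.
    fn free(&mut self) {
        self.freed = true;
    }

    /// Models the worker picking up a still-pending item at some later time.
    fn worker_runs(&mut self) {
        if self.pending {
            self.pending = false;
            if self.freed {
                self.ran_after_free = true;
            }
        }
    }
}

/// Runs the exit sequence and reports whether the work ran after the free.
fn exit_hits_uaf(second_flush: bool) -> bool {
    let mut t = Teardown { pending: true, freed: false, ran_after_free: false };
    t.flush_work();
    t.deactivate_policy();
    if second_flush {
        t.flush_work();
    }
    t.free();
    t.worker_runs();
    t.ran_after_free
}

fn main() {
    println!("single flush: use-after-free = {}", exit_hits_uaf(false));
    println!("second flush: use-after-free = {}", exit_hits_uaf(true));
}
```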
Validation reproduced this kernel report:
BUG: KASAN: slab-use-after-free in assign_work+0x2a/0x150
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_report+0xd0/0x630
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0xea/0x1a0
? assign_work+0x2a/0x150
kasan_report+0xce/0x100
? assign_work+0x2a/0x150
assign_work+0x2a/0x150
worker_thread+0x1b7/0x500
? __pfx_worker_thread+0x10/0x10
kthread+0x192/0x1d0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2ac/0x3c0
? __pfx_ret_from_fork+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? __switch_to+0x2d5/0x6e0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 470:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0x8f/0xa0
iolatency_set_limit+0x301/0x450
cgroup_file_write+0x178/0x2e0
kernfs_fop_write_iter+0x1ef/0x290
vfs_write+0x446/0x6f0
ksys_write+0xc7/0x160
do_syscall_64+0xf9/0x540
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 611:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x43/0x70
kfree+0x131/0x390
rq_qos_exit+0x5d/0x90
__del_gendisk+0x394/0x490
del_gendisk+0xa1/0xe0
virtblk_remove+0x41/0xd0
virtio_dev_remove+0x63/0xe0
device_release_driver_internal+0x246/0x2e0
unbind_store+0xa9/0xb0
kernfs_fop_write_iter+0x1ef/0x290
vfs_write+0x446/0x6f0
ksys_write+0xc7/0x160
do_syscall_64+0xf9/0x540
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Last potentially related work creation:
kasan_save_stack+0x33/0x60
kasan_record_aux_stack+0x8c/0xa0
__queue_work+0x42a/0x800
queue_work_on+0x5d/0x70
iolatency_set_min_lat_nsec+0x196/0x230
iolatency_pd_offline+0x1f/0x40
blkcg_deactivate_policy+0x194/0x270
blkcg_iolatency_exit+0x33/0x40
rq_qos_exit+0x5d/0x90
__del_gendisk+0x394/0x490
del_gendisk+0xa1/0xe0
virtblk_remove+0x41/0xd0
virtio_dev_remove+0x63/0xe0
device_release_driver_internal+0x246/0x2e0
unbind_store+0xa9/0xb0
kernfs_fop_write_iter+0x1ef/0x290
vfs_write+0x446/0x6f0
ksys_write+0xc7/0x160
do_syscall_64+0xf9/0x540
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Second to last potentially related work creation:
kasan_save_stack+0x33/0x60
kasan_record_aux_stack+0x8c/0xa0
__queue_work+0x42a/0x800
queue_work_on+0x5d/0x70
iolatency_set_min_lat_nsec+0x196/0x230
iolatency_set_limit+0x3f1/0x450
cgroup_file_write+0x178/0x2e0
kernfs_fop_write_iter+0x1ef/0x290
vfs_write+0x446/0x6f0
ksys_write+0xc7/0x160
do_syscall_64+0xf9/0x540
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 8a177a36da6c ("blk-iolatency: Fix inflight count imbalances and IO hangs on offline")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
block/blk-iolatency.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 1aaee6fb0f59..a0bdd8a5c94c 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -639,6 +639,11 @@ static void blkcg_iolatency_exit(struct rq_qos *rqos)
timer_shutdown_sync(&blkiolat->timer);
flush_work(&blkiolat->enable_work);
blkcg_deactivate_policy(rqos->disk, &blkcg_policy_iolatency);
+ /*
+ * blkcg_deactivate_policy() invokes iolatency_pd_offline(), which may
+ * queue enable_work again when it clears the last latency target.
+ */
+ flush_work(&blkiolat->enable_work);
kfree(blkiolat);
}
--
2.43.0
* [syzbot] [nbd?] WARNING in nbd_add_socket
From: syzbot @ 2026-06-21 6:23 UTC (permalink / raw)
To: axboe, josef, linux-block, linux-kernel, nbd, netdev,
syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: b85966adbf5d Merge tag 'net-next-7.2' of git://git.kernel...
git tree: net
console output: https://syzkaller.appspot.com/x/log.txt?x=101f6d56580000
kernel config: https://syzkaller.appspot.com/x/.config?x=9a9f723a32776544
dashboard link: https://syzkaller.appspot.com/bug?extid=6b85d1e39a5b8ed9a954
compiler: Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13584aae580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11fd7b7a580000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/780edcc3cc37/disk-b85966ad.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/967dd18c7ecd/vmlinux-b85966ad.xz
kernel image: https://storage.googleapis.com/syzbot-assets/cf9fa92c90ff/bzImage-b85966ad.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6b85d1e39a5b8ed9a954@syzkaller.appspotmail.com
netlink: 3936 bytes leftover after parsing attributes in process `syz.0.25'.
------------[ cut here ]------------
!sock_allow_reclassification(sk)
WARNING: drivers/block/nbd.c:1249 at nbd_reclassify_socket drivers/block/nbd.c:1249 [inline], CPU#0: syz.0.25/5992
WARNING: drivers/block/nbd.c:1249 at nbd_add_socket+0xf35/0x12c0 drivers/block/nbd.c:1293, CPU#0: syz.0.25/5992
Modules linked in:
CPU: 0 UID: 0 PID: 5992 Comm: syz.0.25 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:nbd_reclassify_socket drivers/block/nbd.c:1249 [inline]
RIP: 0010:nbd_add_socket+0xf35/0x12c0 drivers/block/nbd.c:1293
Code: f7 e8 6f b5 20 fc bf e0 01 00 00 49 03 3e 48 c7 c6 40 02 55 8c e8 2b a8 1b fb b8 f0 ff ff ff e9 b2 fd ff ff e8 ac 60 b5 fb 90 <0f> 0b 90 e9 16 f8 ff ff e8 5e 2e 97 05 44 89 e9 80 e1 07 fe c1 38
RSP: 0018:ffffc90002ef7160 EFLAGS: 00010293
RAX: ffffffff86109574 RBX: 1ffff1100651ddb9 RCX: ffff888020b68000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffffc90002ef7250 R08: ffff888035af2bdf R09: 1ffff11006b5e57b
R10: dffffc0000000000 R11: ffffed1006b5e57c R12: ffff8880328eec00
R13: 1ffff920005dee38 R14: dffffc0000000000 R15: 0000000000000001
FS: 00007fcc9d5dd6c0(0000) GS:ffff88812527c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81a8ab50f0 CR3: 0000000078780000 CR4: 00000000003526f0
Call Trace:
<TASK>
nbd_genl_connect+0x133d/0x1c10 drivers/block/nbd.c:2254
genl_family_rcv_msg_doit+0x233/0x340 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x614/0x7a0 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x226/0x4a0 net/netlink/af_netlink.c:2556
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x7bb/0x940 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1900
sock_sendmsg_nosec net/socket.c:775 [inline]
__sock_sendmsg net/socket.c:790 [inline]
____sys_sendmsg+0x9b9/0xa20 net/socket.c:2684
___sys_sendmsg+0x2a5/0x360 net/socket.c:2738
__sys_sendmsg net/socket.c:2770 [inline]
__do_sys_sendmsg net/socket.c:2775 [inline]
__se_sys_sendmsg net/socket.c:2773 [inline]
__x64_sys_sendmsg+0x1b1/0x290 net/socket.c:2773
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcc9df9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fcc9d5dd028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fcc9e216090 RCX: 00007fcc9df9ce59
RDX: 0000000000004040 RSI: 0000200000000140 RDI: 0000000000000004
RBP: 00007fcc9e032d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fcc9e216128 R14: 00007fcc9e216090 R15: 00007ffc8f827678
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* [PATCH] Documentation: ABI: fix "unexpected indentation" error in sysfs-block
From: Jay Winston @ 2026-06-21 6:02 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, linux-kernel, corbet, Jay Winston
`make htmldocs` reports:
Documentation/ABI/stable/sysfs-block:612: ERROR: Unexpected indentation
The leading dashes at lines 623, 636, and 641 were parsed as line
continuations with errant indentation rather than as bullet points,
because the blank lines before them were missing. Add the blank lines.
Signed-off-by: Jay Winston <jaybenjaminwinston@gmail.com>
---
Documentation/ABI/stable/sysfs-block | 3 +++
1 file changed, 3 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index aa1e94169666..f4bce370a540 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -620,6 +620,7 @@ Description:
- async_depth is always equal to nr_requests.
For bfq scheduler:
+
- By default, async_depth is set to 75% of nr_requests.
Internal limits are then derived from this value:
* Sync writes: limited to async_depth (≈75% of nr_requests).
@@ -633,11 +634,13 @@ Description:
these limits proportionally based on the new value.
For Kyber:
+
- By default async_depth is set to 75% of nr_requests.
- If the user writes a custom value to async_depth, then it override the
default and directly control the limit for writes and async I/O.
For mq-deadline:
+
- By default async_depth is set to nr_requests.
- If the user writes a custom value to async_depth, then it override the
default and directly control the limit for writes and async I/O.
--
2.46.4
* [PATCH] block: Make WBT latency writes honor enable state
From: guzebing @ 2026-06-21 1:40 UTC (permalink / raw)
To: Jens Axboe; +Cc: Guzebing, linux-block, linux-kernel
From: Guzebing <guzebing1612@gmail.com>
queue/wbt_lat_usec controls both the stored WBT latency target and the
effective WBT enable state.
The old no-op check skipped updates whenever the converted latency
matched the stored min_lat_nsec. That check ignored whether the current
WBT state already matched the state requested by the write. For a queue
disabled by default, attempting to enable WBT by writing the default
value through sysfs could return success while the enable state was left
unchanged.
Treat a write as a no-op only when both the stored latency and the
effective WBT enabled state already match the converted value.
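The difference between the two no-op rules can be sketched in userspace C.
This is a reduced model, not the kernel code: struct wbt_model and both
helper names are hypothetical, and the enable state is collapsed to a
single boolean for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced view of WBT state; not the kernel's struct rq_wb. */
struct wbt_model {
	uint64_t min_lat_nsec;	/* stored latency target */
	bool enabled;		/* effective enable state */
};

/* Old rule: a write is a no-op when the stored latency already matches. */
bool old_write_is_noop(const struct wbt_model *m, uint64_t val_nsec)
{
	assert(m != NULL);
	return m->min_lat_nsec == val_nsec;
}

/* New rule: a no-op only when latency and enable state both already match. */
bool new_write_is_noop(const struct wbt_model *m, uint64_t val_nsec)
{
	bool want_enabled = val_nsec != 0;

	assert(m != NULL);
	return m->min_lat_nsec == val_nsec && m->enabled == want_enabled;
}
```

With a stored latency of 2000000 nsec and WBT disabled, writing 2000000
is skipped under the old rule but processed under the new one.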
Signed-off-by: Guzebing <guzebing1612@gmail.com>
---
Background:
The issue can be reproduced on an NVMe namespace when BFQ is available:
echo bfq > /sys/block/nvme0n1/queue/scheduler
cat /sys/block/nvme0n1/queue/wbt_lat_usec
echo 2000 > /sys/block/nvme0n1/queue/wbt_lat_usec
cat /sys/block/nvme0n1/queue/wbt_lat_usec
After BFQ is selected for the queue, WBT is disabled by default. On a
non-rotational NVMe namespace the stored default latency remains
2000000 nsec, while the sysfs file reports 0 because the effective WBT
state is disabled:
queue/wbt_lat_usec = 0
debugfs enabled = 3
debugfs min_lat_nsec = 2000000
Writing the default value succeeds, but the old no-op check skips the
state transition because min_lat_nsec already matches the converted
value:
echo 2000 > /sys/block/nvme0n1/queue/wbt_lat_usec
# echo returns success, but:
queue/wbt_lat_usec = 0
debugfs enabled = 3
debugfs min_lat_nsec = 2000000
As a control, writing a non-default value first does work:
echo 5000 > /sys/block/nvme0n1/queue/wbt_lat_usec
queue/wbt_lat_usec = 5000
debugfs enabled = 2
debugfs min_lat_nsec = 5000000
Writing the default value after that also works, because the stored
latency changes from 5000000 nsec back to 2000000 nsec:
echo 2000 > /sys/block/nvme0n1/queue/wbt_lat_usec
queue/wbt_lat_usec = 2000
debugfs enabled = 2
debugfs min_lat_nsec = 2000000
With this patch, writing the default value after BFQ default-disables
WBT also re-enables WBT as expected:
queue/wbt_lat_usec = 2000
debugfs enabled = 2
debugfs min_lat_nsec = 2000000
block/blk-wbt.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index dcc2438ca16dc..953d400fd0137 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -813,6 +813,21 @@ static void wbt_queue_depth_changed(struct rq_qos *rqos)
wbt_update_limits(RQWB(rqos));
}
+static bool wbt_set_lat_changed(struct request_queue *q, u64 val)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ struct rq_wb *rwb;
+
+ if (!rqos)
+ return true;
+
+ rwb = RQWB(rqos);
+ if (rwb->min_lat_nsec != val)
+ return true;
+
+ return rwb_enabled(rwb) != !!val;
+}
+
static void wbt_exit(struct rq_qos *rqos)
{
struct rq_wb *rwb = RQWB(rqos);
@@ -1005,8 +1020,12 @@ int wbt_set_lat(struct gendisk *disk, s64 val)
else if (val >= 0)
val *= 1000ULL;
- if (wbt_get_min_lat(q) == val)
+ mutex_lock(&disk->rqos_state_mutex);
+ if (!wbt_set_lat_changed(q, val)) {
+ mutex_unlock(&disk->rqos_state_mutex);
goto out;
+ }
+ mutex_unlock(&disk->rqos_state_mutex);
blk_mq_quiesce_queue(q);
--
2.20.1
* Re: [PATCH V2] blk-cgroup: fix UAF in __blkcg_rstat_flush()
From: Jose Fernandez (Anthropic) @ 2026-06-20 23:59 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Michal Koutný, stable, Jay Shin,
Tejun Heo, Waiman Long, coregee2000
In-Reply-To: <20260205155425.342084-1-ming.lei@redhat.com>
On Thu, 5 Feb 2026 23:54:23 +0800, Ming Lei wrote:
> Move the flush from __blkg_release() (RCU callback) to blkg_release()
> (before call_rcu). This ensures the RCU grace period waits for any
> concurrent flush's rcu_read_lock() section to complete before freeing.
We started seeing this in the wild on a 6.18.35-based kernel as a NULL
pointer dereference rather than a KASAN report. The freed blkg /
percpu iostat slot gets reallocated and zeroed before the concurrent
flusher reaches it, so bisc->blkg reads back as NULL:
BUG: kernel NULL pointer dereference, address: 0000000000000030
#PF: supervisor read access in kernel mode
RIP: 0010:__blkcg_rstat_flush.isra.0+0x8d/0x1c0
Code: ... 48 8b 1a 4c 8d 78 f8 31 c0 f3 48 ab <4c> 8b 73 30 ...
RBX: 0000000000000000
Call Trace:
<IRQ>
__blkg_release+0x2d/0xf0
rcu_do_batch+0x1b8/0x570
rcu_core+0x167/0x350
handle_softirqs+0xda/0x330
The workload is container-heavy with frequent block-device add/remove,
so multiple blkgs in the same blkcg routinely hit blkg_release()
concurrently on different CPUs.
I can reproduce reliably under KASAN by inserting a udelay(2000)
between llist_del_all() and raw_spin_lock_irqsave() in
__blkcg_rstat_flush(), then driving direct I/O to N loop devices from
one cgroup followed by parallel LOOP_CTL_REMOVE on each device. KASAN
reports slab-use-after-free in __blkcg_rstat_flush() with the expected
alloc=blkg_alloc / free=blkg_free_workfn stacks.
With this patch applied on top of the same udelay-widened tree, the
same harness runs 150 rounds clean.
This doesn't appear to have been picked up after V1 was dropped; would
be good to get it queued.
Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
* Re: [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: yu kuai @ 2026-06-20 18:29 UTC (permalink / raw)
To: Zizhi Wo, axboe, tj, josef, linux-block
Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260616011746.2451461-1-wozizhi@huaweicloud.com>
On 2026/6/16 9:17, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
>
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
>
> BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
> Call Trace:
> ...
> blkcg_deactivate_policy+0x244/0x4d0
> ioc_rqos_exit+0x44/0xe0
> rq_qos_exit+0xba/0x120
> __del_gendisk+0x50b/0x800
> del_gendisk+0xff/0x190
> ...
>
> [CAUSE]
> process1                                process2
> cgroup_rmdir
> ...
> css_killed_work_fn
> offline_css
> ...
> blkcg_destroy_blkgs
> ...
> __blkg_release
> css_put(&blkg->blkcg->css)
> blkg_free
> INIT_WORK(xxx, blkg_free_workfn)
> schedule_work
> css_put
> ...
> blkcg_css_free
> kfree(blkcg)--------blkcg has been freed!!!
> ====================================schedule_work
> blkg_free_workfn
>                                         __del_gendisk
>                                         rq_qos_exit
>                                         ioc_rqos_exit
>                                         blkcg_deactivate_policy
>                                         mutex_lock(&q->blkcg_mutex)
>                                         spin_lock_irq(&q->queue_lock)
>                                         list_for_each_entry(blkg, xxx)
>                                         blkcg = blkg->blkcg
>                                         spin_lock(&blkcg->lock)-------UAF!!!
> mutex_lock(&q->blkcg_mutex)
> spin_lock_irq(&q->queue_lock)
> /* Only then is the blkg removed from the list */
> list_del_init(&blkg->q_node)
>
> As a result, a blkg can still be reachable through q->blkg_list while
> its ->blkcg has already been freed.
>
> [Fix]
> Fix this by deferring the blkcg css_put() until after the blkg has been
> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
> blkcg outlives every blkg still reachable through q->blkg_list, so any
> iterator holding q->queue_lock is guaranteed to observe a valid
> blkg->blkcg.
>
> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
> so that the css reference is owned by the alloc/free pair rather than
> straddling layers:
> blkg_alloc() <-> blkg_free()
> blkg_create() <-> blkg_destroy()
>
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> Reviewed-by: Yu Kuai <yukuai@fygo.io>
> ---
> v3:
> - move css_put() after mutex_unlock() in blkg_free_workfn().
>
> v2:
> - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
> css reference follows the blkg's own lifetime, making the put in
> blkg_free_workfn() symmetric with the get in blkg_alloc().
>
> v1:https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
> block/blk-cgroup.c | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)
Reviewed-by: Yu Kuai <yukuai@fygo.io>
--
Thanks,
Kuai
* Re: [PATCH v4] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Tetsuo Handa @ 2026-06-20 9:42 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
Ming Lei, linux-block, LKML, Andrew Morton, Linus Torvalds,
linux-btrfs, David Sterba, linux-fsdevel, Christian Brauner,
Hillf Danton
In-Reply-To: <20260620073939.GF2636677@ZenIV>
On 2026/06/20 16:39, Al Viro wrote:
> On Fri, Jun 19, 2026 at 11:33:11PM +0900, Tetsuo Handa wrote:
>> Sending this commit to linux.git will be the fastest way to identify who is issuing
>> I/O requests too late. Therefore, I want to get a conclusion on xfs/259 breakage.
>> Al, can you get the same result?
>
> Not a peep in the logs, breakage still there (with cherry-picked fb1d5846e99c8aa4ce
> and CONFIG_KCOV enabled, that is).
Please test with the debug printk() patch shown below. What messages do you get?
--------------------------------------------------------------------------------
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3b607a3ddc4..7408f314a1fa 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1763,6 +1763,8 @@ static void lo_release(struct gendisk *disk)
mutex_unlock(&lo->lo_mutex);
if (need_clear) {
+ printk("Flush: task=%s[%d] dev=loop%d state=%d\n",
+ current->comm, current->pid, lo->lo_number, lo->lo_state);
/*
* Temporarily release disk->open_mutex in order to flush pending I/O
* requests before clearing the backing device.
@@ -1813,6 +1815,8 @@ static void lo_release(struct gendisk *disk)
mutex_lock(&lo->lo_disk->open_mutex);
if (WARN_ON(data_race(READ_ONCE(lo->lo_state)) != Lo_rundown))
return;
+ printk("Teardown: task=%s[%d] dev=loop%d state=%d\n",
+ current->comm, current->pid, lo->lo_number, lo->lo_state);
__loop_clr_fd(lo);
}
}
diff --git a/fs/namespace.c b/fs/namespace.c
index 09ab7fc72f86..9710460fb449 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1893,6 +1893,8 @@ static int do_umount(struct mount *mnt, int flags)
*/
lock_mount_hash();
if (!list_empty(&mnt->mnt_mounts) || mnt_get_count(mnt) != 2) {
+ printk("%s: task=%s[%d] !list_empty(&mnt->mnt_mounts)=%d mnt_get_count(mnt)=%d\n", __func__,
+ current->comm, current->pid, !list_empty(&mnt->mnt_mounts), mnt_get_count(mnt));
unlock_mount_hash();
return -EBUSY;
}
@@ -1960,6 +1962,9 @@ static int do_umount(struct mount *mnt, int flags)
if (!propagate_mount_busy(mnt, 2)) {
umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
retval = 0;
+ } else {
+ printk("%s: task=%s[%d] propagate_mount_busy()!=0\n", __func__,
+ current->comm, current->pid);
}
}
out:
--------------------------------------------------------------------------------
* Re: [PATCH v4] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Al Viro @ 2026-06-20 7:39 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
Ming Lei, linux-block, LKML, Andrew Morton, Linus Torvalds,
linux-btrfs, David Sterba, linux-fsdevel, Christian Brauner,
Hillf Danton
In-Reply-To: <a9254bf0-fc7a-4b2e-a62f-064e71016fb6@I-love.SAKURA.ne.jp>
On Fri, Jun 19, 2026 at 11:33:11PM +0900, Tetsuo Handa wrote:
> Sending this commit to linux.git will be the fastest way to identify who is issuing
> I/O requests too late. Therefore, I want to get a conclusion on xfs/259 breakage.
> Al, can you get the same result?
Not a peep in the logs, breakage still there (with cherry-picked fb1d5846e99c8aa4ce
and CONFIG_KCOV enabled, that is).
* Re: [PATCH blktests] Fix _get_page_size()
From: Bart Van Assche @ 2026-06-20 7:11 UTC (permalink / raw)
To: Shin'ichiro Kawasaki; +Cc: Jeff Moyer, linux-block, osandov, kch
In-Reply-To: <ajYabLMbEo6zyOWh@shinmob>
On 6/20/26 6:51 AM, Shin'ichiro Kawasaki wrote:
> On Jun 20, 2026 / 05:55, Bart Van Assche wrote:
>> On 6/20/26 3:26 AM, Shin'ichiro Kawasaki wrote:
>>> This is a rather fundamental change, so I would like to ask opinions from
>>> other blktests users, especially Omar and Chaitanya. What do you think about
>>> the idea to add getconf to the requirement list?
>>
>> CONFIG_PAGE_SHIFT was introduced in the Linux kernel in February 2024
>> (commit ba89f9c8ccba ("arch: consolidate existing CONFIG_PAGE_SIZE_*KB
>> definitions")). Older kernels had CONFIG_PAGE_SIZE_4KB,
>> CONFIG_PAGE_SIZE_16KB, etc. This means that it is possible to derive the
>> kernel page size from the kernel configuration file for all upstream and
>> distro kernels, isn't it?
>
> I checked, and the commit is in tag v6.9. My Debian bookworm system runs kernel
> v6.1, so the config file in /boot does not have CONFIG_PAGE_SHIFT, as expected.
> But it does not have CONFIG_PAGE_SIZE_* either... I'm still afraid that the
> kernel config file approach is not reliable.
Right, for older kernels CONFIG_PAGE_SIZE_*KB is only available for some
but not for all supported architectures.
It is not clear to me where the desire to avoid the dependency on
getconf comes from. As far as I know, it is available on all Linux
distros. Since it is typically included in the C library package, it
should not introduce a new dependency.
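For comparison, the same runtime value can be read from C through the
POSIX sysconf() interface. The sketch below is illustrative only and is
not part of blktests; runtime_page_size() is a hypothetical helper name,
and the power-of-two check is a sanity assumption rather than something
POSIX spells out.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the runtime page size in bytes, or -1 if sysconf() fails. */
long runtime_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	if (sz <= 0)
		return -1;

	/* Sanity assumption: page sizes are powers of two. */
	assert((sz & (sz - 1)) == 0);
	return sz;
}
```

Unlike parsing a kernel config file, this asks the running system
directly, so it does not depend on which CONFIG_PAGE_* symbols a given
kernel version exposes.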
Thanks,
Bart.
* Re: [PATCH blktests] Fix _get_page_size()
From: Shin'ichiro Kawasaki @ 2026-06-20 4:51 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Jeff Moyer, linux-block, osandov, kch
In-Reply-To: <d0432702-ac0b-410e-9586-2cb9be079033@acm.org>
On Jun 20, 2026 / 05:55, Bart Van Assche wrote:
> On 6/20/26 3:26 AM, Shin'ichiro Kawasaki wrote:
> > This is a rather fundamental change, so I would like to ask opinions from
> > other blktests users, especially Omar and Chaitanya. What do you think about
> > the idea to add getconf to the requirement list?
>
> CONFIG_PAGE_SHIFT was introduced in the Linux kernel in February 2024
> (commit ba89f9c8ccba ("arch: consolidate existing CONFIG_PAGE_SIZE_*KB
> definitions")). Older kernels had CONFIG_PAGE_SIZE_4KB,
> CONFIG_PAGE_SIZE_16KB, etc. This means that it is possible to derive the
> kernel page size from the kernel configuration file for all upstream and
> distro kernels, isn't it?
I checked, and the commit is in tag v6.9. My Debian bookworm system runs kernel
v6.1, so the config file in /boot does not have CONFIG_PAGE_SHIFT, as expected.
But it does not have CONFIG_PAGE_SIZE_* either... I'm still afraid that the
kernel config file approach is not reliable.
$ uname -a
Linux testnode3 6.1.0-49-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.174-1 (2026-05-26) x86_64 GNU/Linux
$ grep PAGE_S /boot/config-6.1.0-49-amd64
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
* Re: [PATCH blktests] Fix _get_page_size()
From: Bart Van Assche @ 2026-06-20 3:55 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, Jeff Moyer; +Cc: linux-block, osandov, kch
In-Reply-To: <ajXmBu9lDZwgMG7_@shinmob>
On 6/20/26 3:26 AM, Shin'ichiro Kawasaki wrote:
> This is a rather fundamental change, so I would like to ask opinions from
> other blktests users, especially Omar and Chaitanya. What do you think about
> the idea to add getconf to the requirement list?
CONFIG_PAGE_SHIFT was introduced in the Linux kernel in February 2024
(commit ba89f9c8ccba ("arch: consolidate existing CONFIG_PAGE_SIZE_*KB
definitions")). Older kernels had CONFIG_PAGE_SIZE_4KB,
CONFIG_PAGE_SIZE_16KB, etc. This means that it is possible to derive the
kernel page size from the kernel configuration file for all upstream and
distro kernels, isn't it?
Thanks,
Bart.