Linux virtualization list


* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: David Laight @ 2026-06-22  8:42 UTC
  To: Kaitao Cheng
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, linux-kernel,
	cgroups, linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf,
	netdev, dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026 12:05:31 +0800
Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

I'm not really sure 'mutable' means anything either.
It is possible to make it valid for the loop body (or even other threads)
to delete arbitrary list items - but that needs significant extra overheads.

It might be worth doing something that doesn't need the extra variable,
but there is little point doing all the churn just to rename things.

> 
> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.
> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\

Use auto

> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)

The variable argument count logic really just slows down compilation.
Maybe there aren't enough copies of this code to make that significant.
But just because you can do it doesn't mean it is a good idea.
I'm also not sure it really adds anything to the readability.

And, if you are going to make the middle argument optional there is
no need to change the macro name.

	David
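
For readers following the argument-count discussion, here is a minimal
userspace sketch of how one macro name can dispatch to a one-argument or a
two-argument tail. It is not the patch's code: the sk_ prefix, the local
struct list_head stand-in and the __LINE__-based cursor name are assumptions
made for illustration, where the quoted hunk uses COUNT_ARGS(), CONCATENATE()
and __UNIQUE_ID().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

static inline void sk_list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static inline void sk_list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static inline void sk_list_del(struct list_head *e)
{
	assert(e->next && e->prev);	/* entry must still be linked */
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;	/* make use-after-unlink obvious */
}

/* Hypothetical helpers: count one or two trailing arguments, paste tokens. */
#define SK_COUNT_ARGS_(_1, _2, n, ...)	n
#define SK_COUNT_ARGS(...)		SK_COUNT_ARGS_(__VA_ARGS__, 2, 1, 0)
#define SK_CONCAT_(a, b)		a##b
#define SK_CONCAT(a, b)			SK_CONCAT_(a, b)

/* Two trailing arguments: the caller owns the temporary cursor. */
#define sk_for_each_mutable2(pos, tmp, head)				\
	for (pos = (head)->next, tmp = pos->next;			\
	     pos != (head);						\
	     pos = tmp, tmp = pos->next)

/* One trailing argument: the cursor is declared inside the for statement. */
#define sk_for_each_mutable_internal(pos, tmp, head)			\
	for (struct list_head *tmp = (pos = (head)->next)->next;	\
	     pos != (head);						\
	     pos = tmp, tmp = pos->next)

#define sk_for_each_mutable1(pos, head)					\
	sk_for_each_mutable_internal(pos, SK_CONCAT(sk_tmp_, __LINE__), head)

/* Dispatch on the number of arguments after pos. */
#define sk_for_each_mutable(pos, ...)					\
	SK_CONCAT(sk_for_each_mutable, SK_COUNT_ARGS(__VA_ARGS__))	\
		(pos, __VA_ARGS__)

/* Unlink every entry with the short form; returns the number removed. */
static int sk_drain(struct list_head *head)
{
	struct list_head *pos;
	int n = 0;

	sk_for_each_mutable(pos, head) {
		sk_list_del(pos);	/* fine: the next pointer was saved first */
		n++;
	}
	assert(head->next == head);	/* list is empty afterwards */
	return n;
}

/* Same walk with an explicit temporary cursor. */
static int sk_drain_explicit(struct list_head *head)
{
	struct list_head *pos, *tmp;
	int n = 0;

	sk_for_each_mutable(pos, tmp, head) {
		sk_list_del(pos);
		n++;
	}
	return n;
}
```

Both helpers expand to the same loop shape; the only difference is whether
the caller or the macro owns the temporary cursor.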




* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Jani Nikula @ 2026-06-22  8:37 UTC
  To: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, chengkaitao
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> Add *_mutable() iterator variants for list, hlist and llist.  The new
> helpers are variadic and support both forms.  In the common case, the
> caller omits the temporary cursor and the macro creates a unique internal
> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
> explicit temporary cursor, the caller can still pass it and the helper
> keeps the existing *_safe() behaviour.
>
> For example, a call site may use the shorter form:
>
>   list_for_each_entry_mutable(pos, head, member)
>
> or keep the explicit temporary cursor form:
>
>   list_for_each_entry_mutable(pos, tmp, head, member)

I'm unconvinced it's a good idea to allow two forms with macro trickery,
*especially* when it's not the last argument you can omit. I think it's
a footgun.

IMO stick with the first form only, and there'll always be the _safe
variant that can be used when the temp pointer is needed.


BR,
Jani.


-- 
Jani Nikula, Intel
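
As an illustration of the short form on its own, here is a minimal userspace
sketch of an entry iterator that keeps the saved-next cursor inside the for
statement. All sk_ names and struct item are hypothetical, the list type is a
local stand-in, and the sketch assumes GCC or Clang for __typeof__; it is not
the code from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

struct item {
	int id;
	struct list_head node;
};

static inline void sk_list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static inline void sk_list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static inline void sk_list_del(struct list_head *e)
{
	assert(e->next && e->prev);	/* entry must still be linked */
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;
}

/* Map a pointer to an embedded member back to its containing object. */
#define sk_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Short form only: sk_next_ caches the following entry before the body
 * runs, so the body may unlink the current entry. The cursor is scoped to
 * the loop and never visible to the caller.
 */
#define sk_for_each_entry_mutable(pos, head, member)			\
	for (__typeof__(pos) sk_next_ =					\
		(pos = sk_entry((head)->next, __typeof__(*pos), member),\
		 sk_entry(pos->member.next, __typeof__(*pos), member));	\
	     &pos->member != (head);					\
	     pos = sk_next_,						\
	     sk_next_ = sk_entry(sk_next_->member.next,			\
				 __typeof__(*pos), member))

/* Unlink all items with an even id; returns the number of items left. */
static int sk_remove_even(struct list_head *head)
{
	struct item *it;
	int left = 0;

	sk_for_each_entry_mutable(it, head, node) {
		if (it->id % 2 == 0)
			sk_list_del(&it->node);
		else
			left++;
	}
	return left;
}
```

A loop that has to inspect or reset the saved cursor cannot use this form,
which is the case the existing _safe() helpers would continue to cover.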


* Re: [PATCH 0/2] tools: Fix tools/virtio test build
From: Yichong Chen @ 2026-06-22  7:01 UTC
  To: mst
  Cc: akpm, chenyichong, eperezma, jasowang, linux-kernel, ljs, rppt,
	virtualization, xuanzhuo
In-Reply-To: <20260618080405-mutt-send-email-mst@kernel.org>

Hi Michael,

I checked the history again. The tree was based on v7.1, but I had
unrelated local commits on top when I generated the series.

I rechecked this on a clean v7.1 tree:

  8cd9520d35a6 ("Linux 7.1")

The build failure is still reproducible with:

  make -C tools/virtio test

The first failure is:

  include/linux/virtio.h:10:10: fatal error:
  linux/mod_devicetable.h: No such file or directory

The two patches also apply cleanly on top of v7.1 and make
the tools/virtio test target build virtio_test, vringh_test and
vhost_net_test.

I can resend a v2 based on v7.1 with a base-commit if you prefer.

Thanks,
Yichong


* [PATCH v2] crypto: virtio - bound the akcipher result length
From: Bryam Vargas via B4 Relay @ 2026-06-22  6:52 UTC
  To: Michael S. Tsirkin, Gonglei, Jason Wang, Herbert Xu
  Cc: linux-kernel, virtualization, Xuan Zhuo, David S. Miller,
	Eugenio Pérez, linux-crypto

From: Bryam Vargas <hexlabsecurity@proton.me>

virtio_crypto_dataq_akcipher_callback() sets the result length from the
device-reported response length without bounding it to the destination
buffer, which was allocated for the original request length.
sg_copy_from_buffer() then reads that many bytes from the destination
buffer; a backend reporting a larger length over-reads adjacent kernel
heap into the caller's scatterlist (an out-of-bounds read).

Clamp the reported length to the originally requested destination length.
A conforming device reports no more than that, so valid results are
unaffected.

Fixes: a36bd0ad9fbf ("virtio-crypto: adjust dst_len at ops callback")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
v2: Fix the Subject line, mangled in v1 - an over-long subject was wrapped
    and its trailing word leaked into the commit body. No functional change.

Link to v1: https://lore.kernel.org/all/20260620-b4-disp-27caeeac-v1-1-956e8f9c4f01@proton.me/
---
 drivers/crypto/virtio/virtio_crypto_akcipher_algs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
index d8d452cac391..64ea141f018c 100644
--- a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
@@ -88,7 +88,8 @@ static void virtio_crypto_dataq_akcipher_callback(struct virtio_crypto_request *
 	}
 
 	/* actual length may be less than dst buffer */
-	akcipher_req->dst_len = len - sizeof(vc_req->status);
+	akcipher_req->dst_len = min_t(unsigned int, len - sizeof(vc_req->status),
+				      akcipher_req->dst_len);
 	sg_copy_from_buffer(akcipher_req->dst, sg_nents(akcipher_req->dst),
 			    vc_akcipher_req->dst_buf, akcipher_req->dst_len);
 	virtio_crypto_akcipher_finalize_req(vc_akcipher_req, akcipher_req, error);

---
base-commit: 1a3746ccbb0a97bed3c06ccde6b880013b1dddc1
change-id: 20260622-b4-disp-3a2c09a8-5ab0e3e3fc23

Best regards,
-- 
Bryam Vargas <hexlabsecurity@proton.me>
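
The clamp can be illustrated outside the driver with a small userspace
sketch. sk_bounded_len() is a hypothetical helper, not a kernel function, and
its parameter names are assumptions; the early return for a response shorter
than the status field is an extra of this sketch and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bound a device-reported response length by the length the caller
 * originally requested (and for which the buffer was allocated).
 *
 * reported_len:  total length reported by the device, status included
 * status_len:    size of the trailing status field
 * requested_len: destination length of the original request
 */
static size_t sk_bounded_len(size_t reported_len, size_t status_len,
			     size_t requested_len)
{
	size_t payload;

	/* Sketch-only guard: avoid wrapping the unsigned subtraction. */
	if (reported_len < status_len)
		return 0;

	payload = reported_len - status_len;

	/* Never report more than the buffer was sized for. */
	return payload < requested_len ? payload : requested_len;
}
```

A conforming response passes through unchanged, while an oversized one is cut
back to the requested length before any copy takes place.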




* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-22  6:15 UTC
  To: Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao, Muchun Song
In-Reply-To: <CAADnVQJmPWFT01b7DuLdtafv=8FyB84GYHNZ8zSTck+9Aw0JpA@mail.gmail.com>



On 2026/6/22 13:28, Alexei Starovoitov wrote:
> On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>
>> From: chengkaitao <chengkaitao@kylinos.cn>
>>
>> The list_for_each*_safe() helpers are used when the loop body may remove
>> the current entry.  Their current interface, however, forces every caller
>> to define a temporary cursor outside the macro and pass it in, even when
>> the caller never uses that cursor directly.  For most call sites this
>> extra cursor is just boilerplate required by the macro implementation.
>>
>> This is awkward because the saved next pointer is an internal detail of
>> the iteration.  Callers that only remove or move the current entry do not
>> need to spell it out.
>>
>> The _safe() suffix has also caused confusion.  Christian Koenig pointed
>> out that the name is easy to read as a thread-safe variant, especially
>> for beginners, even though it only means that the iterator keeps enough
>> state to tolerate removal of the current entry.  He suggested _mutable()
>> as a clearer description of what the loop permits.
>>
>> Add *_mutable() iterator variants for list, hlist and llist.  The new
>> helpers are variadic and support both forms.  In the common case, the
>> caller omits the temporary cursor and the macro creates a unique internal
>> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
>> explicit temporary cursor, the caller can still pass it and the helper
>> keeps the existing *_safe() behaviour.
>>
>> For example, a call site may use the shorter form:
>>
>>   list_for_each_entry_mutable(pos, head, member)
>>
>> or keep the explicit temporary cursor form:
>>
>>   list_for_each_entry_mutable(pos, tmp, head, member)
>>
>> The existing *_safe() helpers remain available for compatibility.  This
>> series only converts users in mm, block, kernel, init and io_uring.  If
>> this approach looks acceptable, the remaining users can be converted in
>> follow-up series.
>>
>> Changes in v3 (Christian König, Andy Shevchenko):
>> - Convert safe list walks to mutable iterators
>>
>> Changes in v2 (Muchun Song, Andy Shevchenko):
>> - Drop the list_for_each_entry_mutable*() helpers from v1 and make the
>>   cursor change directly in the existing list_for_each_entry*() helpers.
>> - Open-code special list walks that rely on updating the loop cursor in
>>   the body, preserving their existing traversal semantics.
>>
>> Link to v2:
>> https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/
>>
>> Link to v1:
>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>
>> Kaitao Cheng (7):
>>   list: Add mutable iterator variants
>>   llist: Add mutable iterator variants
>>   mm: Use mutable list iterators
>>   block: Use mutable list iterators
>>   kernel: Use mutable list iterators
>>   initramfs: Use mutable list iterator
>>   io_uring: Use mutable list iterators
>>
>>  block/bfq-iosched.c                 |  17 +-
>>  block/blk-cgroup.c                  |  12 +-
>>  block/blk-flush.c                   |   4 +-
>>  block/blk-iocost.c                  |  18 +-
>>  block/blk-mq.c                      |   8 +-
>>  block/blk-throttle.c                |   4 +-
>>  block/kyber-iosched.c               |   4 +-
>>  block/partitions/ldm.c              |   8 +-
>>  block/sed-opal.c                    |   4 +-
>>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
>>  include/linux/llist.h               |  81 +++++++--
>>  init/initramfs.c                    |   5 +-
>>  io_uring/cancel.c                   |   6 +-
>>  io_uring/poll.c                     |   3 +-
>>  io_uring/rw.c                       |   4 +-
>>  io_uring/timeout.c                  |   8 +-
>>  io_uring/uring_cmd.c                |   3 +-
>>  kernel/audit_tree.c                 |   4 +-
>>  kernel/audit_watch.c                |  16 +-
>>  kernel/auditfilter.c                |   4 +-
>>  kernel/auditsc.c                    |   4 +-
>>  kernel/bpf/arena.c                  |  10 +-
>>  kernel/bpf/arraymap.c               |   8 +-
>>  kernel/bpf/bpf_local_storage.c      |   3 +-
>>  kernel/bpf/bpf_lru_list.c           |  25 ++-
>>  kernel/bpf/btf.c                    |  18 +-
>>  kernel/bpf/cgroup.c                 |   7 +-
>>  kernel/bpf/cpumap.c                 |   4 +-
>>  kernel/bpf/devmap.c                 |  10 +-
>>  kernel/bpf/helpers.c                |   8 +-
>>  kernel/bpf/local_storage.c          |   4 +-
>>  kernel/bpf/memalloc.c               |  16 +-
>>  kernel/bpf/offload.c                |   8 +-
>>  kernel/bpf/states.c                 |   4 +-
>>  kernel/bpf/stream.c                 |   4 +-
>>  kernel/bpf/verifier.c               |   6 +-
>>  kernel/cgroup/cgroup-v1.c           |   4 +-
>>  kernel/cgroup/cgroup.c              |  54 +++---
>>  kernel/cgroup/dmem.c                |  12 +-
>>  kernel/cgroup/rdma.c                |   8 +-
>>  kernel/events/core.c                |  44 +++--
>>  kernel/events/uprobes.c             |  12 +-
>>  kernel/exit.c                       |   8 +-
>>  kernel/fail_function.c              |   4 +-
>>  kernel/gcov/clang.c                 |   4 +-
>>  kernel/irq_work.c                   |   4 +-
>>  kernel/kexec_core.c                 |   4 +-
>>  kernel/kprobes.c                    |  16 +-
>>  kernel/livepatch/core.c             |   4 +-
>>  kernel/livepatch/core.h             |   4 +-
>>  kernel/liveupdate/kho_block.c       |   4 +-
>>  kernel/liveupdate/luo_flb.c         |   4 +-
>>  kernel/locking/rwsem.c              |   2 +-
>>  kernel/locking/test-ww_mutex.c      |   2 +-
>>  kernel/module/main.c                |  11 +-
>>  kernel/padata.c                     |   4 +-
>>  kernel/power/snapshot.c             |   8 +-
>>  kernel/power/wakelock.c             |   4 +-
>>  kernel/printk/printk.c              |  11 +-
>>  kernel/ptrace.c                     |   4 +-
>>  kernel/rcu/rcutorture.c             |   3 +-
>>  kernel/rcu/tasks.h                  |   9 +-
>>  kernel/rcu/tree.c                   |   6 +-
>>  kernel/resource.c                   |   4 +-
>>  kernel/sched/core.c                 |   4 +-
>>  kernel/sched/ext.c                  |  22 +--
>>  kernel/sched/fair.c                 |  28 +--
>>  kernel/sched/topology.c             |   4 +-
>>  kernel/sched/wait.c                 |   4 +-
>>  kernel/seccomp.c                    |   4 +-
>>  kernel/signal.c                     |  11 +-
>>  kernel/smp.c                        |   4 +-
>>  kernel/taskstats.c                  |   8 +-
>>  kernel/time/clockevents.c           |   6 +-
>>  kernel/time/clocksource.c           |   4 +-
>>  kernel/time/posix-cpu-timers.c      |   4 +-
>>  kernel/time/posix-timers.c          |   3 +-
>>  kernel/torture.c                    |   3 +-
>>  kernel/trace/bpf_trace.c            |   4 +-
>>  kernel/trace/ftrace.c               |  49 +++--
>>  kernel/trace/ring_buffer.c          |  25 ++-
>>  kernel/trace/trace.c                |  12 +-
>>  kernel/trace/trace_dynevent.c       |   6 +-
>>  kernel/trace/trace_dynevent.h       |   5 +-
>>  kernel/trace/trace_events.c         |  35 ++--
>>  kernel/trace/trace_events_filter.c  |   4 +-
>>  kernel/trace/trace_events_hist.c    |   8 +-
>>  kernel/trace/trace_events_trigger.c |  17 +-
>>  kernel/trace/trace_events_user.c    |  16 +-
>>  kernel/trace/trace_stat.c           |   4 +-
>>  kernel/user-return-notifier.c       |   3 +-
>>  kernel/workqueue.c                  |  16 +-
>>  mm/backing-dev.c                    |   8 +-
>>  mm/balloon.c                        |   8 +-
>>  mm/cma.c                            |   4 +-
>>  mm/compaction.c                     |   4 +-
>>  mm/damon/core.c                     |   4 +-
>>  mm/damon/sysfs-schemes.c            |   4 +-
>>  mm/dmapool.c                        |   4 +-
>>  mm/huge_memory.c                    |   8 +-
>>  mm/hugetlb.c                        |  56 +++---
>>  mm/hugetlb_vmemmap.c                |  16 +-
>>  mm/khugepaged.c                     |  14 +-
>>  mm/kmemleak.c                       |   7 +-
>>  mm/ksm.c                            |  25 +--
>>  mm/list_lru.c                       |   4 +-
>>  mm/memcontrol-v1.c                  |   8 +-
>>  mm/memory-failure.c                 |  12 +-
>>  mm/memory-tiers.c                   |   4 +-
>>  mm/migrate.c                        |  23 ++-
>>  mm/mmu_notifier.c                   |   9 +-
>>  mm/page_alloc.c                     |   8 +-
>>  mm/page_reporting.c                 |   2 +-
>>  mm/percpu.c                         |  11 +-
>>  mm/pgtable-generic.c                |   4 +-
>>  mm/rmap.c                           |  10 +-
>>  mm/shmem.c                          |   9 +-
>>  mm/slab_common.c                    |  14 +-
>>  mm/slub.c                           |  33 ++--
>>  mm/swapfile.c                       |   4 +-
>>  mm/userfaultfd.c                    |  12 +-
>>  mm/vmalloc.c                        |  24 +--
>>  mm/vmscan.c                         |   7 +-
>>  mm/zsmalloc.c                       |   4 +-
>>  124 files changed, 875 insertions(+), 681 deletions(-)
> 
> Not sure what you were thinking, but this diff stat
> is not landable.

[PATCH v3 1/7] and [PATCH v3 2/7] contain the main logic and can
be merged directly. They are also compatible with the old API.
[PATCH v3 3/7] through [PATCH v3 7/7] are just simple interface
replacements and do not change any functional logic. They can be
left unmerged for now; individual modules can pick them up later
if needed.

In v2, Andy Shevchenko mentioned: "If it's done by Linus himself
during the day when he prepares -rc1, it's fine." Even so, the
changes in this patch series are indeed quite large and touch
almost every subsystem. I have only converted part of them for
now, so I wanted to send this out first and see what people think.

-- 
Thanks
Kaitao Cheng



* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Alexei Starovoitov @ 2026-06-22  5:28 UTC
  To: Kaitao Cheng
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>
> From: chengkaitao <chengkaitao@kylinos.cn>
>
> The list_for_each*_safe() helpers are used when the loop body may remove
> the current entry.  Their current interface, however, forces every caller
> to define a temporary cursor outside the macro and pass it in, even when
> the caller never uses that cursor directly.  For most call sites this
> extra cursor is just boilerplate required by the macro implementation.
>
> This is awkward because the saved next pointer is an internal detail of
> the iteration.  Callers that only remove or move the current entry do not
> need to spell it out.
>
> The _safe() suffix has also caused confusion.  Christian Koenig pointed
> out that the name is easy to read as a thread-safe variant, especially
> for beginners, even though it only means that the iterator keeps enough
> state to tolerate removal of the current entry.  He suggested _mutable()
> as a clearer description of what the loop permits.
>
> Add *_mutable() iterator variants for list, hlist and llist.  The new
> helpers are variadic and support both forms.  In the common case, the
> caller omits the temporary cursor and the macro creates a unique internal
> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
> explicit temporary cursor, the caller can still pass it and the helper
> keeps the existing *_safe() behaviour.
>
> For example, a call site may use the shorter form:
>
>   list_for_each_entry_mutable(pos, head, member)
>
> or keep the explicit temporary cursor form:
>
>   list_for_each_entry_mutable(pos, tmp, head, member)
>
> The existing *_safe() helpers remain available for compatibility.  This
> series only converts users in mm, block, kernel, init and io_uring.  If
> this approach looks acceptable, the remaining users can be converted in
> follow-up series.
>
> Changes in v3 (Christian König, Andy Shevchenko):
> - Convert safe list walks to mutable iterators
>
> Changes in v2 (Muchun Song, Andy Shevchenko):
> - Drop the list_for_each_entry_mutable*() helpers from v1 and make the
>   cursor change directly in the existing list_for_each_entry*() helpers.
> - Open-code special list walks that rely on updating the loop cursor in
>   the body, preserving their existing traversal semantics.
>
> Link to v2:
> https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/
>
> Link to v1:
> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>
> Kaitao Cheng (7):
>   list: Add mutable iterator variants
>   llist: Add mutable iterator variants
>   mm: Use mutable list iterators
>   block: Use mutable list iterators
>   kernel: Use mutable list iterators
>   initramfs: Use mutable list iterator
>   io_uring: Use mutable list iterators
>
>  block/bfq-iosched.c                 |  17 +-
>  block/blk-cgroup.c                  |  12 +-
>  block/blk-flush.c                   |   4 +-
>  block/blk-iocost.c                  |  18 +-
>  block/blk-mq.c                      |   8 +-
>  block/blk-throttle.c                |   4 +-
>  block/kyber-iosched.c               |   4 +-
>  block/partitions/ldm.c              |   8 +-
>  block/sed-opal.c                    |   4 +-
>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
>  include/linux/llist.h               |  81 +++++++--
>  init/initramfs.c                    |   5 +-
>  io_uring/cancel.c                   |   6 +-
>  io_uring/poll.c                     |   3 +-
>  io_uring/rw.c                       |   4 +-
>  io_uring/timeout.c                  |   8 +-
>  io_uring/uring_cmd.c                |   3 +-
>  kernel/audit_tree.c                 |   4 +-
>  kernel/audit_watch.c                |  16 +-
>  kernel/auditfilter.c                |   4 +-
>  kernel/auditsc.c                    |   4 +-
>  kernel/bpf/arena.c                  |  10 +-
>  kernel/bpf/arraymap.c               |   8 +-
>  kernel/bpf/bpf_local_storage.c      |   3 +-
>  kernel/bpf/bpf_lru_list.c           |  25 ++-
>  kernel/bpf/btf.c                    |  18 +-
>  kernel/bpf/cgroup.c                 |   7 +-
>  kernel/bpf/cpumap.c                 |   4 +-
>  kernel/bpf/devmap.c                 |  10 +-
>  kernel/bpf/helpers.c                |   8 +-
>  kernel/bpf/local_storage.c          |   4 +-
>  kernel/bpf/memalloc.c               |  16 +-
>  kernel/bpf/offload.c                |   8 +-
>  kernel/bpf/states.c                 |   4 +-
>  kernel/bpf/stream.c                 |   4 +-
>  kernel/bpf/verifier.c               |   6 +-
>  kernel/cgroup/cgroup-v1.c           |   4 +-
>  kernel/cgroup/cgroup.c              |  54 +++---
>  kernel/cgroup/dmem.c                |  12 +-
>  kernel/cgroup/rdma.c                |   8 +-
>  kernel/events/core.c                |  44 +++--
>  kernel/events/uprobes.c             |  12 +-
>  kernel/exit.c                       |   8 +-
>  kernel/fail_function.c              |   4 +-
>  kernel/gcov/clang.c                 |   4 +-
>  kernel/irq_work.c                   |   4 +-
>  kernel/kexec_core.c                 |   4 +-
>  kernel/kprobes.c                    |  16 +-
>  kernel/livepatch/core.c             |   4 +-
>  kernel/livepatch/core.h             |   4 +-
>  kernel/liveupdate/kho_block.c       |   4 +-
>  kernel/liveupdate/luo_flb.c         |   4 +-
>  kernel/locking/rwsem.c              |   2 +-
>  kernel/locking/test-ww_mutex.c      |   2 +-
>  kernel/module/main.c                |  11 +-
>  kernel/padata.c                     |   4 +-
>  kernel/power/snapshot.c             |   8 +-
>  kernel/power/wakelock.c             |   4 +-
>  kernel/printk/printk.c              |  11 +-
>  kernel/ptrace.c                     |   4 +-
>  kernel/rcu/rcutorture.c             |   3 +-
>  kernel/rcu/tasks.h                  |   9 +-
>  kernel/rcu/tree.c                   |   6 +-
>  kernel/resource.c                   |   4 +-
>  kernel/sched/core.c                 |   4 +-
>  kernel/sched/ext.c                  |  22 +--
>  kernel/sched/fair.c                 |  28 +--
>  kernel/sched/topology.c             |   4 +-
>  kernel/sched/wait.c                 |   4 +-
>  kernel/seccomp.c                    |   4 +-
>  kernel/signal.c                     |  11 +-
>  kernel/smp.c                        |   4 +-
>  kernel/taskstats.c                  |   8 +-
>  kernel/time/clockevents.c           |   6 +-
>  kernel/time/clocksource.c           |   4 +-
>  kernel/time/posix-cpu-timers.c      |   4 +-
>  kernel/time/posix-timers.c          |   3 +-
>  kernel/torture.c                    |   3 +-
>  kernel/trace/bpf_trace.c            |   4 +-
>  kernel/trace/ftrace.c               |  49 +++--
>  kernel/trace/ring_buffer.c          |  25 ++-
>  kernel/trace/trace.c                |  12 +-
>  kernel/trace/trace_dynevent.c       |   6 +-
>  kernel/trace/trace_dynevent.h       |   5 +-
>  kernel/trace/trace_events.c         |  35 ++--
>  kernel/trace/trace_events_filter.c  |   4 +-
>  kernel/trace/trace_events_hist.c    |   8 +-
>  kernel/trace/trace_events_trigger.c |  17 +-
>  kernel/trace/trace_events_user.c    |  16 +-
>  kernel/trace/trace_stat.c           |   4 +-
>  kernel/user-return-notifier.c       |   3 +-
>  kernel/workqueue.c                  |  16 +-
>  mm/backing-dev.c                    |   8 +-
>  mm/balloon.c                        |   8 +-
>  mm/cma.c                            |   4 +-
>  mm/compaction.c                     |   4 +-
>  mm/damon/core.c                     |   4 +-
>  mm/damon/sysfs-schemes.c            |   4 +-
>  mm/dmapool.c                        |   4 +-
>  mm/huge_memory.c                    |   8 +-
>  mm/hugetlb.c                        |  56 +++---
>  mm/hugetlb_vmemmap.c                |  16 +-
>  mm/khugepaged.c                     |  14 +-
>  mm/kmemleak.c                       |   7 +-
>  mm/ksm.c                            |  25 +--
>  mm/list_lru.c                       |   4 +-
>  mm/memcontrol-v1.c                  |   8 +-
>  mm/memory-failure.c                 |  12 +-
>  mm/memory-tiers.c                   |   4 +-
>  mm/migrate.c                        |  23 ++-
>  mm/mmu_notifier.c                   |   9 +-
>  mm/page_alloc.c                     |   8 +-
>  mm/page_reporting.c                 |   2 +-
>  mm/percpu.c                         |  11 +-
>  mm/pgtable-generic.c                |   4 +-
>  mm/rmap.c                           |  10 +-
>  mm/shmem.c                          |   9 +-
>  mm/slab_common.c                    |  14 +-
>  mm/slub.c                           |  33 ++--
>  mm/swapfile.c                       |   4 +-
>  mm/userfaultfd.c                    |  12 +-
>  mm/vmalloc.c                        |  24 +--
>  mm/vmscan.c                         |   7 +-
>  mm/zsmalloc.c                       |   4 +-
>  124 files changed, 875 insertions(+), 681 deletions(-)

Not sure what you were thinking, but this diff stat
is not landable.

pw-bot: cr


* [PATCH v3 3/7] mm: Use mutable list iterators
From: Kaitao Cheng @ 2026-06-22  4:15 UTC
  To: Andrew Morton, David Hildenbrand, SeongJae Park, Muchun Song,
	Oscar Salvador, Catalin Marinas, Dave Chinner, Shakeel Butt,
	Miaohe Lin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Hugh Dickins, Chris Li, Kairui Song, Uladzislau Rezki,
	Minchan Kim, Sergey Senozhatsky
  Cc: linux-mm, linux-kernel, virtualization, damon, cgroups,
	Kaitao Cheng
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

From: Kaitao Cheng <chengkaitao@kylinos.cn>

The safe list iterators expose a temporary cursor at every call site,
even when the cursor is only needed by the iterator itself.  The mutable
iterator variants keep the removal-safe traversal semantics while hiding
that temporary cursor from callers that do not need it.
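
For readers without patch 1/7 at hand, here is a rough userspace sketch of
the idea. The real helper definitions are not part of this message, so the
macro shape below is only an assumption: sketch_for_each_entry_mutable(),
struct item, push(), drop_even() and sum_items() are names invented for the
illustration, and the list primitives are minimal stand-ins for the kernel
ones. It relies on the GNU __typeof__ extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Existing style: the caller declares and passes the lookahead cursor n. */
#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_entry((head)->next, __typeof__(*pos), member),	\
	     n = list_entry(pos->member.next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = n, n = list_entry(n->member.next, __typeof__(*n), member))

/* Hypothetical sketch: same traversal, but the macro owns the cursor. */
#define sketch_for_each_entry_mutable(pos, head, member)		\
	for (__typeof__(pos) sketch_next =				\
		((pos) = list_entry((head)->next, __typeof__(*(pos)), member), \
		 list_entry((pos)->member.next, __typeof__(*(pos)), member)); \
	     &(pos)->member != (head);					\
	     (pos) = sketch_next,					\
	     sketch_next = list_entry(sketch_next->member.next,		\
				      __typeof__(*(pos)), member))

struct item { int value; struct list_head link; };

/* Append a heap-allocated item carrying value to the list. */
static void push(struct list_head *head, int value)
{
	struct item *it = malloc(sizeof(*it));

	assert(it);
	it->value = value;
	list_add_tail(&it->link, head);
}

/* Remove and free every even-valued item; returns how many were removed. */
static int drop_even(struct list_head *head)
{
	struct item *it;	/* no second cursor variable at the call site */
	int removed = 0;

	sketch_for_each_entry_mutable(it, head, link) {
		if (it->value % 2 == 0) {
			list_del(&it->link);
			free(it);
			removed++;
		}
	}
	return removed;
}

/* Sum all values using the existing explicit-cursor form, for comparison. */
static int sum_items(struct list_head *head)
{
	struct item *it, *tmp;
	int sum = 0;

	list_for_each_entry_safe(it, tmp, head, link)
		sum += it->value;
	return sum;
}
```

In this sketch the lookahead pointer is read before the loop body runs, so
freeing the current entry inside the body stays safe, exactly as with the
_safe() form. The only difference at the call site is that drop_even() has
no tmp variable to declare.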

Convert mm users of the list, hlist and llist safe iterators to the new
mutable helpers.  Drop the temporary cursor variables where the loop does
not inspect or reset them, and keep the explicit cursor at the few sites
that rely on it across lock drops or after the loop.
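
To illustrate why some sites have to keep the explicit cursor, here is a
hypothetical userspace sketch (not taken from any of the converted mm
files). unlocked_work() simulates another actor unlinking the successor
while a lock is dropped, so the loop body must refresh the cursor itself,
which is only possible when the cursor is visible to the caller. struct item,
unlocked_work() and sum_with_refresh() are invented names, and the list
primitives are minimal stand-ins for the kernel ones.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_entry((head)->next, __typeof__(*pos), member),	\
	     n = list_entry(pos->member.next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = n, n = list_entry(n->member.next, __typeof__(*n), member))

struct item { int value; struct list_head link; };

/*
 * Stand-in for work done while a lock is dropped: another actor may unlink
 * the successor of cur. Here a negative-valued successor is unlinked and
 * poisoned, roughly as if it had been freed.
 */
static void unlocked_work(struct item *cur, struct list_head *head)
{
	struct list_head *succ = cur->link.next;

	if (succ != head && list_entry(succ, struct item, link)->value < 0) {
		list_del(succ);
		succ->next = succ->prev = NULL;
	}
}

/* Sum the visited items, re-reading the cursor after each "lock drop". */
static int sum_with_refresh(struct list_head *head)
{
	struct item *it, *n;
	int sum = 0;

	list_for_each_entry_safe(it, n, head, link) {
		sum += it->value;
		unlocked_work(it, head);	/* n may now be stale */
		n = list_entry(it->link.next, struct item, link);
	}
	return sum;
}
```

Without the refresh of n after unlocked_work(), the iterator would step onto
the poisoned node. A helper that hides its cursor gives the loop body no way
to perform that reset, which is why such sites keep the explicit form.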

This is a mechanical cleanup with no intended change in traversal order
or list mutation behavior.

Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/backing-dev.c         |  8 +++---
 mm/balloon.c             |  8 +++---
 mm/cma.c                 |  4 +--
 mm/compaction.c          |  4 +--
 mm/damon/core.c          |  4 +--
 mm/damon/sysfs-schemes.c |  4 +--
 mm/dmapool.c             |  4 +--
 mm/huge_memory.c         |  8 +++---
 mm/hugetlb.c             | 56 ++++++++++++++++++++--------------------
 mm/hugetlb_vmemmap.c     | 16 ++++++------
 mm/khugepaged.c          | 14 +++++-----
 mm/kmemleak.c            |  7 +++--
 mm/ksm.c                 | 25 +++++++-----------
 mm/list_lru.c            |  4 +--
 mm/memcontrol-v1.c       |  8 +++---
 mm/memory-failure.c      | 12 ++++-----
 mm/memory-tiers.c        |  4 +--
 mm/migrate.c             | 23 ++++++++---------
 mm/mmu_notifier.c        |  9 +++----
 mm/page_alloc.c          |  8 +++---
 mm/page_reporting.c      |  2 +-
 mm/percpu.c              | 11 ++++----
 mm/pgtable-generic.c     |  4 +--
 mm/rmap.c                | 10 +++----
 mm/shmem.c               |  9 ++++---
 mm/slab_common.c         | 14 +++++-----
 mm/slub.c                | 33 ++++++++++++-----------
 mm/swapfile.c            |  4 +--
 mm/userfaultfd.c         | 12 ++++-----
 mm/vmalloc.c             | 24 ++++++++---------
 mm/vmscan.c              |  7 +++--
 mm/zsmalloc.c            |  4 +--
 32 files changed, 175 insertions(+), 189 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index cecbcf9060a6..944b9cc7a424 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -932,10 +932,10 @@ static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
 void wb_memcg_offline(struct mem_cgroup *memcg)
 {
 	struct list_head *memcg_cgwb_list = &memcg->cgwb_list;
-	struct bdi_writeback *wb, *next;
+	struct bdi_writeback *wb;
 
 	spin_lock_irq(&cgwb_lock);
-	list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
+	list_for_each_entry_mutable(wb, memcg_cgwb_list, memcg_node)
 		cgwb_kill(wb);
 	memcg_cgwb_list->next = NULL;	/* prevent new wb's */
 	spin_unlock_irq(&cgwb_lock);
@@ -951,11 +951,11 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
  */
 void wb_blkcg_offline(struct cgroup_subsys_state *css)
 {
-	struct bdi_writeback *wb, *next;
+	struct bdi_writeback *wb;
 	struct list_head *list = blkcg_get_cgwb_list(css);
 
 	spin_lock_irq(&cgwb_lock);
-	list_for_each_entry_safe(wb, next, list, blkcg_node)
+	list_for_each_entry_mutable(wb, list, blkcg_node)
 		cgwb_kill(wb);
 	list->next = NULL;	/* prevent new wb's */
 	spin_unlock_irq(&cgwb_lock);
diff --git a/mm/balloon.c b/mm/balloon.c
index 96a8f1e20bc6..74a7c411b244 100644
--- a/mm/balloon.c
+++ b/mm/balloon.c
@@ -75,12 +75,12 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
 size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
 				 struct list_head *pages)
 {
-	struct page *page, *tmp;
+	struct page *page;
 	unsigned long flags;
 	size_t n_pages = 0;
 
 	spin_lock_irqsave(&balloon_pages_lock, flags);
-	list_for_each_entry_safe(page, tmp, pages, lru) {
+	list_for_each_entry_mutable(page, pages, lru) {
 		list_del(&page->lru);
 		balloon_page_enqueue_one(b_dev_info, page);
 		n_pages++;
@@ -111,12 +111,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
 size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
 				 struct list_head *pages, size_t n_req_pages)
 {
-	struct page *page, *tmp;
+	struct page *page;
 	unsigned long flags;
 	size_t n_pages = 0;
 
 	spin_lock_irqsave(&balloon_pages_lock, flags);
-	list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
+	list_for_each_entry_mutable(page, &b_dev_info->pages, lru) {
 		if (n_pages == n_req_pages)
 			break;
 		list_del(&page->lru);
diff --git a/mm/cma.c b/mm/cma.c
index a13ce4999b39..2c6543fe530e 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -539,7 +539,7 @@ int __init cma_declare_contiguous_multi(phys_addr_t total_size,
 	struct cma_memrange *cmrp;
 	LIST_HEAD(ranges);
 	LIST_HEAD(final_ranges);
-	struct list_head *mp, *next;
+	struct list_head *mp;
 	int ret, nr = 1;
 	u64 i;
 	struct cma *cma;
@@ -648,7 +648,7 @@ int __init cma_declare_contiguous_multi(phys_addr_t total_size,
 	 * want to mimic a bottom-up memblock allocation.
 	 */
 	sizesum = 0;
-	list_for_each_safe(mp, next, &ranges) {
+	list_for_each_mutable(mp, &ranges) {
 		mlp = list_entry(mp, struct cma_init_memrange, list);
 		list_del(mp);
 		list_insert_sorted(&final_ranges, mlp, basecmp);
diff --git a/mm/compaction.c b/mm/compaction.c
index b776f35ad020..1734f7978983 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -94,9 +94,9 @@ static unsigned long release_free_list(struct list_head *freepages)
 	unsigned long high_pfn = 0;
 
 	for (order = 0; order < NR_PAGE_ORDERS; order++) {
-		struct page *page, *next;
+		struct page *page;
 
-		list_for_each_entry_safe(page, next, &freepages[order], lru) {
+		list_for_each_entry_mutable(page, &freepages[order], lru) {
 			unsigned long pfn = page_to_pfn(page);
 
 			list_del(&page->lru);
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 7e4b9affc5b0..bb1f4466f7af 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -3394,7 +3394,7 @@ static void damon_verify_ctx(struct damon_ctx *c)
  */
 static void kdamond_call(struct damon_ctx *ctx, bool cancel)
 {
-	struct damon_call_control *control, *next;
+	struct damon_call_control *control;
 	LIST_HEAD(controls);
 
 	damon_verify_ctx(ctx);
@@ -3403,7 +3403,7 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel)
 	list_splice_tail_init(&ctx->call_controls, &controls);
 	mutex_unlock(&ctx->call_controls_lock);
 
-	list_for_each_entry_safe(control, next, &controls, list) {
+	list_for_each_entry_mutable(control, &controls, list) {
 		if (!control->repeat || cancel)
 			list_del(&control->list);
 
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 329cfd0bbe9f..701b1947bad4 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -329,9 +329,9 @@ static ssize_t total_bytes_show(struct kobject *kobj,
 static void damon_sysfs_scheme_regions_rm_dirs(
 		struct damon_sysfs_scheme_regions *regions)
 {
-	struct damon_sysfs_scheme_region *r, *next;
+	struct damon_sysfs_scheme_region *r;
 
-	list_for_each_entry_safe(r, next, &regions->regions_list, list) {
+	list_for_each_entry_mutable(r, &regions->regions_list, list) {
 		damos_sysfs_region_rm_dirs(r);
 		list_del(&r->list);
 		kobject_put(&r->kobj);
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 5d8af6e29127..226b505ace23 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -362,7 +362,7 @@ static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
-	struct dma_page *page, *tmp;
+	struct dma_page *page;
 	bool empty, busy = false;
 
 	if (unlikely(!pool))
@@ -382,7 +382,7 @@ void dma_pool_destroy(struct dma_pool *pool)
 		busy = true;
 	}
 
-	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
+	list_for_each_entry_mutable(page, &pool->page_list, page_list) {
 		if (!busy)
 			dma_free_coherent(pool->dev, pool->allocation,
 					  page->vaddr, page->dma);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2bccb0a53a0a..39d604f0876d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -924,9 +924,9 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 
 static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
 {
-	struct thpsize *thpsize, *tmp;
+	struct thpsize *thpsize;
 
-	list_for_each_entry_safe(thpsize, tmp, &thpsize_list, node) {
+	list_for_each_entry_mutable(thpsize, &thpsize_list, node) {
 		list_del(&thpsize->node);
 		kobject_put(&thpsize->kobj);
 	}
@@ -4462,14 +4462,14 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
 	LIST_HEAD(dispose);
-	struct folio *folio, *next;
+	struct folio *folio;
 	int split = 0;
 	unsigned long isolated;
 
 	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
 					    deferred_split_isolate, &dispose);
 
-	list_for_each_entry_safe(folio, next, &dispose, _deferred_list) {
+	list_for_each_entry_mutable(folio, &dispose, _deferred_list) {
 		bool did_split = false;
 		bool underused = false;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 571212b80835..765552a56086 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -598,7 +598,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 	long add = 0;
 	struct list_head *head = &resv->regions;
 	long last_accounted_offset = f;
-	struct file_region *iter, *trg = NULL;
+	struct file_region *iter;
 	struct list_head *rg = NULL;
 
 	if (regions_needed)
@@ -608,7 +608,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 	 * [last_accounted_offset, iter->from), at every iteration, with some
 	 * bounds checking.
 	 */
-	list_for_each_entry_safe(iter, trg, head, link) {
+	list_for_each_entry_mutable(iter, head, link) {
 		/* Skip irrelevant regions that start before our range. */
 		if (iter->from < f) {
 			/* If this region ends after the last accounted offset,
@@ -700,7 +700,7 @@ static int allocate_file_region_entries(struct resv_map *resv,
 	return 0;
 
 out_of_memory:
-	list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+	list_for_each_entry_mutable(rg, &allocated_regions, link) {
 		list_del(&rg->link);
 		kfree(rg);
 	}
@@ -853,13 +853,13 @@ static void region_abort(struct resv_map *resv, long f, long t,
 static long region_del(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *trg;
+	struct file_region *rg;
 	struct file_region *nrg = NULL;
 	long del = 0;
 
 retry:
 	spin_lock(&resv->lock);
-	list_for_each_entry_safe(rg, trg, head, link) {
+	list_for_each_entry_mutable(rg, head, link) {
 		/*
 		 * Skip regions before the range to be deleted.  file_region
 		 * ranges are normally of the form [from, to).  However, there
@@ -1109,13 +1109,13 @@ void resv_map_release(struct kref *ref)
 {
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 	struct list_head *head = &resv_map->region_cache;
-	struct file_region *rg, *trg;
+	struct file_region *rg;
 
 	/* Clear out any active regions before we release the map. */
 	region_del(resv_map, 0, LONG_MAX);
 
 	/* ... and any entries left in the cache */
-	list_for_each_entry_safe(rg, trg, head, link) {
+	list_for_each_entry_mutable(rg, head, link) {
 		list_del(&rg->link);
 		kfree(rg);
 	}
@@ -1582,7 +1582,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
 					struct list_head *folio_list,
 					struct list_head *non_hvo_folios)
 {
-	struct folio *folio, *t_folio;
+	struct folio *folio;
 
 	if (!list_empty(non_hvo_folios)) {
 		/*
@@ -1592,7 +1592,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
 		 * hugetlb pages with vmemmap we will free up memory so that we
 		 * can allocate vmemmap for more hugetlb pages.
 		 */
-		list_for_each_entry_safe(folio, t_folio, non_hvo_folios, lru) {
+		list_for_each_entry_mutable(folio, non_hvo_folios, lru) {
 			list_del(&folio->lru);
 			spin_lock_irq(&hugetlb_lock);
 			__folio_clear_hugetlb(folio);
@@ -1611,7 +1611,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
 		 * If are able to restore vmemmap and free one hugetlb page, we
 		 * quit processing the list to retry the bulk operation.
 		 */
-		list_for_each_entry_safe(folio, t_folio, folio_list, lru)
+		list_for_each_entry_mutable(folio, folio_list, lru)
 			if (hugetlb_vmemmap_restore_folio(h, folio)) {
 				list_del(&folio->lru);
 				spin_lock_irq(&hugetlb_lock);
@@ -1633,7 +1633,7 @@ static void update_and_free_pages_bulk(struct hstate *h,
 						struct list_head *folio_list)
 {
 	long ret;
-	struct folio *folio, *t_folio;
+	struct folio *folio;
 	LIST_HEAD(non_hvo_folios);
 
 	/*
@@ -1664,7 +1664,7 @@ static void update_and_free_pages_bulk(struct hstate *h,
 		spin_unlock_irq(&hugetlb_lock);
 	}
 
-	list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
+	list_for_each_entry_mutable(folio, &non_hvo_folios, lru) {
 		update_and_free_hugetlb_folio(h, folio, false);
 		cond_resched();
 	}
@@ -1875,14 +1875,14 @@ void prep_and_add_allocated_folios(struct hstate *h,
 				   struct list_head *folio_list)
 {
 	unsigned long flags;
-	struct folio *folio, *tmp_f;
+	struct folio *folio;
 
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
 	/* Add all new pool pages to free lists in one lock cycle */
 	spin_lock_irqsave(&hugetlb_lock, flags);
-	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+	list_for_each_entry_mutable(folio, folio_list, lru) {
 		account_new_hugetlb_folio(h, folio);
 		enqueue_hugetlb_folio(h, folio);
 	}
@@ -2246,7 +2246,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	__must_hold(&hugetlb_lock)
 {
 	LIST_HEAD(surplus_list);
-	struct folio *folio, *tmp;
+	struct folio *folio;
 	int ret;
 	long i;
 	long needed, allocated;
@@ -2319,7 +2319,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	ret = 0;
 
 	/* Free the needed pages to the hugetlb pool */
-	list_for_each_entry_safe(folio, tmp, &surplus_list, lru) {
+	list_for_each_entry_mutable(folio, &surplus_list, lru) {
 		if ((--needed) < 0)
 			break;
 		/* Add the page to the hugetlb allocator */
@@ -2332,7 +2332,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * Free unnecessary surplus pages to the buddy allocator.
 	 * Pages have no ref count, call free_huge_folio directly.
 	 */
-	list_for_each_entry_safe(folio, tmp, &surplus_list, lru)
+	list_for_each_entry_mutable(folio, &surplus_list, lru)
 		free_huge_folio(folio);
 	spin_lock_irq(&hugetlb_lock);
 
@@ -3197,12 +3197,12 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 					struct list_head *folio_list)
 {
 	unsigned long flags;
-	struct folio *folio, *tmp_f;
+	struct folio *folio;
 
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_bootmem_folios(h, folio_list);
 
-	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+	list_for_each_entry_mutable(folio, folio_list, lru) {
 		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
 			/*
 			 * If HVO fails, initialize all tail struct pages
@@ -3281,10 +3281,10 @@ static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
 static void __init gather_bootmem_prealloc_node(unsigned long nid)
 {
 	LIST_HEAD(folio_list);
-	struct huge_bootmem_page *m, *tm;
+	struct huge_bootmem_page *m;
 	struct hstate *h = NULL, *prev_h = NULL;
 
-	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
+	list_for_each_entry_mutable(m, &huge_boot_pages[nid], list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
 
@@ -3669,9 +3669,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	 * Collect pages to be freed on a list, and free after dropping lock
 	 */
 	for_each_node_mask(i, *nodes_allowed) {
-		struct folio *folio, *next;
+		struct folio *folio;
 		struct list_head *freel = &h->hugepage_freelists[i];
-		list_for_each_entry_safe(folio, next, freel, lru) {
+		list_for_each_entry_mutable(folio, freel, lru) {
 			if (count >= h->nr_huge_pages)
 				goto out;
 			if (folio_test_highmem(folio))
@@ -3920,7 +3920,7 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst,
 				       struct list_head *src_list)
 {
 	long rc;
-	struct folio *folio, *next;
+	struct folio *folio;
 	LIST_HEAD(dst_list);
 	LIST_HEAD(ret_list);
 
@@ -3937,7 +3937,7 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst,
 	 */
 	mutex_lock(&dst->resize_lock);
 
-	list_for_each_entry_safe(folio, next, src_list, lru) {
+	list_for_each_entry_mutable(folio, src_list, lru) {
 		int i;
 		bool cma;
 
@@ -3995,9 +3995,9 @@ long demote_pool_huge_page(struct hstate *src, nodemask_t *nodes_allowed,
 
 	for_each_node_mask_to_free(src, nr_nodes, node, nodes_allowed) {
 		LIST_HEAD(list);
-		struct folio *folio, *next;
+		struct folio *folio;
 
-		list_for_each_entry_safe(folio, next, &src->hugepage_freelists[node], lru) {
+		list_for_each_entry_mutable(folio, &src->hugepage_freelists[node], lru) {
 			if (folio_test_hwpoison(folio))
 				continue;
 
@@ -4014,7 +4014,7 @@ long demote_pool_huge_page(struct hstate *src, nodemask_t *nodes_allowed,
 
 		spin_lock_irq(&hugetlb_lock);
 
-		list_for_each_entry_safe(folio, next, &list, lru) {
+		list_for_each_entry_mutable(folio, &list, lru) {
 			list_del(&folio->lru);
 			add_hugetlb_folio(src, folio, false);
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 133b46dfb09f..88552d60ae60 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -193,9 +193,9 @@ static inline void free_vmemmap_page(struct page *page)
 /* Free a list of the vmemmap pages */
 static void free_vmemmap_page_list(struct list_head *list)
 {
-	struct page *page, *next;
+	struct page *page;
 
-	list_for_each_entry_safe(page, next, list, lru)
+	list_for_each_entry_mutable(page, list, lru)
 		free_vmemmap_page(page);
 }
 
@@ -339,7 +339,7 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
 	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
 	int nid = page_to_nid((struct page *)start);
-	struct page *page, *next;
+	struct page *page;
 	int i;
 
 	for (i = 0; i < nr_pages; i++) {
@@ -352,7 +352,7 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
 
 	return 0;
 out:
-	list_for_each_entry_safe(page, next, list, lru)
+	list_for_each_entry_mutable(page, list, lru)
 		__free_page(page);
 	return -ENOMEM;
 }
@@ -454,12 +454,12 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 					struct list_head *folio_list,
 					struct list_head *non_hvo_folios)
 {
-	struct folio *folio, *t_folio;
+	struct folio *folio;
 	long restored = 0;
 	long ret = 0;
 	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
 
-	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
+	list_for_each_entry_mutable(folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
 			ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
 			if (ret)
@@ -800,7 +800,7 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
 
 void __init hugetlb_vmemmap_init_late(int nid)
 {
-	struct huge_bootmem_page *m, *tm;
+	struct huge_bootmem_page *m;
 	unsigned long phys, nr_pages, start, end;
 	unsigned long pfn, nr_mmap;
 	struct zone *zone = NULL;
@@ -810,7 +810,7 @@ void __init hugetlb_vmemmap_init_late(int nid)
 	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return;
 
-	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
+	list_for_each_entry_mutable(m, &huge_boot_pages[nid], list) {
 		if (!(m->flags & HUGE_BOOTMEM_HVO))
 			continue;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49..66a1d72b5cb8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -640,7 +640,7 @@ static void release_pte_folio(struct folio *folio)
 static void release_pte_pages(pte_t *pte, pte_t *_pte,
 		struct list_head *compound_pagelist)
 {
-	struct folio *folio, *tmp;
+	struct folio *folio;
 
 	while (--_pte >= pte) {
 		pte_t pteval = ptep_get(_pte);
@@ -658,7 +658,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
 		release_pte_folio(folio);
 	}
 
-	list_for_each_entry_safe(folio, tmp, compound_pagelist, lru) {
+	list_for_each_entry_mutable(folio, compound_pagelist, lru) {
 		list_del(&folio->lru);
 		release_pte_folio(folio);
 	}
@@ -835,7 +835,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 {
 	const unsigned long nr_pages = 1UL << order;
 	unsigned long end = address + (PAGE_SIZE * nr_pages);
-	struct folio *src, *tmp;
+	struct folio *src;
 	pte_t pteval;
 	pte_t *_pte;
 	unsigned int nr_ptes;
@@ -882,7 +882,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 		}
 	}
 
-	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
+	list_for_each_entry_mutable(src, compound_pagelist, lru) {
 		list_del(&src->lru);
 		node_stat_sub_folio(src, NR_ISOLATED_ANON +
 				folio_is_file_lru(src));
@@ -2244,7 +2244,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *dst;
-	struct folio *folio, *tmp, *new_folio;
+	struct folio *folio, *new_folio;
 	pgoff_t index = 0, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
@@ -2629,7 +2629,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	/*
 	 * The collapse has succeeded, so free the old folios.
 	 */
-	list_for_each_entry_safe(folio, tmp, &pagelist, lru) {
+	list_for_each_entry_mutable(folio, &pagelist, lru) {
 		list_del(&folio->lru);
 		lruvec_stat_mod_folio(folio, NR_FILE_PAGES,
 				      -folio_nr_pages(folio));
@@ -2654,7 +2654,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		shmem_uncharge(mapping->host, nr_none);
 	}
 
-	list_for_each_entry_safe(folio, tmp, &pagelist, lru) {
+	list_for_each_entry_mutable(folio, &pagelist, lru) {
 		list_del(&folio->lru);
 		folio_unlock(folio);
 		folio_putback_lru(folio);
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 7c7ba17ce7af..0c0265f7b19f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -537,7 +537,6 @@ static void mem_pool_free(struct kmemleak_object *object)
  */
 static void free_object_rcu(struct rcu_head *rcu)
 {
-	struct hlist_node *tmp;
 	struct kmemleak_scan_area *area;
 	struct kmemleak_object *object =
 		container_of(rcu, struct kmemleak_object, rcu);
@@ -546,7 +545,7 @@ static void free_object_rcu(struct rcu_head *rcu)
 	 * Once use_count is 0 (guaranteed by put_object), there is no other
 	 * code accessing this object, hence no need for locking.
 	 */
-	hlist_for_each_entry_safe(area, tmp, &object->area_list, node) {
+	hlist_for_each_entry_mutable(area, &object->area_list, node) {
 		hlist_del(&area->node);
 		kmem_cache_free(scan_area_cache, area);
 	}
@@ -2324,14 +2323,14 @@ static const struct file_operations kmemleak_fops = {
 
 static void __kmemleak_do_cleanup(void)
 {
-	struct kmemleak_object *object, *tmp;
+	struct kmemleak_object *object;
 	unsigned int cnt = 0;
 
 	/*
 	 * Kmemleak has already been disabled, no need for RCU list traversal
 	 * or kmemleak_lock held.
 	 */
-	list_for_each_entry_safe(object, tmp, &object_list, object_list) {
+	list_for_each_entry_mutable(object, &object_list, object_list) {
 		__remove_object(object);
 		__delete_object(object);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 7d5b76478f0b..f42bc885f179 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1145,7 +1145,6 @@ static int remove_stable_node_chain(struct ksm_stable_node *stable_node,
 				    struct rb_root *root)
 {
 	struct ksm_stable_node *dup;
-	struct hlist_node *hlist_safe;
 
 	if (!is_stable_node_chain(stable_node)) {
 		VM_BUG_ON(is_stable_node_dup(stable_node));
@@ -1155,8 +1154,7 @@ static int remove_stable_node_chain(struct ksm_stable_node *stable_node,
 			return false;
 	}
 
-	hlist_for_each_entry_safe(dup, hlist_safe,
-				  &stable_node->hlist, hlist_dup) {
+	hlist_for_each_entry_mutable(dup, &stable_node->hlist, hlist_dup) {
 		VM_BUG_ON(!is_stable_node_dup(dup));
 		if (remove_stable_node(dup))
 			return true;
@@ -1168,7 +1166,7 @@ static int remove_stable_node_chain(struct ksm_stable_node *stable_node,
 
 static int remove_all_stable_nodes(void)
 {
-	struct ksm_stable_node *stable_node, *next;
+	struct ksm_stable_node *stable_node;
 	int nid;
 	int err = 0;
 
@@ -1184,7 +1182,7 @@ static int remove_all_stable_nodes(void)
 			cond_resched();
 		}
 	}
-	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
+	list_for_each_entry_mutable(stable_node, &migrate_nodes, list) {
 		if (remove_stable_node(stable_node))
 			err = -EBUSY;
 		cond_resched();
@@ -1665,7 +1663,6 @@ static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup,
 				     bool prune_stale_stable_nodes)
 {
 	struct ksm_stable_node *dup, *found = NULL, *stable_node = *_stable_node;
-	struct hlist_node *hlist_safe;
 	struct folio *folio, *tree_folio = NULL;
 	int found_rmap_hlist_len;
 
@@ -1677,8 +1674,7 @@ static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup,
 	else
 		stable_node->chain_prune_time = jiffies;
 
-	hlist_for_each_entry_safe(dup, hlist_safe,
-				  &stable_node->hlist, hlist_dup) {
+	hlist_for_each_entry_mutable(dup, &stable_node->hlist, hlist_dup) {
 		cond_resched();
 		/*
 		 * We must walk all stable_node_dup to prune the stale
@@ -2611,11 +2607,10 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 		 * so prune them once before each full scan.
 		 */
 		if (!ksm_merge_across_nodes) {
-			struct ksm_stable_node *stable_node, *next;
+			struct ksm_stable_node *stable_node;
 			struct folio *folio;
 
-			list_for_each_entry_safe(stable_node, next,
-						 &migrate_nodes, list) {
+			list_for_each_entry_mutable(stable_node, &migrate_nodes, list) {
 				folio = ksm_get_folio(stable_node,
 						      KSM_GET_FOLIO_NOLOCK);
 				if (folio)
@@ -3323,7 +3318,6 @@ static bool stable_node_chain_remove_range(struct ksm_stable_node *stable_node,
 					   struct rb_root *root)
 {
 	struct ksm_stable_node *dup;
-	struct hlist_node *hlist_safe;
 
 	if (!is_stable_node_chain(stable_node)) {
 		VM_BUG_ON(is_stable_node_dup(stable_node));
@@ -3331,8 +3325,7 @@ static bool stable_node_chain_remove_range(struct ksm_stable_node *stable_node,
 						    end_pfn);
 	}
 
-	hlist_for_each_entry_safe(dup, hlist_safe,
-				  &stable_node->hlist, hlist_dup) {
+	hlist_for_each_entry_mutable(dup, &stable_node->hlist, hlist_dup) {
 		VM_BUG_ON(!is_stable_node_dup(dup));
 		stable_node_dup_remove_range(dup, start_pfn, end_pfn);
 	}
@@ -3346,7 +3339,7 @@ static bool stable_node_chain_remove_range(struct ksm_stable_node *stable_node,
 static void ksm_check_stable_tree(unsigned long start_pfn,
 				  unsigned long end_pfn)
 {
-	struct ksm_stable_node *stable_node, *next;
+	struct ksm_stable_node *stable_node;
 	struct rb_node *node;
 	int nid;
 
@@ -3364,7 +3357,7 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
 			cond_resched();
 		}
 	}
-	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
+	list_for_each_entry_mutable(stable_node, &migrate_nodes, list) {
 		if (stable_node->kpfn >= start_pfn &&
 		    stable_node->kpfn < end_pfn)
 			remove_node_from_stable_tree(stable_node);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 36662d02ff96..ab9f48828a05 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -340,7 +340,7 @@ __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l = NULL;
-	struct list_head *item, *n;
+	struct list_head *item;
 	unsigned long isolated = 0;
 
 restart:
@@ -348,7 +348,7 @@ __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 				   /*irq_flags=*/NULL, /*skip_empty=*/true);
 	if (!l)
 		return isolated;
-	list_for_each_safe(item, n, &l->list) {
+	list_for_each_mutable(item, &l->list) {
 		enum lru_status ret;
 
 		/*
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..2e32f84a109a 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -986,11 +986,11 @@ static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
 static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
 	struct eventfd_ctx *eventfd)
 {
-	struct mem_cgroup_eventfd_list *ev, *tmp;
+	struct mem_cgroup_eventfd_list *ev;
 
 	spin_lock(&memcg_oom_lock);
 
-	list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
+	list_for_each_entry_mutable(ev, &memcg->oom_notify, list) {
 		if (ev->eventfd == eventfd) {
 			list_del(&ev->list);
 			kfree(ev);
@@ -1242,7 +1242,7 @@ void memcg1_memcg_init(struct mem_cgroup *memcg)
 
 void memcg1_css_offline(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_event *event, *tmp;
+	struct mem_cgroup_event *event;
 
 	/*
 	 * Unregister events and notify userspace.
@@ -1250,7 +1250,7 @@ void memcg1_css_offline(struct mem_cgroup *memcg)
 	 * directory to avoid race between userspace and kernelspace.
 	 */
 	spin_lock_irq(&memcg->event_list_lock);
-	list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
+	list_for_each_entry_mutable(event, &memcg->event_list, list) {
 		list_del_init(&event->list);
 		schedule_work(&event->remove);
 	}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c405..e14d99adf378 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -423,9 +423,9 @@ static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
 static bool task_in_to_kill_list(struct list_head *to_kill,
 				 struct task_struct *tsk)
 {
-	struct to_kill *tk, *next;
+	struct to_kill *tk;
 
-	list_for_each_entry_safe(tk, next, to_kill, nd) {
+	list_for_each_entry_mutable(tk, to_kill, nd) {
 		if (tk->tsk == tsk)
 			return true;
 	}
@@ -450,9 +450,9 @@ void add_to_kill_ksm(struct task_struct *tsk, const struct page *p,
 static void kill_procs(struct list_head *to_kill, bool forcekill,
 		unsigned long pfn, int flags)
 {
-	struct to_kill *tk, *next;
+	struct to_kill *tk;
 
-	list_for_each_entry_safe(tk, next, to_kill, nd) {
+	list_for_each_entry_mutable(tk, to_kill, nd) {
 		if (forcekill) {
 			if (tk->addr == -EFAULT) {
 				pr_err("%#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
@@ -1860,11 +1860,11 @@ bool is_raw_hwpoison_page_in_hugepage(struct page *page)
 static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 {
 	struct llist_node *head;
-	struct raw_hwp_page *p, *next;
+	struct raw_hwp_page *p;
 	unsigned long count = 0;
 
 	head = llist_del_all(raw_hwp_list_head(folio));
-	llist_for_each_entry_safe(p, next, head, node) {
+	llist_for_each_entry_mutable(p, head, node) {
 		if (move_flag)
 			SetPageHWPoison(p->page);
 		else
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 54851d8a195b..4e0585925ae3 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -690,9 +690,9 @@ EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
 
 void mt_put_memory_types(struct list_head *memory_types)
 {
-	struct memory_dev_type *mtype, *mtn;
+	struct memory_dev_type *mtype;
 
-	list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+	list_for_each_entry_mutable(mtype, memory_types, list) {
 		list_del(&mtype->list);
 		put_memory_type(mtype);
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index d9b23909d716..acc7925d1d1b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -257,9 +257,8 @@ static int migrate_movable_ops_page(struct page *dst, struct page *src,
 void putback_movable_pages(struct list_head *l)
 {
 	struct folio *folio;
-	struct folio *folio2;
 
-	list_for_each_entry_safe(folio, folio2, l, lru) {
+	list_for_each_entry_mutable(folio, l, lru) {
 		if (unlikely(folio_test_hugetlb(folio))) {
 			folio_putback_hugetlb(folio);
 			continue;
@@ -336,7 +335,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 }
 
 struct rmap_walk_arg {
-	struct folio *folio;
+	struct folio *folio, *folio2;
 	bool map_unused_to_zeropage;
 };
 
@@ -1634,14 +1633,14 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
 	int nr_failed = 0;
 	int nr_retry_pages = 0;
 	int pass = 0;
-	struct folio *folio, *folio2;
+	struct folio *folio;
 	int rc, nr_pages;
 
 	for (pass = 0; pass < NR_MAX_MIGRATE_PAGES_RETRY && retry; pass++) {
 		retry = 0;
 		nr_retry_pages = 0;
 
-		list_for_each_entry_safe(folio, folio2, from, lru) {
+		list_for_each_entry_mutable(folio, from, lru) {
 			if (!folio_test_hugetlb(folio))
 				continue;
 
@@ -1722,14 +1721,14 @@ static void migrate_folios_move(struct list_head *src_folios,
 		int *retry, int *thp_retry, int *nr_failed,
 		int *nr_retry_pages)
 {
-	struct folio *folio, *folio2, *dst, *dst2;
+	struct folio *folio, *dst, *dst2;
 	bool is_thp;
 	int nr_pages;
 	int rc;
 
 	dst = list_first_entry(dst_folios, struct folio, lru);
 	dst2 = list_next_entry(dst, lru);
-	list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+	list_for_each_entry_mutable(folio, src_folios, lru) {
 		is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
 		nr_pages = folio_nr_pages(folio);
 
@@ -1770,11 +1769,11 @@ static void migrate_folios_undo(struct list_head *src_folios,
 		free_folio_t put_new_folio, unsigned long private,
 		struct list_head *ret_folios)
 {
-	struct folio *folio, *folio2, *dst, *dst2;
+	struct folio *folio, *dst, *dst2;
 
 	dst = list_first_entry(dst_folios, struct folio, lru);
 	dst2 = list_next_entry(dst, lru);
-	list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+	list_for_each_entry_mutable(folio, src_folios, lru) {
 		int old_folio_state = 0;
 		struct anon_vma *anon_vma = NULL;
 
@@ -1810,7 +1809,7 @@ static int migrate_pages_batch(struct list_head *from,
 	int pass = 0;
 	bool is_thp = false;
 	bool is_large = false;
-	struct folio *folio, *folio2, *dst = NULL;
+	struct folio *folio, *dst = NULL;
 	int rc, rc_saved = 0, nr_pages;
 	LIST_HEAD(unmap_folios);
 	LIST_HEAD(dst_folios);
@@ -1824,7 +1823,7 @@ static int migrate_pages_batch(struct list_head *from,
 		thp_retry = 0;
 		nr_retry_pages = 0;
 
-		list_for_each_entry_safe(folio, folio2, from, lru) {
+		list_for_each_entry_mutable(folio, from, lru) {
 			is_large = folio_test_large(folio);
 			is_thp = folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
@@ -2109,7 +2108,7 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 
 again:
 	nr_pages = 0;
-	list_for_each_entry_safe(folio, folio2, from, lru) {
+	list_for_each_entry_mutable(folio, folio2, from, lru) {
 		/* Retried hugetlb folios will be kept in list  */
 		if (folio_test_hugetlb(folio)) {
 			list_move_tail(&folio->lru, &ret_folios);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 245b74f39f91..7d4ccf9853a3 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -131,7 +131,6 @@ mn_itree_inv_next(struct mmu_interval_notifier *interval_sub,
 static void mn_itree_inv_end(struct mmu_notifier_subscriptions *subscriptions)
 {
 	struct mmu_interval_notifier *interval_sub;
-	struct hlist_node *next;
 
 	spin_lock(&subscriptions->lock);
 	if (--subscriptions->active_invalidate_ranges ||
@@ -149,9 +148,7 @@ static void mn_itree_inv_end(struct mmu_notifier_subscriptions *subscriptions)
 	 * they are progressed. This arrangement for tree updates is used to
 	 * avoid using a blocking lock during invalidate_range_start.
 	 */
-	hlist_for_each_entry_safe(interval_sub, next,
-				  &subscriptions->deferred_list,
-				  deferred_item) {
+	hlist_for_each_entry_mutable(interval_sub, &subscriptions->deferred_list, deferred_item) {
 		if (RB_EMPTY_NODE(&interval_sub->interval_tree.rb))
 			interval_tree_insert(&interval_sub->interval_tree,
 					     &subscriptions->itree);
@@ -263,9 +260,9 @@ EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
 static void mn_itree_finish_pass(struct llist_head *finish_passes)
 {
 	struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
-	struct mmu_interval_notifier_finish *f, *next;
+	struct mmu_interval_notifier_finish *f;
 
-	llist_for_each_entry_safe(f, next, first, link)
+	llist_for_each_entry_mutable(f, first, link)
 		f->notifier->ops->invalidate_finish(f);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f..6d29df3e2973 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1559,10 +1559,10 @@ static void free_one_page(struct zone *zone, struct page *page,
 	llhead = &zone->trylock_free_pages;
 	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
 		struct llist_node *llnode;
-		struct page *p, *tmp;
+		struct page *p;
 
 		llnode = llist_del_all(llhead);
-		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+		llist_for_each_entry_mutable(p, llnode, pcp_llist) {
 			unsigned int p_order = p->private;
 
 			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
@@ -7022,10 +7022,10 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
 	int order;
 
 	for (order = 0; order < NR_PAGE_ORDERS; order++) {
-		struct page *page, *next;
+		struct page *page;
 		int nr_pages = 1 << order;
 
-		list_for_each_entry_safe(page, next, &list[order], lru) {
+		list_for_each_entry_mutable(page, &list[order], lru) {
 			int i;
 
 			post_alloc_hook(page, order, gfp_mask);
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 7418f2e500bb..849266216c9f 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -180,7 +180,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
 
 	/* loop through free list adding unreported pages to sg list */
-	list_for_each_entry_safe(page, next, list, lru) {
+	list_for_each_entry_mutable(page, next, list, lru) {
 		/* We are going to skip over the reported pages. */
 		if (PageReported(page))
 			continue;
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..ae932e0e1ae6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1741,7 +1741,7 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	bool do_warn;
 	struct obj_cgroup *objcg = NULL;
 	static atomic_t warn_limit = ATOMIC_INIT(10);
-	struct pcpu_chunk *chunk, *next;
+	struct pcpu_chunk *chunk;
 	const char *err;
 	int slot, off, cpu, ret;
 	unsigned long flags;
@@ -1814,8 +1814,7 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 restart:
 	/* search through normal chunks */
 	for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) {
-		list_for_each_entry_safe(chunk, next, &pcpu_chunk_lists[slot],
-					 list) {
+		list_for_each_entry_mutable(chunk, &pcpu_chunk_lists[slot], list) {
 			off = pcpu_find_block_fit(chunk, bits, bit_align,
 						  is_atomic);
 			if (off < 0) {
@@ -1952,7 +1951,7 @@ static void pcpu_balance_free(bool empty_only)
 {
 	LIST_HEAD(to_free);
 	struct list_head *free_head = &pcpu_chunk_lists[pcpu_free_slot];
-	struct pcpu_chunk *chunk, *next;
+	struct pcpu_chunk *chunk;
 
 	lockdep_assert_held(&pcpu_lock);
 
@@ -1960,7 +1959,7 @@ static void pcpu_balance_free(bool empty_only)
 	 * There's no reason to keep around multiple unused chunks and VM
 	 * areas can be scarce.  Destroy all free chunks except for one.
 	 */
-	list_for_each_entry_safe(chunk, next, free_head, list) {
+	list_for_each_entry_mutable(chunk, free_head, list) {
 		WARN_ON(chunk->immutable);
 
 		/* spare the first one */
@@ -1975,7 +1974,7 @@ static void pcpu_balance_free(bool empty_only)
 		return;
 
 	spin_unlock_irq(&pcpu_lock);
-	list_for_each_entry_safe(chunk, next, &to_free, list) {
+	list_for_each_entry_mutable(chunk, &to_free, list) {
 		unsigned int rs, re;
 
 		for_each_set_bitrange(rs, re, chunk->populated, chunk->nr_pages) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c..723b4bdb447d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -426,7 +426,7 @@ static struct {
 
 static void kernel_pgtable_work_func(struct work_struct *work)
 {
-	struct ptdesc *pt, *next;
+	struct ptdesc *pt;
 	LIST_HEAD(page_list);
 
 	spin_lock(&kernel_pgtable_work.lock);
@@ -434,7 +434,7 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	spin_unlock(&kernel_pgtable_work.lock);
 
 	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
-	list_for_each_entry_safe(pt, next, &page_list, pt_list)
+	list_for_each_entry_mutable(pt, &page_list, pt_list)
 		__pagetable_free(pt);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c77d5dc06e9..37164f446d2d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -451,9 +451,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
  */
 static void cleanup_partial_anon_vmas(struct vm_area_struct *vma)
 {
-	struct anon_vma_chain *avc, *next;
+	struct anon_vma_chain *avc;
 
-	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+	list_for_each_entry_mutable(avc, &vma->anon_vma_chain, same_vma) {
 		list_del(&avc->same_vma);
 		anon_vma_chain_free(avc);
 	}
@@ -478,7 +478,7 @@ static void cleanup_partial_anon_vmas(struct vm_area_struct *vma)
  */
 void unlink_anon_vmas(struct vm_area_struct *vma)
 {
-	struct anon_vma_chain *avc, *next;
+	struct anon_vma_chain *avc;
 	struct anon_vma *active_anon_vma = vma->anon_vma;
 
 	/* Always hold mmap lock, read-lock on unmap possibly. */
@@ -496,7 +496,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	 * Unlink each anon_vma chained to the VMA.  This list is ordered
 	 * from newest to oldest, ensuring the root anon_vma gets freed last.
 	 */
-	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+	list_for_each_entry_mutable(avc, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
 
 		anon_vma_interval_tree_remove(avc, &anon_vma->rb_root);
@@ -528,7 +528,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	 * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
 	 * needing to write-acquire the anon_vma->root->rwsem.
 	 */
-	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+	list_for_each_entry_mutable(avc, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
 
 		VM_WARN_ON(anon_vma->num_children);
diff --git a/mm/shmem.c b/mm/shmem.c
index b51f83c970bb..9f03e46bfde2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -727,7 +727,8 @@ static const char *shmem_format_huge(int huge)
 static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 		struct shrink_control *sc, unsigned long nr_to_free)
 {
-	LIST_HEAD(list), *pos, *next;
+	LIST_HEAD(list);
+	struct list_head *pos;
 	struct inode *inode;
 	struct shmem_inode_info *info;
 	struct folio *folio;
@@ -738,7 +739,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 		return SHRINK_STOP;
 
 	spin_lock(&sbinfo->shrinklist_lock);
-	list_for_each_safe(pos, next, &sbinfo->shrinklist) {
+	list_for_each_mutable(pos, &sbinfo->shrinklist) {
 		info = list_entry(pos, struct shmem_inode_info, shrinklist);
 
 		/* pin the inode */
@@ -758,7 +759,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 	}
 	spin_unlock(&sbinfo->shrinklist_lock);
 
-	list_for_each_safe(pos, next, &list) {
+	list_for_each_mutable(pos, &list) {
 		pgoff_t next, end;
 		loff_t i_size;
 		int ret;
@@ -1547,7 +1548,7 @@ int shmem_unuse(unsigned int type)
 
 	spin_lock(&shmem_swaplist_lock);
 start_over:
-	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
+	list_for_each_entry_mutable(info, next, &shmem_swaplist, swaplist) {
 		if (!info->swapped) {
 			list_del_init(&info->swaplist);
 			continue;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b6426d7ceec9..489e8e0800b6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1465,7 +1465,7 @@ static int
 drain_page_cache(struct kfree_rcu_cpu *krcp)
 {
 	unsigned long flags;
-	struct llist_node *page_list, *pos, *n;
+	struct llist_node *page_list, *pos;
 	int freed = 0;
 
 	if (!rcu_min_cached_objs)
@@ -1476,7 +1476,7 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
 	WRITE_ONCE(krcp->nr_bkv_objs, 0);
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
-	llist_for_each_safe(pos, n, page_list) {
+	llist_for_each_mutable(pos, page_list) {
 		free_page((unsigned long)pos);
 		freed++;
 	}
@@ -1550,7 +1550,7 @@ kvfree_rcu_list(struct rcu_head *head)
 static void kfree_rcu_work(struct work_struct *work)
 {
 	unsigned long flags;
-	struct kvfree_rcu_bulk_data *bnode, *n;
+	struct kvfree_rcu_bulk_data *bnode;
 	struct list_head bulk_head[FREE_N_CHANNELS];
 	struct rcu_head *head;
 	struct kfree_rcu_cpu *krcp;
@@ -1576,7 +1576,7 @@ static void kfree_rcu_work(struct work_struct *work)
 	// Handle the first two channels.
 	for (i = 0; i < FREE_N_CHANNELS; i++) {
 		// Start from the tail page, so a GP is likely passed for it.
-		list_for_each_entry_safe(bnode, n, &bulk_head[i], list)
+		list_for_each_entry_mutable(bnode, &bulk_head[i], list)
 			kvfree_rcu_bulk(krcp, bnode, i);
 	}
 
@@ -1674,7 +1674,7 @@ static void
 kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
 {
 	struct list_head bulk_ready[FREE_N_CHANNELS];
-	struct kvfree_rcu_bulk_data *bnode, *n;
+	struct kvfree_rcu_bulk_data *bnode;
 	struct rcu_head *head_ready = NULL;
 	unsigned long flags;
 	int i;
@@ -1683,7 +1683,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
 	for (i = 0; i < FREE_N_CHANNELS; i++) {
 		INIT_LIST_HEAD(&bulk_ready[i]);
 
-		list_for_each_entry_safe_reverse(bnode, n, &krcp->bulk_head[i], list) {
+		list_for_each_entry_mutable_reverse(bnode, &krcp->bulk_head[i], list) {
 			if (!poll_state_synchronize_rcu_full(&bnode->gp_snap))
 				break;
 
@@ -1700,7 +1700,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
 	for (i = 0; i < FREE_N_CHANNELS; i++) {
-		list_for_each_entry_safe(bnode, n, &bulk_ready[i], list)
+		list_for_each_entry_mutable(bnode, &bulk_ready[i], list)
 			kvfree_rcu_bulk(krcp, bnode, i);
 	}
 
diff --git a/mm/slub.c b/mm/slub.c
index 9ec774dc7009..6f4a79e32d75 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3253,7 +3253,7 @@ static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
 {
 	LIST_HEAD(empty_list);
 	LIST_HEAD(full_list);
-	struct slab_sheaf *sheaf, *sheaf2;
+	struct slab_sheaf *sheaf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&barn->lock, flags);
@@ -3265,12 +3265,12 @@ static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
 
 	spin_unlock_irqrestore(&barn->lock, flags);
 
-	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
+	list_for_each_entry_mutable(sheaf, &full_list, barn_list) {
 		sheaf_flush_unused(s, sheaf);
 		free_empty_sheaf(s, sheaf);
 	}
 
-	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
+	list_for_each_entry_mutable(sheaf, &empty_list, barn_list)
 		free_empty_sheaf(s, sheaf);
 }
 
@@ -3757,7 +3757,7 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
 				  struct partial_bulk_context *pc,
 				  bool allow_spin)
 {
-	struct slab *slab, *slab2;
+	struct slab *slab;
 	struct slab *first = NULL, *last = NULL;
 	unsigned int total_free = 0;
 	unsigned long flags;
@@ -3773,7 +3773,7 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
 	else if (!spin_trylock_irqsave(&n->list_lock, flags))
 		return false;
 
-	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+	list_for_each_entry_mutable(slab, &n->partial, slab_list) {
 		struct freelist_counters flc;
 		unsigned int slab_free;
 
@@ -3828,7 +3828,7 @@ static void *get_from_partial_node(struct kmem_cache *s,
 				   gfp_t gfp_flags,
 				   const struct slab_alloc_context *ac)
 {
-	struct slab *slab, *slab2;
+	struct slab *slab;
 	unsigned long flags;
 	void *object = NULL;
 
@@ -3845,7 +3845,7 @@ static void *get_from_partial_node(struct kmem_cache *s,
 		spin_lock_irqsave(&n->list_lock, flags);
 	else if (!spin_trylock_irqsave(&n->list_lock, flags))
 		return NULL;
-	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+	list_for_each_entry_mutable(slab, &n->partial, slab_list) {
 
 		struct freelist_counters old, new;
 
@@ -6345,13 +6345,13 @@ static void free_deferred_objects(struct irq_work *work)
 {
 	struct defer_free *df = container_of(work, struct defer_free, work);
 	struct llist_head *objs = &df->objects;
-	struct llist_node *llnode, *pos, *t;
+	struct llist_node *llnode, *pos;
 
 	if (llist_empty(objs))
 		return;
 
 	llnode = llist_del_all(objs);
-	llist_for_each_safe(pos, t, llnode) {
+	llist_for_each_mutable(pos, llnode) {
 		struct kmem_cache *s;
 		struct slab *slab;
 		void *x = pos;
@@ -7185,7 +7185,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
 		      bool allow_spin)
 {
 	struct partial_bulk_context pc;
-	struct slab *slab, *slab2;
+	struct slab *slab;
 	unsigned int refilled = 0;
 	unsigned long flags;
 	void *object;
@@ -7197,7 +7197,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
 	if (!get_partial_node_bulk(s, n, &pc, allow_spin))
 		return 0;
 
-	list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+	list_for_each_entry_mutable(slab, &pc.slabs, slab_list) {
 
 		unsigned int count;
 
@@ -8031,11 +8031,11 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab)
 static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 {
 	LIST_HEAD(discard);
-	struct slab *slab, *h;
+	struct slab *slab;
 
 	BUG_ON(irqs_disabled());
 	spin_lock_irq(&n->list_lock);
-	list_for_each_entry_safe(slab, h, &n->partial, slab_list) {
+	list_for_each_entry_mutable(slab, &n->partial, slab_list) {
 		if (!slab->inuse) {
 			remove_partial(n, slab);
 			list_add(&slab->slab_list, &discard);
@@ -8045,7 +8045,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	}
 	spin_unlock_irq(&n->list_lock);
 
-	list_for_each_entry_safe(slab, h, &discard, slab_list)
+	list_for_each_entry_mutable(slab, &discard, slab_list)
 		discard_slab(s, slab);
 }
 
@@ -8286,7 +8286,6 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 	int i;
 	struct kmem_cache_node *n;
 	struct slab *slab;
-	struct slab *t;
 	struct list_head discard;
 	struct list_head promote[SHRINK_PROMOTE_MAX];
 	unsigned long flags;
@@ -8312,7 +8311,7 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		 * Note that concurrent frees may occur while we hold the
 		 * list_lock. slab->inuse here is the upper limit.
 		 */
-		list_for_each_entry_safe(slab, t, &n->partial, slab_list) {
+		list_for_each_entry_mutable(slab, &n->partial, slab_list) {
 			int free = slab->objects - slab->inuse;
 
 			/* Do not reread slab->inuse */
@@ -8339,7 +8338,7 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(slab, t, &discard, slab_list)
+		list_for_each_entry_mutable(slab, &discard, slab_list)
 			free_slab(s, slab);
 
 		if (node_nr_slabs(n))
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..e050b3894d6f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2825,14 +2825,14 @@ static int try_to_unuse(unsigned int type)
  */
 static void drain_mmlist(void)
 {
-	struct list_head *p, *next;
+	struct list_head *p;
 	unsigned int type;
 
 	for (type = 0; type < nr_swapfiles; type++)
 		if (swap_usage_in_pages(swap_info[type]))
 			return;
 	spin_lock(&mmlist_lock);
-	list_for_each_safe(p, next, &init_mm.mmlist)
+	list_for_each_mutable(p, &init_mm.mmlist)
 		list_del_init(p);
 	spin_unlock(&mmlist_lock);
 }
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b8d2d87ce8d7..78ef5f7e3f67 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -3010,9 +3010,9 @@ static void dup_fctx(struct userfaultfd_fork_ctx *fctx)
 
 void dup_userfaultfd_complete(struct list_head *fcs)
 {
-	struct userfaultfd_fork_ctx *fctx, *n;
+	struct userfaultfd_fork_ctx *fctx;
 
-	list_for_each_entry_safe(fctx, n, fcs, list) {
+	list_for_each_entry_mutable(fctx, fcs, list) {
 		dup_fctx(fctx);
 		list_del(&fctx->list);
 		kfree(fctx);
@@ -3021,7 +3021,7 @@ void dup_userfaultfd_complete(struct list_head *fcs)
 
 void dup_userfaultfd_fail(struct list_head *fcs)
 {
-	struct userfaultfd_fork_ctx *fctx, *n;
+	struct userfaultfd_fork_ctx *fctx;
 
 	/*
 	 * An error has occurred on fork, we will tear memory down, but have
@@ -3033,7 +3033,7 @@ void dup_userfaultfd_fail(struct list_head *fcs)
 	 *
 	 * mm tear down will take care of cleaning up VMA contexts.
 	 */
-	list_for_each_entry_safe(fctx, n, fcs, list) {
+	list_for_each_entry_mutable(fctx, fcs, list) {
 		struct userfaultfd_ctx *octx = fctx->orig;
 		struct userfaultfd_ctx *ctx = fctx->new;
 
@@ -3170,10 +3170,10 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma, unsigned long start,
 
 void userfaultfd_unmap_complete(struct mm_struct *mm, struct list_head *uf)
 {
-	struct userfaultfd_unmap_ctx *ctx, *n;
+	struct userfaultfd_unmap_ctx *ctx;
 	struct userfaultfd_wait_queue ewq;
 
-	list_for_each_entry_safe(ctx, n, uf, list) {
+	list_for_each_entry_mutable(ctx, uf, list) {
 		msg_init(&ewq.msg);
 
 		ewq.msg.event = UFFD_EVENT_UNMAP;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1afca3568b9b..2b510e7651df 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2202,13 +2202,13 @@ static void purge_fragmented_blocks_allcpus(void);
 static void
 reclaim_list_global(struct list_head *head)
 {
-	struct vmap_area *va, *n;
+	struct vmap_area *va;
 
 	if (list_empty(head))
 		return;
 
 	spin_lock(&free_vmap_area_lock);
-	list_for_each_entry_safe(va, n, head, list)
+	list_for_each_entry_mutable(va, head, list)
 		merge_or_add_vmap_area_augment(va,
 			&free_vmap_area_root, &free_vmap_area_list);
 	spin_unlock(&free_vmap_area_lock);
@@ -2219,7 +2219,7 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 {
 	LIST_HEAD(decay_list);
 	struct rb_root decay_root = RB_ROOT;
-	struct vmap_area *va, *nva;
+	struct vmap_area *va;
 	unsigned long n_decay, pool_len;
 	int i;
 
@@ -2242,7 +2242,7 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 			n_decay >>= 2;
 		pool_len -= n_decay;
 
-		list_for_each_entry_safe(va, nva, &tmp_list, list) {
+		list_for_each_entry_mutable(va, &tmp_list, list) {
 			if (!n_decay--)
 				break;
 
@@ -2299,7 +2299,7 @@ static void purge_vmap_node(struct work_struct *work)
 	struct vmap_node *vn = container_of(work,
 		struct vmap_node, purge_work);
 	unsigned long nr_purged_pages = 0;
-	struct vmap_area *va, *n_va;
+	struct vmap_area *va;
 	LIST_HEAD(local_list);
 
 	if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
@@ -2307,7 +2307,7 @@ static void purge_vmap_node(struct work_struct *work)
 
 	vn->nr_purged = 0;
 
-	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
+	list_for_each_entry_mutable(va, &vn->purge_list, list) {
 		unsigned long nr = va_size(va) >> PAGE_SHIFT;
 		unsigned int vn_id = decode_vn_id(va->flags);
 
@@ -2803,9 +2803,9 @@ static bool purge_fragmented_block(struct vmap_block *vb,
 
 static void free_purged_blocks(struct list_head *purge_list)
 {
-	struct vmap_block *vb, *n_vb;
+	struct vmap_block *vb;
 
-	list_for_each_entry_safe(vb, n_vb, purge_list, purge) {
+	list_for_each_entry_mutable(vb, purge_list, purge) {
 		list_del(&vb->purge);
 		free_vmap_block(vb);
 	}
@@ -3386,9 +3386,9 @@ static void vm_reset_perms(struct vm_struct *area)
 static void delayed_vfree_work(struct work_struct *w)
 {
 	struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
-	struct llist_node *t, *llnode;
+	struct llist_node *llnode;
 
-	llist_for_each_safe(llnode, t, llist_del_all(&p->list))
+	llist_for_each_mutable(llnode, llist_del_all(&p->list))
 		vfree(llnode);
 }
 
@@ -3775,14 +3775,14 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 static LLIST_HEAD(pending_vm_area_cleanup);
 static void cleanup_vm_area_work(struct work_struct *work)
 {
-	struct vm_struct *area, *tmp;
+	struct vm_struct *area;
 	struct llist_node *head;
 
 	head = llist_del_all(&pending_vm_area_cleanup);
 	if (!head)
 		return;
 
-	llist_for_each_entry_safe(area, tmp, head, llnode) {
+	llist_for_each_entry_mutable(area, head, llnode) {
 		if (!area->pages)
 			free_vm_area(area);
 		else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 35c3bb15ae96..d7c4ded7a8fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1598,11 +1598,11 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	};
 	struct reclaim_stat stat;
 	unsigned int nr_reclaimed;
-	struct folio *folio, *next;
+	struct folio *folio;
 	LIST_HEAD(clean_folios);
 	unsigned int noreclaim_flag;
 
-	list_for_each_entry_safe(folio, next, folio_list, lru) {
+	list_for_each_entry_mutable(folio, folio_list, lru) {
 		/* TODO: these pages should not even appear in this list. */
 		if (page_has_movable_ops(&folio->page))
 			continue;
@@ -4805,7 +4805,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
 	struct folio *folio;
-	struct folio *next;
 	enum node_stat_item item;
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
@@ -4841,7 +4840,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
-	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
+	list_for_each_entry_mutable_reverse(folio, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
 
 		if (!folio_evictable(folio)) {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 83f5820c45f9..2ac86c758e0b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1806,7 +1806,7 @@ static void async_free_zspage(struct work_struct *work)
 {
 	int i;
 	struct size_class *class;
-	struct zspage *zspage, *tmp;
+	struct zspage *zspage;
 	LIST_HEAD(free_pages);
 	struct zs_pool *pool = container_of(work, struct zs_pool,
 					free_work);
@@ -1822,7 +1822,7 @@ static void async_free_zspage(struct work_struct *work)
 		spin_unlock(&class->lock);
 	}
 
-	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
+	list_for_each_entry_mutable(zspage, &free_pages, list) {
 		list_del(&zspage->list);
 		lock_zspage(zspage);
 
-- 
2.43.0



* [PATCH v3 2/7] llist: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-22  4:05 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

From: Kaitao Cheng <chengkaitao@kylinos.cn>

llist_for_each_safe() and llist_for_each_entry_safe() require callers to
provide a temporary cursor even when the cursor is only needed by the
iterator itself.  This makes call sites noisier than necessary for the
common case where the loop body may remove the current entry but does
not otherwise inspect the saved next pointer.

Add llist_for_each_mutable() and llist_for_each_entry_mutable() variants
that support both forms.  Callers may omit the temporary cursor and let
the helper create an internal unique cursor, or keep passing an explicit
cursor when the loop needs to inspect or reset it.

Keep the existing safe helpers as compatibility wrappers so current users
continue to build unchanged while new code can use the shorter mutable
form.
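
For readers following along, here is a minimal userspace sketch of the
argument-count dispatch described above. It is not the kernel
implementation. The names struct node, NARGS, CAT, for_each_safe,
for_each_mutable*, build and free_all_* are hypothetical stand-ins for
struct llist_node, COUNT_ARGS, CONCATENATE and the helpers in this patch.
__COUNTER__ and __typeof__ are GCC/Clang extensions, assumed here in place
of __UNIQUE_ID() and typeof.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct llist_node plus a payload. */
struct node {
	struct node *next;
	int val;
};

/* Two-level paste so that arguments are expanded before concatenation. */
#define CAT_(a, b) a##b
#define CAT(a, b) CAT_(a, b)

/* Count 1..3 variadic arguments (simplified COUNT_ARGS stand-in). */
#define NARGS_(_1, _2, _3, cnt, ...) cnt
#define NARGS(...) NARGS_(__VA_ARGS__, 3, 2, 1, 0)

/* Classic form: the caller declares and passes the saved-next cursor. */
#define for_each_safe(pos, n, head) \
	for ((pos) = (head); (pos) && ((n) = (pos)->next, 1); (pos) = (n))

/* Short form: the saved-next cursor is declared inside the for statement. */
#define for_each_mutable_internal(pos, tmp, head)			\
	for (__typeof__(pos) tmp = ((pos) = (head)) ? (pos)->next : NULL; \
	     (pos);							\
	     (pos) = tmp, tmp = (pos) ? (pos)->next : NULL)

/* One extra argument (head): generate a unique hidden cursor name. */
#define for_each_mutable1(pos, head) \
	for_each_mutable_internal(pos, CAT(tmp_, __COUNTER__), head)

/* Two extra arguments (next, head): fall back to the explicit form. */
#define for_each_mutable2(pos, next, head) \
	for_each_safe(pos, next, head)

/* Dispatch on the number of arguments that follow pos. */
#define for_each_mutable(pos, ...) \
	CAT(for_each_mutable, NARGS(__VA_ARGS__))(pos, __VA_ARGS__)

/* Build a list holding 1..count, in that order. */
struct node *build(int count)
{
	struct node *head = NULL;

	for (int i = count; i >= 1; i--) {
		struct node *nd = malloc(sizeof(*nd));

		if (!nd)
			abort();
		nd->val = i;
		nd->next = head;
		head = nd;
	}
	return head;
}

/* Free every node using the short form; returns the sum of the values. */
int free_all_implicit(struct node *head)
{
	struct node *pos;
	int sum = 0;

	for_each_mutable(pos, head) {
		sum += pos->val;
		free(pos);	/* safe: the next pointer was saved already */
	}
	return sum;
}

/* Same loop with a caller-visible cursor. */
int free_all_explicit(struct node *head)
{
	struct node *pos, *next;
	int sum = 0;

	for_each_mutable(pos, next, head) {
		sum += pos->val;
		free(pos);
	}
	return sum;
}
```

In this sketch free_all_implicit(build(4)) and free_all_explicit(build(4))
both walk and free four nodes. The only difference at the call site is
whether the saved-next cursor is visible to the loop body.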

Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 include/linux/llist.h | 81 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 65 insertions(+), 16 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 8846b7709669..1c6f12411d5e 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -49,6 +49,7 @@
  */
 
 #include <linux/atomic.h>
+#include <linux/args.h>
 #include <linux/container_of.h>
 #include <linux/stddef.h>
 #include <linux/types.h>
@@ -143,12 +144,33 @@ static inline bool llist_on_list(const struct llist_node *node)
 #define llist_for_each(pos, node)			\
 	for ((pos) = (node); pos; (pos) = (pos)->next)
 
+/*
+ * llist_for_each_safe is an old interface, use llist_for_each_mutable instead.
+ */
+#define llist_for_each_safe(pos, n, node)			\
+	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))
+
+#define __llist_for_each_mutable_internal(pos, tmp, node)		\
+	for (typeof(pos) tmp = ((pos) = (node)) ? (pos)->next : NULL;	\
+	     (pos);							\
+	     (pos) = tmp, tmp = (pos) ? (pos)->next : NULL)
+
+#define __llist_for_each_mutable1(pos, node)				\
+	__llist_for_each_mutable_internal(pos, __UNIQUE_ID(next), node)
+
+#define __llist_for_each_mutable2(pos, next, node)			\
+	llist_for_each_safe(pos, next, node)
+
 /**
- * llist_for_each_safe - iterate over some deleted entries of a lock-less list
- *			 safe against removal of list entry
+ * llist_for_each_mutable - iterate over some deleted entries of a lock-less list
+ *			    safe against removal of list entry
  * @pos:	the &struct llist_node to use as a loop cursor
- * @n:		another &struct llist_node to use as temporary storage
- * @node:	the first entry of deleted list entries
+ * @...:	either (node) or (next, node)
+ *
+ * next:	another &struct llist_node to use as optional temporary storage.
+ *		The temporary cursor is internal unless explicitly supplied by
+ *		the caller.
+ * node:	the first entry of deleted list entries
  *
  * In general, some entries of the lock-less list can be traversed
  * safely only after being deleted from list, so start with an entry
@@ -159,8 +181,9 @@ static inline bool llist_on_list(const struct llist_node *node)
  * you want to traverse from the oldest to the newest, you must
  * reverse the order by yourself before traversing.
  */
-#define llist_for_each_safe(pos, n, node)			\
-	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))
+#define llist_for_each_mutable(pos, ...)				\
+	CONCATENATE(__llist_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
+		(pos, __VA_ARGS__)
 
 /**
  * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
@@ -182,13 +205,41 @@ static inline bool llist_on_list(const struct llist_node *node)
 	     member_address_is_nonnull(pos, member);			\
 	     (pos) = llist_entry((pos)->member.next, typeof(*(pos)), member))
 
+/*
+ * llist_for_each_entry_safe is an old interface, use llist_for_each_entry_mutable instead.
+ */
+#define llist_for_each_entry_safe(pos, n, node, member)			       \
+	for (pos = llist_entry((node), typeof(*pos), member);		       \
+	     member_address_is_nonnull(pos, member) &&			       \
+	        (n = llist_entry(pos->member.next, typeof(*n), member), true); \
+	     pos = n)
+
+#define __llist_for_each_entry_mutable_internal(pos, tmp, node, member)	\
+	for (typeof(pos) tmp = ((pos) = llist_entry((node), typeof(*pos), member), \
+		member_address_is_nonnull(pos, member) ?			\
+		llist_entry((pos)->member.next, typeof(*pos), member) : NULL);	\
+	     member_address_is_nonnull(pos, member);				\
+	     (pos) = tmp, tmp = member_address_is_nonnull(pos, member) ?	\
+		llist_entry((pos)->member.next, typeof(*pos), member) : NULL)
+
+#define __llist_for_each_entry_mutable2(pos, node, member)			\
+	__llist_for_each_entry_mutable_internal(pos, __UNIQUE_ID(next), node, member)
+
+#define __llist_for_each_entry_mutable3(pos, next, node, member)		\
+	llist_for_each_entry_safe(pos, next, node, member)
+
 /**
- * llist_for_each_entry_safe - iterate over some deleted entries of lock-less list of given type
- *			       safe against removal of list entry
+ * llist_for_each_entry_mutable - iterate over some deleted entries of
+ *				  lock-less list of given type safe against
+ *				  removal of list entry
  * @pos:	the type * to use as a loop cursor.
- * @n:		another type * to use as temporary storage
- * @node:	the first entry of deleted list entries.
- * @member:	the name of the llist_node with the struct.
+ * @...:	either (node, member) or (next, node, member)
+ *
+ * next:	another type * to use as optional temporary storage. The
+ *		temporary cursor is internal unless explicitly supplied by the
+ *		caller.
+ * node:	the first entry of deleted list entries.
+ * member:	the name of the llist_node with the struct.
  *
  * In general, some entries of the lock-less list can be traversed
  * safely only after being removed from list, so start with an entry
@@ -199,11 +250,9 @@ static inline bool llist_on_list(const struct llist_node *node)
  * you want to traverse from the oldest to the newest, you must
  * reverse the order by yourself before traversing.
  */
-#define llist_for_each_entry_safe(pos, n, node, member)			       \
-	for (pos = llist_entry((node), typeof(*pos), member);		       \
-	     member_address_is_nonnull(pos, member) &&			       \
-	        (n = llist_entry(pos->member.next, typeof(*n), member), true); \
-	     pos = n)
+#define llist_for_each_entry_mutable(pos, ...)				\
+	CONCATENATE(__llist_for_each_entry_mutable,			\
+		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
 
 /**
  * llist_empty - tests whether a lock-less list is empty
-- 
2.43.0



* [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-22  4:05 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

From: Kaitao Cheng <chengkaitao@kylinos.cn>

The list_for_each*_safe() helpers are used when the loop body may
remove the current entry.  Their API exposes the temporary cursor at
every call site, even though most users only need it for the iterator
implementation and never reference it in the loop body.

Add *_mutable() variants for list and hlist iteration.  The new helpers
support both forms: callers may keep passing an explicit temporary cursor
when they need to inspect or reset it, or omit it and let the helper use
a unique internal cursor.

This makes call sites that only mutate the list through the current entry
less noisy, while keeping the existing *_safe() helpers available for
compatibility.

Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 231 insertions(+), 38 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 09d979976b3b..1081def7cea9 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -7,6 +7,7 @@
 #include <linux/stddef.h>
 #include <linux/poison.h>
 #include <linux/const.h>
+#include <linux/args.h>
 
 #include <asm/barrier.h>
 
@@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
 #define list_for_each_prev(pos, head) \
 	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
 
-/**
- * list_for_each_safe - iterate over a list safe against removal of list entry
- * @pos:	the &struct list_head to use as a loop cursor.
- * @n:		another &struct list_head to use as temporary storage
- * @head:	the head for your list.
+/*
+ * list_for_each_safe is an old interface, use list_for_each_mutable instead.
  */
 #define list_for_each_safe(pos, n, head) \
 	for (pos = (head)->next, n = pos->next; \
 	     !list_is_head(pos, (head)); \
 	     pos = n, n = pos->next)
 
+#define __list_for_each_mutable_internal(pos, tmp, head)		\
+	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
+	     !list_is_head(pos, (head));				\
+	     pos = tmp, tmp = pos->next)
+
+#define __list_for_each_mutable1(pos, head)				\
+	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
+
+#define __list_for_each_mutable2(pos, next, head)			\
+	list_for_each_safe(pos, next, head)
+
 /**
- * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
+ * list_for_each_mutable - iterate over a list safe against entry removal
  * @pos:	the &struct list_head to use as a loop cursor.
- * @n:		another &struct list_head to use as temporary storage
- * @head:	the head for your list.
+ * @...:	either (head) or (next, head)
+ *
+ * next:	another &struct list_head to use as optional temporary storage.
+ *		The temporary cursor is internal unless explicitly supplied by
+ *		the caller.
+ * head:	the head for your list.
+ */
+#define list_for_each_mutable(pos, ...)					\
+	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
+		(pos, __VA_ARGS__)
+
+/*
+ * list_for_each_prev_safe is an old interface, use list_for_each_prev_mutable instead.
  */
 #define list_for_each_prev_safe(pos, n, head) \
 	for (pos = (head)->prev, n = pos->prev; \
 	     !list_is_head(pos, (head)); \
 	     pos = n, n = pos->prev)
 
+#define __list_for_each_prev_mutable_internal(pos, tmp, head)		\
+	for (typeof(pos) tmp = (pos = (head)->prev)->prev;		\
+	     !list_is_head(pos, (head));				\
+	     pos = tmp, tmp = pos->prev)
+
+#define __list_for_each_prev_mutable1(pos, head)			\
+	__list_for_each_prev_mutable_internal(pos, __UNIQUE_ID(prev), head)
+
+#define __list_for_each_prev_mutable2(pos, prev, head)			\
+	list_for_each_prev_safe(pos, prev, head)
+
+/**
+ * list_for_each_prev_mutable - iterate over a list backwards safe against entry removal
+ * @pos:	the &struct list_head to use as a loop cursor.
+ * @...:	either (head) or (prev, head)
+ *
+ * prev:	another &struct list_head to use as optional temporary storage.
+ *		The temporary cursor is internal unless explicitly supplied by
+ *		the caller.
+ * head:	the head for your list.
+ */
+#define list_for_each_prev_mutable(pos, ...)				\
+	CONCATENATE(__list_for_each_prev_mutable, COUNT_ARGS(__VA_ARGS__)) \
+		(pos, __VA_ARGS__)
+
 /**
  * list_count_nodes - count nodes in the list
  * @head:	the head for your list.
@@ -895,12 +940,8 @@ static inline size_t list_count_nodes(struct list_head *head)
 	for (; !list_entry_is_head(pos, head, member);			\
 	     pos = list_prev_entry(pos, member))
 
-/**
- * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
- * @pos:	the type * to use as a loop cursor.
- * @n:		another type * to use as temporary storage
- * @head:	the head for your list.
- * @member:	the name of the list_head within the struct.
+/*
+ * list_for_each_entry_safe is an old interface, use list_for_each_entry_mutable instead.
  */
 #define list_for_each_entry_safe(pos, n, head, member)			\
 	for (pos = list_first_entry(head, typeof(*pos), member),	\
@@ -908,15 +949,36 @@ static inline size_t list_count_nodes(struct list_head *head)
 	     !list_entry_is_head(pos, head, member); 			\
 	     pos = n, n = list_next_entry(n, member))
 
+#define __list_for_each_entry_mutable_internal(pos, tmp, head, member)	\
+	for (typeof(pos) tmp = list_next_entry(pos =			\
+		list_first_entry(head, typeof(*pos), member), member);	\
+	     !list_entry_is_head(pos, head, member);			\
+	     pos = tmp, tmp = list_next_entry(tmp, member))
+
+#define __list_for_each_entry_mutable2(pos, head, member)		\
+	__list_for_each_entry_mutable_internal(pos, __UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_mutable3(pos, next, head, member)		\
+	list_for_each_entry_safe(pos, next, head, member)
+
 /**
- * list_for_each_entry_safe_continue - continue list iteration safe against removal
+ * list_for_each_entry_mutable - iterate over a list safe against entry removal
  * @pos:	the type * to use as a loop cursor.
- * @n:		another type * to use as temporary storage
- * @head:	the head for your list.
- * @member:	the name of the list_head within the struct.
+ * @...:	either (head, member) or (next, head, member)
  *
- * Iterate over list of given type, continuing after current point,
- * safe against removal of list entry.
+ * next:	another type * to use as optional temporary storage. The
+ *		temporary cursor is internal unless explicitly supplied by the
+ *		caller.
+ * head:	the head for your list.
+ * member:	the name of the list_head within the struct.
+ */
+#define list_for_each_entry_mutable(pos, ...)				\
+	CONCATENATE(__list_for_each_entry_mutable, COUNT_ARGS(__VA_ARGS__)) \
+		(pos, __VA_ARGS__)
+
+/*
+ * list_for_each_entry_safe_continue is an old interface,
+ * use list_for_each_entry_mutable_continue instead.
  */
 #define list_for_each_entry_safe_continue(pos, n, head, member) 		\
 	for (pos = list_next_entry(pos, member), 				\
@@ -924,30 +986,79 @@ static inline size_t list_count_nodes(struct list_head *head)
 	     !list_entry_is_head(pos, head, member);				\
 	     pos = n, n = list_next_entry(n, member))
 
+#define __list_for_each_entry_mutable_continue_internal(pos, tmp, head, member) \
+	for (typeof(pos) tmp = list_next_entry(pos =			\
+		list_next_entry(pos, member), member);			\
+	     !list_entry_is_head(pos, head, member);			\
+	     pos = tmp, tmp = list_next_entry(tmp, member))
+
+#define __list_for_each_entry_mutable_continue2(pos, head, member)	\
+	__list_for_each_entry_mutable_continue_internal(pos,		\
+		__UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_mutable_continue3(pos, next, head, member) \
+	list_for_each_entry_safe_continue(pos, next, head, member)
+
 /**
- * list_for_each_entry_safe_from - iterate over list from current point safe against removal
+ * list_for_each_entry_mutable_continue - continue list iteration safe against removal
  * @pos:	the type * to use as a loop cursor.
- * @n:		another type * to use as temporary storage
- * @head:	the head for your list.
- * @member:	the name of the list_head within the struct.
+ * @...:	either (head, member) or (next, head, member)
  *
- * Iterate over list of given type from current point, safe against
- * removal of list entry.
+ * next:	another type * to use as optional temporary storage. The
+ *		temporary cursor is internal unless explicitly supplied by the
+ *		caller.
+ * head:	the head for your list.
+ * member:	the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, continuing after current point,
+ * safe against removal of list entry.
+ */
+#define list_for_each_entry_mutable_continue(pos, ...)			\
+	CONCATENATE(__list_for_each_entry_mutable_continue,		\
+		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
+/*
+ * list_for_each_entry_safe_from is an old interface,
+ * use list_for_each_entry_mutable_from instead.
  */
 #define list_for_each_entry_safe_from(pos, n, head, member) 			\
 	for (n = list_next_entry(pos, member);					\
 	     !list_entry_is_head(pos, head, member);				\
 	     pos = n, n = list_next_entry(n, member))
 
+#define __list_for_each_entry_mutable_from_internal(pos, tmp, head, member) \
+	for (typeof(pos) tmp = list_next_entry(pos, member);		\
+	     !list_entry_is_head(pos, head, member);			\
+	     pos = tmp, tmp = list_next_entry(tmp, member))
+
+#define __list_for_each_entry_mutable_from2(pos, head, member)		\
+	__list_for_each_entry_mutable_from_internal(pos,		\
+		__UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_mutable_from3(pos, next, head, member)	\
+	list_for_each_entry_safe_from(pos, next, head, member)
+
 /**
- * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
+ * list_for_each_entry_mutable_from - iterate over list from current point safe against removal
  * @pos:	the type * to use as a loop cursor.
- * @n:		another type * to use as temporary storage
- * @head:	the head for your list.
- * @member:	the name of the list_head within the struct.
+ * @...:	either (head, member) or (next, head, member)
  *
- * Iterate backwards over list of given type, safe against removal
- * of list entry.
+ * next:	another type * to use as optional temporary storage. The
+ *		temporary cursor is internal unless explicitly supplied by the
+ *		caller.
+ * head:	the head for your list.
+ * member:	the name of the list_head within the struct.
+ *
+ * Iterate over list of given type from current point, safe against
+ * removal of list entry.
+ */
+#define list_for_each_entry_mutable_from(pos, ...)			\
+	CONCATENATE(__list_for_each_entry_mutable_from,			\
+		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
+/*
+ * list_for_each_entry_safe_reverse is an old interface,
+ * use list_for_each_entry_mutable_reverse instead.
  */
 #define list_for_each_entry_safe_reverse(pos, n, head, member)		\
 	for (pos = list_last_entry(head, typeof(*pos), member),		\
@@ -955,6 +1066,37 @@ static inline size_t list_count_nodes(struct list_head *head)
 	     !list_entry_is_head(pos, head, member); 			\
 	     pos = n, n = list_prev_entry(n, member))
 
+#define __list_for_each_entry_mutable_reverse_internal(pos, tmp, head, member) \
+	for (typeof(pos) tmp = list_prev_entry(pos =			\
+		list_last_entry(head, typeof(*pos), member), member);	\
+	     !list_entry_is_head(pos, head, member);			\
+	     pos = tmp, tmp = list_prev_entry(tmp, member))
+
+#define __list_for_each_entry_mutable_reverse2(pos, head, member)	\
+	__list_for_each_entry_mutable_reverse_internal(pos,		\
+		__UNIQUE_ID(prev), head, member)
+
+#define __list_for_each_entry_mutable_reverse3(pos, prev, head, member)	\
+	list_for_each_entry_safe_reverse(pos, prev, head, member)
+
+/**
+ * list_for_each_entry_mutable_reverse - iterate backwards over list safe against removal
+ * @pos:	the type * to use as a loop cursor.
+ * @...:	either (head, member) or (prev, head, member)
+ *
+ * prev:	another type * to use as optional temporary storage. The
+ *		temporary cursor is internal unless explicitly supplied by the
+ *		caller.
+ * head:	the head for your list.
+ * member:	the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, safe against removal
+ * of list entry.
+ */
+#define list_for_each_entry_mutable_reverse(pos, ...)			\
+	CONCATENATE(__list_for_each_entry_mutable_reverse,		\
+		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
 /**
  * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
  * @pos:	the loop cursor used in the list_for_each_entry_safe loop
@@ -1189,6 +1331,31 @@ static inline void hlist_splice_init(struct hlist_head *from,
 	for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
 	     pos = n)
 
+#define __hlist_for_each_mutable_internal(pos, tmp, head)		\
+	for (typeof(pos) tmp = (pos = (head)->first) ? pos->next : NULL; \
+	     pos;							\
+	     pos = tmp, tmp = pos ? pos->next : NULL)
+
+#define __hlist_for_each_mutable1(pos, head)				\
+	__hlist_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
+
+#define __hlist_for_each_mutable2(pos, next, head)			\
+	hlist_for_each_safe(pos, next, head)
+
+/**
+ * hlist_for_each_mutable - iterate over a hlist safe against entry removal
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @...:	either (head) or (next, head)
+ *
+ * next:	another &struct hlist_node to use as optional temporary storage.
+ *		The temporary cursor is internal unless explicitly supplied by
+ *		the caller.
+ * head:	the head for your hlist.
+ */
+#define hlist_for_each_mutable(pos, ...)				\
+	CONCATENATE(__hlist_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
+		(pos, __VA_ARGS__)
+
 #define hlist_entry_safe(ptr, type, member) \
 	({ typeof(ptr) ____ptr = (ptr); \
 	   ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
@@ -1224,18 +1391,44 @@ static inline void hlist_splice_init(struct hlist_head *from,
 	for (; pos;							\
 	     pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
 
-/**
- * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
- * @pos:	the type * to use as a loop cursor.
- * @n:		a &struct hlist_node to use as temporary storage
- * @head:	the head for your list.
- * @member:	the name of the hlist_node within the struct.
+/*
+ * hlist_for_each_entry_safe is an old interface, use hlist_for_each_entry_mutable instead.
  */
 #define hlist_for_each_entry_safe(pos, n, head, member) 		\
 	for (pos = hlist_entry_safe((head)->first, typeof(*pos), member);\
 	     pos && ({ n = pos->member.next; 1; });			\
 	     pos = hlist_entry_safe(n, typeof(*pos), member))
 
+#define __hlist_for_each_entry_mutable_internal(pos, tmp, head, member)	\
+	for (struct hlist_node *tmp = (pos =				\
+		hlist_entry_safe((head)->first, typeof(*pos), member)) ? \
+		pos->member.next : NULL;				\
+	     pos;							\
+	     pos = hlist_entry_safe((tmp), typeof(*pos), member),	\
+		tmp = pos ? pos->member.next : NULL)
+
+#define __hlist_for_each_entry_mutable2(pos, head, member)		\
+	__hlist_for_each_entry_mutable_internal(pos,			\
+		__UNIQUE_ID(next), head, member)
+
+#define __hlist_for_each_entry_mutable3(pos, next, head, member)	\
+	hlist_for_each_entry_safe(pos, next, head, member)
+
+/**
+ * hlist_for_each_entry_mutable - iterate over hlist safe against entry removal
+ * @pos:	the type * to use as a loop cursor.
+ * @...:	either (head, member) or (next, head, member)
+ *
+ * next:	a &struct hlist_node to use as optional temporary storage. The
+ *		temporary cursor is internal unless explicitly supplied by the
+ *		caller.
+ * head:	the head for your hlist.
+ * member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_mutable(pos, ...)				\
+	CONCATENATE(__hlist_for_each_entry_mutable,			\
+		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
+
 /**
  * hlist_count_nodes - count nodes in the hlist
  * @head:	the head for your hlist.
-- 
2.43.0



* [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-22  4:05 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, chengkaitao

From: chengkaitao <chengkaitao@kylinos.cn>

The list_for_each*_safe() helpers are used when the loop body may remove
the current entry.  Their current interface, however, forces every caller
to define a temporary cursor outside the macro and pass it in, even when
the caller never uses that cursor directly.  For most call sites this
extra cursor is just boilerplate required by the macro implementation.

This is awkward because the saved next pointer is an internal detail of
the iteration.  Callers that only remove or move the current entry do not
need to spell it out.

The _safe() suffix has also caused confusion.  Christian König pointed
out that the name is easy to read as a thread-safe variant, especially
for beginners, even though it only means that the iterator keeps enough
state to tolerate removal of the current entry.  He suggested _mutable()
as a clearer description of what the loop permits.
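
To illustrate what "tolerate removal of the current entry" means, here is
a minimal userspace sketch.  It is not kernel code: struct demo_item and
the demo_*() helpers are hypothetical names made up for this example, and
the list is a plain NULL-terminated singly linked list.  The only point is
that the next pointer is saved before the body may free pos.  Nothing here
provides any protection against concurrent access.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical item type, for illustration only. */
struct demo_item {
	struct demo_item *next;
	int val;
};

/* Push a new item at the front and return the new first item. */
static struct demo_item *demo_push(struct demo_item *first, int val)
{
	struct demo_item *it = malloc(sizeof(*it));

	if (!it)
		abort();
	it->val = val;
	it->next = first;
	return it;
}

/* Free every item with a negative value; return how many were freed. */
static int demo_prune_negative(struct demo_item **first)
{
	struct demo_item **link = first;
	struct demo_item *pos, *next;
	int freed = 0;

	for (pos = *first; pos; pos = next) {
		/* Save the successor first: pos may be freed below. */
		next = pos->next;
		if (pos->val < 0) {
			*link = next;	/* unlink pos */
			free(pos);
			freed++;
		} else {
			link = &pos->next;
		}
	}
	return freed;
}

/* Count the items that are still on the list. */
static int demo_count(const struct demo_item *first)
{
	int count = 0;

	for (; first; first = first->next)
		count++;
	return count;
}
```

A plain "pos = pos->next" step would read freed memory after free(pos).
The saved pointer avoids that for the single-threaded case only, which is
the property the _safe() and _mutable() names both refer to.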

Add *_mutable() iterator variants for list, hlist and llist.  The new
helpers are variadic and support both forms.  In the common case, the
caller omits the temporary cursor and the macro creates a unique internal
cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
explicit temporary cursor, the caller can still pass it and the helper
keeps the existing *_safe() behaviour.

For example, a call site may use the shorter form:

  list_for_each_entry_mutable(pos, head, member)

or keep the explicit temporary cursor form:

  list_for_each_entry_mutable(pos, tmp, head, member)
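
The dispatch between the two forms above can be modelled in userspace
roughly as follows.  This is only a sketch under stated assumptions:
DEMO_COUNT_ARGS() and DEMO_CAT() are simplified stand-ins for the kernel's
COUNT_ARGS() and CONCATENATE(), DEMO_UNIQUE() uses __LINE__ as a rough
substitute for __UNIQUE_ID() (so two loops on one source line would
clash), __typeof__ assumes GCC or Clang, and struct demo_node and all
demo_*() names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for COUNT_ARGS() and CONCATENATE(). */
#define DEMO_NTH(_1, _2, _3, n, ...)	n
#define DEMO_COUNT_ARGS(...)		DEMO_NTH(__VA_ARGS__, 3, 2, 1, 0)
#define DEMO_CAT_(a, b)			a##b
#define DEMO_CAT(a, b)			DEMO_CAT_(a, b)
/* Rough stand-in for __UNIQUE_ID(): one hidden name per source line. */
#define DEMO_PASTE_(a, b)		a##b
#define DEMO_PASTE(a, b)		DEMO_PASTE_(a, b)
#define DEMO_UNIQUE(prefix)		DEMO_PASTE(prefix, __LINE__)

struct demo_node {
	struct demo_node *next, *prev;
	int val;
};

static void demo_init(struct demo_node *head)
{
	head->next = head->prev = head;
}

static void demo_add_tail(struct demo_node *n, struct demo_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void demo_del(struct demo_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;	/* stale links must not be followed */
}

/* Explicit-cursor form, same shape as list_for_each_safe(). */
#define demo_for_each_safe(pos, n, head)				\
	for (pos = (head)->next, n = pos->next;				\
	     pos != (head);						\
	     pos = n, n = pos->next)

/* Hidden-cursor form: tmp is declared inside the for statement. */
#define demo_for_each_mutable_internal(pos, tmp, head)			\
	for (__typeof__(pos) tmp = (pos = (head)->next)->next;		\
	     pos != (head);						\
	     pos = tmp, tmp = pos->next)

#define demo_for_each_mutable1(pos, head)				\
	demo_for_each_mutable_internal(pos, DEMO_UNIQUE(demo_next_), head)
#define demo_for_each_mutable2(pos, next, head)				\
	demo_for_each_safe(pos, next, head)

/* Select the ...1 or ...2 variant from the argument count after pos. */
#define demo_for_each_mutable(pos, ...)					\
	DEMO_CAT(demo_for_each_mutable, DEMO_COUNT_ARGS(__VA_ARGS__))	\
		(pos, __VA_ARGS__)

/* Short form: the caller never sees the saved next pointer. */
static int demo_remove_even(struct demo_node *head)
{
	struct demo_node *pos;
	int removed = 0;

	demo_for_each_mutable(pos, head) {
		if (pos->val % 2 == 0) {
			demo_del(pos);
			removed++;
		}
	}
	return removed;
}

/* Explicit form: the caller supplies the temporary cursor. */
static int demo_remove_odd(struct demo_node *head)
{
	struct demo_node *pos, *tmp;
	int removed = 0;

	demo_for_each_mutable(pos, tmp, head) {
		if (pos->val % 2 != 0) {
			demo_del(pos);
			removed++;
		}
	}
	return removed;
}

static int demo_sum(const struct demo_node *head)
{
	int sum = 0;

	for (const struct demo_node *p = head->next; p != head; p = p->next)
		sum += p->val;
	return sum;
}
```

In this sketch both call forms expand to the same traversal.  The only
difference is whether the saved pointer is a caller-visible variable or
one scoped to the for statement.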

The existing *_safe() helpers remain available for compatibility.  This
series only converts users in mm, block, kernel, init and io_uring.  If
this approach looks acceptable, the remaining users can be converted in
follow-up series.

Changes in v3 (Christian König, Andy Shevchenko):
- Convert safe list walks to mutable iterators

Changes in v2 (Muchun Song, Andy Shevchenko):
- Drop the list_for_each_entry_mutable*() helpers from v1 and make the
  cursor change directly in the existing list_for_each_entry*() helpers.
- Open-code special list walks that rely on updating the loop cursor in
  the body, preserving their existing traversal semantics.

Link to v2:
https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/

Link to v1:
https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/

Kaitao Cheng (7):
  list: Add mutable iterator variants
  llist: Add mutable iterator variants
  mm: Use mutable list iterators
  block: Use mutable list iterators
  kernel: Use mutable list iterators
  initramfs: Use mutable list iterator
  io_uring: Use mutable list iterators

 block/bfq-iosched.c                 |  17 +-
 block/blk-cgroup.c                  |  12 +-
 block/blk-flush.c                   |   4 +-
 block/blk-iocost.c                  |  18 +-
 block/blk-mq.c                      |   8 +-
 block/blk-throttle.c                |   4 +-
 block/kyber-iosched.c               |   4 +-
 block/partitions/ldm.c              |   8 +-
 block/sed-opal.c                    |   4 +-
 include/linux/list.h                | 269 ++++++++++++++++++++++++----
 include/linux/llist.h               |  81 +++++++--
 init/initramfs.c                    |   5 +-
 io_uring/cancel.c                   |   6 +-
 io_uring/poll.c                     |   3 +-
 io_uring/rw.c                       |   4 +-
 io_uring/timeout.c                  |   8 +-
 io_uring/uring_cmd.c                |   3 +-
 kernel/audit_tree.c                 |   4 +-
 kernel/audit_watch.c                |  16 +-
 kernel/auditfilter.c                |   4 +-
 kernel/auditsc.c                    |   4 +-
 kernel/bpf/arena.c                  |  10 +-
 kernel/bpf/arraymap.c               |   8 +-
 kernel/bpf/bpf_local_storage.c      |   3 +-
 kernel/bpf/bpf_lru_list.c           |  25 ++-
 kernel/bpf/btf.c                    |  18 +-
 kernel/bpf/cgroup.c                 |   7 +-
 kernel/bpf/cpumap.c                 |   4 +-
 kernel/bpf/devmap.c                 |  10 +-
 kernel/bpf/helpers.c                |   8 +-
 kernel/bpf/local_storage.c          |   4 +-
 kernel/bpf/memalloc.c               |  16 +-
 kernel/bpf/offload.c                |   8 +-
 kernel/bpf/states.c                 |   4 +-
 kernel/bpf/stream.c                 |   4 +-
 kernel/bpf/verifier.c               |   6 +-
 kernel/cgroup/cgroup-v1.c           |   4 +-
 kernel/cgroup/cgroup.c              |  54 +++---
 kernel/cgroup/dmem.c                |  12 +-
 kernel/cgroup/rdma.c                |   8 +-
 kernel/events/core.c                |  44 +++--
 kernel/events/uprobes.c             |  12 +-
 kernel/exit.c                       |   8 +-
 kernel/fail_function.c              |   4 +-
 kernel/gcov/clang.c                 |   4 +-
 kernel/irq_work.c                   |   4 +-
 kernel/kexec_core.c                 |   4 +-
 kernel/kprobes.c                    |  16 +-
 kernel/livepatch/core.c             |   4 +-
 kernel/livepatch/core.h             |   4 +-
 kernel/liveupdate/kho_block.c       |   4 +-
 kernel/liveupdate/luo_flb.c         |   4 +-
 kernel/locking/rwsem.c              |   2 +-
 kernel/locking/test-ww_mutex.c      |   2 +-
 kernel/module/main.c                |  11 +-
 kernel/padata.c                     |   4 +-
 kernel/power/snapshot.c             |   8 +-
 kernel/power/wakelock.c             |   4 +-
 kernel/printk/printk.c              |  11 +-
 kernel/ptrace.c                     |   4 +-
 kernel/rcu/rcutorture.c             |   3 +-
 kernel/rcu/tasks.h                  |   9 +-
 kernel/rcu/tree.c                   |   6 +-
 kernel/resource.c                   |   4 +-
 kernel/sched/core.c                 |   4 +-
 kernel/sched/ext.c                  |  22 +--
 kernel/sched/fair.c                 |  28 +--
 kernel/sched/topology.c             |   4 +-
 kernel/sched/wait.c                 |   4 +-
 kernel/seccomp.c                    |   4 +-
 kernel/signal.c                     |  11 +-
 kernel/smp.c                        |   4 +-
 kernel/taskstats.c                  |   8 +-
 kernel/time/clockevents.c           |   6 +-
 kernel/time/clocksource.c           |   4 +-
 kernel/time/posix-cpu-timers.c      |   4 +-
 kernel/time/posix-timers.c          |   3 +-
 kernel/torture.c                    |   3 +-
 kernel/trace/bpf_trace.c            |   4 +-
 kernel/trace/ftrace.c               |  49 +++--
 kernel/trace/ring_buffer.c          |  25 ++-
 kernel/trace/trace.c                |  12 +-
 kernel/trace/trace_dynevent.c       |   6 +-
 kernel/trace/trace_dynevent.h       |   5 +-
 kernel/trace/trace_events.c         |  35 ++--
 kernel/trace/trace_events_filter.c  |   4 +-
 kernel/trace/trace_events_hist.c    |   8 +-
 kernel/trace/trace_events_trigger.c |  17 +-
 kernel/trace/trace_events_user.c    |  16 +-
 kernel/trace/trace_stat.c           |   4 +-
 kernel/user-return-notifier.c       |   3 +-
 kernel/workqueue.c                  |  16 +-
 mm/backing-dev.c                    |   8 +-
 mm/balloon.c                        |   8 +-
 mm/cma.c                            |   4 +-
 mm/compaction.c                     |   4 +-
 mm/damon/core.c                     |   4 +-
 mm/damon/sysfs-schemes.c            |   4 +-
 mm/dmapool.c                        |   4 +-
 mm/huge_memory.c                    |   8 +-
 mm/hugetlb.c                        |  56 +++---
 mm/hugetlb_vmemmap.c                |  16 +-
 mm/khugepaged.c                     |  14 +-
 mm/kmemleak.c                       |   7 +-
 mm/ksm.c                            |  25 +--
 mm/list_lru.c                       |   4 +-
 mm/memcontrol-v1.c                  |   8 +-
 mm/memory-failure.c                 |  12 +-
 mm/memory-tiers.c                   |   4 +-
 mm/migrate.c                        |  23 ++-
 mm/mmu_notifier.c                   |   9 +-
 mm/page_alloc.c                     |   8 +-
 mm/page_reporting.c                 |   2 +-
 mm/percpu.c                         |  11 +-
 mm/pgtable-generic.c                |   4 +-
 mm/rmap.c                           |  10 +-
 mm/shmem.c                          |   9 +-
 mm/slab_common.c                    |  14 +-
 mm/slub.c                           |  33 ++--
 mm/swapfile.c                       |   4 +-
 mm/userfaultfd.c                    |  12 +-
 mm/vmalloc.c                        |  24 +--
 mm/vmscan.c                         |   7 +-
 mm/zsmalloc.c                       |   4 +-
 124 files changed, 875 insertions(+), 681 deletions(-)

-- 
2.43.0



* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Xuan Zhuo @ 2026-06-22  2:40 UTC (permalink / raw)
  To: Menglong Dong
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, virtualization, linux-kernel, eperezma
In-Reply-To: <20260616115912.513183-1-dongml2@chinatelecom.cn>

On Tue, 16 Jun 2026 19:59:12 +0800, Menglong Dong <menglong8.dong@gmail.com> wrote:
> For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net
> in the tx path for example: we set xsk_set_tx_need_wakeup() in
> virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> anywhere, which means the user will call send() for every packet.
>
> We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
> is empty, as we can't be wakeup by the skb_xmit_done() in this case.
> Otherwise, we will clear the wakeup flag.
>
> Race condition is considered for tx path.
>
> Fixes: 89f86675cb03 ("virtio_net: xsk: tx: support xmit xsk buffer")

This is not a bug fix, so the Fixes tag is not needed.
Besides, you are posting this to net-next.


> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> ---
> v3:
> - remove the confusing comment
>
> v2:
> - add the Fixes tag
> ---
>  drivers/net/virtio_net.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..6e099edef6e9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1440,8 +1440,9 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>  	struct virtnet_info *vi = sq->vq->vdev->priv;
>  	struct virtnet_sq_free_stats stats = {};
>  	struct net_device *dev = vi->dev;
> +	int sent, vring_size;
> +	bool need_wakeup;
>  	u64 kicks = 0;
> -	int sent;
>
>  	/* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
>  	 * free_old_xmit().
> @@ -1451,8 +1452,25 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>  	if (stats.xsk)
>  		xsk_tx_completed(sq->xsk_pool, stats.xsk);
>
> +	vring_size = virtqueue_get_vring_size(sq->vq);
> +	need_wakeup = xsk_uses_need_wakeup(pool);
> +
> +	if (need_wakeup && vring_size == sq->vq->num_free)
> +		xsk_set_tx_need_wakeup(pool);

Please add a comment explaining this check.


> +
>  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
>
> +	if (need_wakeup) {
> +		if (vring_size == sq->vq->num_free)
> +			/* we can't wake up by ourself, and it should be done
> +			 * by the user.
> +			 */
> +			xsk_set_tx_need_wakeup(pool);
> +		else
> +			/* we can wake up from skb_xmit_done() */
> +			xsk_clear_tx_need_wakeup(pool);
> +	}
> +
>  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
>  		check_sq_full_and_disable(vi, vi->dev, sq);


After addressing the comments above, you can add:

Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

Thanks.


>
> @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>  	u64_stats_add(&sq->stats.xdp_tx,  sent);
>  	u64_stats_update_end(&sq->stats.syncp);
>
> -	if (xsk_uses_need_wakeup(pool))
> -		xsk_set_tx_need_wakeup(pool);
> -
>  	return sent;
>  }
>
> --
> 2.54.0
>


* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Michael S. Tsirkin @ 2026-06-21 22:31 UTC (permalink / raw)
  To: Menglong Dong
  Cc: xuanzhuo, eperezma, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260616115912.513183-1-dongml2@chinatelecom.cn>

On Tue, Jun 16, 2026 at 07:59:12PM +0800, Menglong Dong wrote:
> For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net
> in the tx path for example: we set xsk_set_tx_need_wakeup() in
> virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> anywhere, which means the user will call send() for every packet.
> 
> We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
> is empty, as we can't be wakeup by the skb_xmit_done() in this case.
> Otherwise, we will clear the wakeup flag.
> 
> Race condition is considered for tx path.
> 
> Fixes: 89f86675cb03 ("virtio_net: xsk: tx: support xmit xsk buffer")
> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>

Thanks for the patch! Yes, this is something to improve.

> ---
> v3:
> - remove the confusing comment
> 
> v2:
> - add the Fixes tag
> ---
>  drivers/net/virtio_net.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..6e099edef6e9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1440,8 +1440,9 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>  	struct virtnet_info *vi = sq->vq->vdev->priv;
>  	struct virtnet_sq_free_stats stats = {};
>  	struct net_device *dev = vi->dev;
> +	int sent, vring_size;
> +	bool need_wakeup;
>  	u64 kicks = 0;
> -	int sent;
>  
>  	/* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
>  	 * free_old_xmit().
> @@ -1451,8 +1452,25 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>  	if (stats.xsk)
>  		xsk_tx_completed(sq->xsk_pool, stats.xsk);
>  
> +	vring_size = virtqueue_get_vring_size(sq->vq);
> +	need_wakeup = xsk_uses_need_wakeup(pool);
> +
> +	if (need_wakeup && vring_size == sq->vq->num_free)
> +		xsk_set_tx_need_wakeup(pool);
> +

Why are we doing this here?
Is the check after virtnet_xsk_xmit_batch() not enough?
I vaguely think this closes some kind of race?
Please add a comment to explain.

>  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
>  
> +	if (need_wakeup) {
> +		if (vring_size == sq->vq->num_free)
> +			/* we can't wake up by ourself, and it should be done
> +			 * by the user.
> +			 */
> +			xsk_set_tx_need_wakeup(pool);
> +		else
> +			/* we can wake up from skb_xmit_done() */
> +			xsk_clear_tx_need_wakeup(pool);

But what if we don't have tx napi, so there is no wakeup in skb_xmit_done()?


> +	}
> +
>  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
>  		check_sq_full_and_disable(vi, vi->dev, sq);
>  
> @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>  	u64_stats_add(&sq->stats.xdp_tx,  sent);
>  	u64_stats_update_end(&sq->stats.syncp);
>  
> -	if (xsk_uses_need_wakeup(pool))
> -		xsk_set_tx_need_wakeup(pool);
> -
>  	return sent;
>  }
>  
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Jakub Kicinski @ 2026-06-21 22:06 UTC (permalink / raw)
  To: xuanzhuo
  Cc: Menglong Dong, eperezma, mst, jasowang, andrew+netdev, davem,
	edumazet, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260616115912.513183-1-dongml2@chinatelecom.cn>

On Tue, 16 Jun 2026 19:59:12 +0800 Menglong Dong wrote:
> For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net
> in the tx path for example: we set xsk_set_tx_need_wakeup() in
> virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> anywhere, which means the user will call send() for every packet.
> 
> We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
> is empty, as we can't be wakeup by the skb_xmit_done() in this case.
> Otherwise, we will clear the wakeup flag.
> 
> Race condition is considered for tx path.

Seems to follow what mlx5 does so presumably this is fine but IDK if
there's anything virtio-specific that we need to be worried about.

Xuan Zhuo, please TAL?
-- 
mping: VIRTIO NET DRIVER

^ permalink raw reply

* [PATCH v6 12/12] nvdimm: virtio_pmem: drain requests in freeze
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

virtio_pmem_freeze() currently deletes virtqueues and resets the device
without waking threads waiting for a virtqueue descriptor or a host
completion.

Mark the request virtqueue broken before reset. This makes new submissions
fail fast and lets -ENOSPC waiters leave the wait list. Reset the device
before draining used and unused request tokens, then delete the virtqueues.
This wakes waiters with -EIO. It also keeps the detach call on a quiesced
device.

Clear req_vq after del_vqs(), and make drain tolerate a NULL queue, so
remove after freeze does not dereference a stale virtqueue pointer.
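
A minimal user-space sketch of the idea described above, not the driver
code: teardown marks the device broken, drains, then deletes the queue and
clears the pointer, so a second teardown (remove after freeze) finds NULL
and returns early. All demo_* names and fields are hypothetical, and the
device reset step is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the driver's types. */
struct demo_vq {
	int tokens;	/* requests still owned by the queue */
};

struct demo_pmem {
	struct demo_vq *req_vq;	/* NULL once the queue has been deleted */
	bool broken;
	int failed;		/* requests completed with an error */
};

/* Drain tolerates a missing queue, so it is safe to call twice. */
static void demo_drain(struct demo_pmem *p)
{
	if (!p->req_vq)
		return;
	p->failed += p->req_vq->tokens;
	p->req_vq->tokens = 0;
}

/* Delete the queue and clear the pointer so later callers see NULL. */
static void demo_del_vqs(struct demo_pmem *p)
{
	if (!p->req_vq)
		return;
	free(p->req_vq);
	p->req_vq = NULL;
}

/* Order used in this sketch: mark broken, (reset), drain, delete. */
static void demo_teardown(struct demo_pmem *p)
{
	p->broken = true;
	/* a real device would be reset here, before draining */
	demo_drain(p);
	demo_del_vqs(p);
}

/* Freeze followed by remove; returns how many requests were failed. */
int demo_freeze_then_remove(int inflight)
{
	struct demo_pmem p = { 0 };

	p.req_vq = calloc(1, sizeof(*p.req_vq));
	assert(p.req_vq);
	p.req_vq->tokens = inflight;

	demo_teardown(&p);	/* freeze */
	demo_teardown(&p);	/* remove: must not touch the freed queue */
	return p.failed;
}
```

Each in-flight request is failed exactly once, and the second teardown is
a no-op because both helpers check the pointer first.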

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Clear req_vq after del_vqs() and make drain tolerate a NULL queue.
Changes in v5:
- Reset the device before draining used and unused request tokens.
- Use the split broken-marking and post-reset drain helpers.
v2->v3:
- No change.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c   |  3 +++
 drivers/nvdimm/virtio_pmem.c | 36 +++++++++++++++++++++++++++++++-----
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index fb9391ebc46e7..ce4032dc07628 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -93,6 +93,9 @@ void virtio_pmem_drain(struct virtio_pmem *vpmem)
 	struct virtio_pmem_request *req;
 	unsigned int len;
 
+	if (!vpmem->req_vq)
+		return;
+
 	while ((req = virtqueue_get_buf(vpmem->req_vq, &len)) != NULL) {
 		virtio_pmem_clear_inflight(vpmem, req);
 		virtio_pmem_complete_err(req);
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index b272e9279ef23..fef792f725db2 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -17,11 +17,16 @@ static struct virtio_device_id id_table[] = {
  /* Initialize virt queue */
 static int init_vq(struct virtio_pmem *vpmem)
 {
+	int err;
+
 	/* single vq */
 	vpmem->req_vq = virtio_find_single_vq(vpmem->vdev,
 					virtio_pmem_host_ack, "flush_queue");
-	if (IS_ERR(vpmem->req_vq))
-		return PTR_ERR(vpmem->req_vq);
+	if (IS_ERR(vpmem->req_vq)) {
+		err = PTR_ERR(vpmem->req_vq);
+		vpmem->req_vq = NULL;
+		return err;
+	}
 
 	spin_lock_init(&vpmem->pmem_lock);
 	INIT_LIST_HEAD(&vpmem->req_list);
@@ -31,6 +36,15 @@ static int init_vq(struct virtio_pmem *vpmem)
 	return 0;
 };
 
+static void virtio_pmem_del_vqs(struct virtio_pmem *vpmem)
+{
+	if (!vpmem->req_vq)
+		return;
+
+	vpmem->vdev->config->del_vqs(vpmem->vdev);
+	vpmem->req_vq = NULL;
+}
+
 static int virtio_pmem_validate(struct virtio_device *vdev)
 {
 	struct virtio_shm_region shm_reg;
@@ -132,7 +146,7 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 	virtio_reset_device(vdev);
 	nvdimm_bus_unregister(vpmem->nvdimm_bus);
 out_vq:
-	vdev->config->del_vqs(vdev);
+	virtio_pmem_del_vqs(vpmem);
 out_err:
 	return err;
 }
@@ -154,14 +168,26 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 	nvdimm_bus_unregister(nvdimm_bus);
-	vdev->config->del_vqs(vdev);
+	virtio_pmem_del_vqs(vpmem);
 }
 
 static int virtio_pmem_freeze(struct virtio_device *vdev)
 {
-	vdev->config->del_vqs(vdev);
+	struct virtio_pmem *vpmem = vdev->priv;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_mark_broken(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
 	virtio_reset_device(vdev);
 
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_drain(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	virtio_pmem_del_vqs(vpmem);
+
 	return 0;
 }
 
-- 
2.52.0

^ permalink raw reply related

* [PATCH v6 11/12] nvdimm: virtio_pmem: converge broken virtqueue to -EIO
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

dmesg reports virtqueue failure and device reset:

  virtio_pmem virtio2: failed to send command to virtio pmem device, no free slots in the virtqueue
  virtio_pmem virtio2: virtio pmem device needs a reset

virtio_pmem_flush() can wait for a free virtqueue descriptor (-ENOSPC).
It can also wait for host completion. If the request virtqueue breaks,
those waiters may never make progress. One example is notify failure from
virtqueue_kick().

Track a device-level broken state and converge the failure to -EIO. New
requests fail fast, -ENOSPC waiters are unlinked and woken, and the
currently submitted request is woken so its host_acked waiter can return
without waiting forever for host completion. Completed requests are forced
to report an error after the queue is marked broken.
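
The convergence rule above can be sketched as a tiny user-space state
model. This is an illustration under simplifying assumptions, not the
driver logic: waiters are just a counter, there is no locking, and the
demo_* names are hypothetical. Completion is modelled as returning -EIO
directly once the device is broken.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state; not the driver's struct virtio_pmem. */
struct demo_dev {
	bool broken;	/* sticky device-level failure flag */
	int free_slots;	/* descriptors left in the queue */
	int waiters;	/* submitters parked waiting for a slot */
};

/* Marking broken is idempotent and releases every parked waiter.
 * Returns the number of waiters that were released by this call. */
int demo_mark_broken(struct demo_dev *dev)
{
	int woken = dev->waiters;

	dev->broken = true;
	dev->waiters = 0;
	return woken;
}

/* One submission attempt: 0 if queued, -ENOSPC if the caller must wait,
 * -EIO if the device is already broken (fail fast). */
int demo_submit(struct demo_dev *dev)
{
	if (dev->broken)
		return -EIO;
	if (dev->free_slots == 0) {
		dev->waiters++;
		return -ENOSPC;
	}
	dev->free_slots--;
	return 0;
}

/* Result a completed request reports: forced to an error once broken. */
int demo_complete(const struct demo_dev *dev, int host_ret)
{
	return dev->broken ? -EIO : host_ret;
}
```

Whatever state a request is in when the flag is set (not yet submitted,
parked on -ENOSPC, or completed late), the caller ends up with -EIO.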

Do not detach unused buffers from an active virtqueue. Runtime
broken-queue handling only stops new submissions and wakes local waiters.
Removal resets the device first. It then drains request tokens. After
that, the device no longer owns the buffers when the virtqueue reference
is dropped.

Closes: https://lore.kernel.org/r/202512250116.ewtzlD0g-lkp@intel.com/
Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Wake the in-flight host-completion waiter when marking the queue broken.
- Track req_inflight and clear it on completion/drain paths.
- Return -EIO if the queue breaks before a host response is observed.
Changes in v5:
- Split broken marking from token draining.
- Do not call virtqueue_detach_unused_buf() on an active queue.
- Reset the device before draining tokens in remove().
- Do not let the host-completion wait return only because the device is
  marked broken.
v2->v3:
- Add raw dmesg excerpt to the patch description.
- Drop timestamps from the embedded dmesg.
- Fold the CONFIG_VIRTIO_PMEM=m export fix into this patch.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
- Use kmalloc_obj(*req_data) at the allocation site to match current nvdimm
  code.

 drivers/nvdimm/nd_virtio.c   | 117 +++++++++++++++++++++++++++++++----
 drivers/nvdimm/virtio_pmem.c |  15 ++++-
 drivers/nvdimm/virtio_pmem.h |   8 +++
 3 files changed, 127 insertions(+), 13 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 35d36bd36a526..fb9391ebc46e7 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -30,6 +30,12 @@ static bool virtio_pmem_req_done(struct virtio_pmem_request *req)
 	return smp_load_acquire(&req->done);
 }
 
+static void virtio_pmem_complete_err(struct virtio_pmem_request *req)
+{
+	req->resp.ret = cpu_to_le32(1);
+	virtio_pmem_signal_done(req);
+}
+
 static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 {
 	struct virtio_pmem_request *req_buf;
@@ -44,6 +50,63 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 	wake_up(&req_buf->wq_buf);
 }
 
+static void virtio_pmem_wake_all_waiters(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req, *tmp;
+
+	list_for_each_entry_safe(req, tmp, &vpmem->req_list, list) {
+		list_del_init(&req->list);
+		WRITE_ONCE(req->wq_buf_avail, true);
+		wake_up(&req->wq_buf);
+	}
+}
+
+static void virtio_pmem_clear_inflight(struct virtio_pmem *vpmem,
+				       struct virtio_pmem_request *req)
+{
+	if (vpmem->req_inflight == req)
+		vpmem->req_inflight = NULL;
+}
+
+static void virtio_pmem_wake_inflight(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req = vpmem->req_inflight;
+
+	if (req)
+		wake_up(&req->host_acked);
+}
+
+void virtio_pmem_mark_broken(struct virtio_pmem *vpmem)
+{
+	if (!READ_ONCE(vpmem->broken)) {
+		WRITE_ONCE(vpmem->broken, true);
+		dev_err_once(&vpmem->vdev->dev, "virtqueue is broken\n");
+	}
+
+	virtio_pmem_wake_inflight(vpmem);
+	virtio_pmem_wake_all_waiters(vpmem);
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_mark_broken);
+
+void virtio_pmem_drain(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req;
+	unsigned int len;
+
+	while ((req = virtqueue_get_buf(vpmem->req_vq, &len)) != NULL) {
+		virtio_pmem_clear_inflight(vpmem, req);
+		virtio_pmem_complete_err(req);
+		kref_put(&req->kref, virtio_pmem_req_release);
+	}
+
+	while ((req = virtqueue_detach_unused_buf(vpmem->req_vq)) != NULL) {
+		virtio_pmem_clear_inflight(vpmem, req);
+		virtio_pmem_complete_err(req);
+		kref_put(&req->kref, virtio_pmem_req_release);
+	}
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_drain);
+
  /* The interrupt handler */
 void virtio_pmem_host_ack(struct virtqueue *vq)
 {
@@ -54,8 +117,12 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
+		virtio_pmem_clear_inflight(vpmem, req_data);
 		virtio_pmem_wake_one_waiter(vpmem);
-		virtio_pmem_signal_done(req_data);
+		if (READ_ONCE(vpmem->broken))
+			virtio_pmem_complete_err(req_data);
+		else
+			virtio_pmem_signal_done(req_data);
 		kref_put(&req_data->kref, virtio_pmem_req_release);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -83,6 +150,9 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		return -EIO;
 	}
 
+	if (READ_ONCE(vpmem->broken))
+		return -EIO;
+
 	req_data = kmalloc_obj(*req_data, GFP_NOIO);
 	if (!req_data)
 		return -ENOMEM;
@@ -99,13 +169,18 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	sgs[1] = &ret;
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
-	 /*
-	  * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
-	  * queue does not have free descriptor. We add the request
-	  * to req_list and wait for host_ack to wake us up when free
-	  * slots are available.
-	  */
+	/*
+	 * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
+	 * queue does not have free descriptor. We add the request
+	 * to req_list and wait for host_ack to wake us up when free
+	 * slots are available.
+	 */
 	for (;;) {
+		if (READ_ONCE(vpmem->broken)) {
+			err = -EIO;
+			break;
+		}
+
 		err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
 					GFP_ATOMIC);
 		if (!err) {
@@ -114,6 +189,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 			 * held so completion cannot run concurrently.
 			 */
 			kref_get(&req_data->kref);
+			vpmem->req_inflight = req_data;
 			break;
 		}
 
@@ -127,24 +203,41 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
+		wait_event(req_data->wq_buf,
+			   READ_ONCE(req_data->wq_buf_avail) ||
+			   READ_ONCE(vpmem->broken));
 		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+
+		if (READ_ONCE(vpmem->broken))
+			break;
 	}
 
-	err1 = virtqueue_kick(vpmem->req_vq);
+	if (err == -EIO || virtqueue_is_broken(vpmem->req_vq))
+		virtio_pmem_mark_broken(vpmem);
+
+	err1 = true;
+	if (!err && !READ_ONCE(vpmem->broken)) {
+		err1 = virtqueue_kick(vpmem->req_vq);
+		if (!err1)
+			virtio_pmem_mark_broken(vpmem);
+	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 	/*
 	 * virtqueue_add_sgs failed with error different than -ENOSPC, we can't
 	 * do anything about that.
 	 */
-	if (err || !err1) {
+	if (READ_ONCE(vpmem->broken) || err || !err1) {
 		dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
 		err = -EIO;
 	} else {
 		/* A host response results in "host_ack" getting called */
 		wait_event(req_data->host_acked,
-			   virtio_pmem_req_done(req_data));
-		err = le32_to_cpu(req_data->resp.ret);
+			   virtio_pmem_req_done(req_data) ||
+			   READ_ONCE(vpmem->broken));
+		if (virtio_pmem_req_done(req_data))
+			err = le32_to_cpu(req_data->resp.ret);
+		else
+			err = -EIO;
 	}
 
 	kref_put(&req_data->kref, virtio_pmem_req_release);
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 77b1966619059..b272e9279ef23 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -25,6 +25,8 @@ static int init_vq(struct virtio_pmem *vpmem)
 
 	spin_lock_init(&vpmem->pmem_lock);
 	INIT_LIST_HEAD(&vpmem->req_list);
+	vpmem->req_inflight = NULL;
+	WRITE_ONCE(vpmem->broken, false);
 
 	return 0;
 };
@@ -138,10 +140,21 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 static void virtio_pmem_remove(struct virtio_device *vdev)
 {
 	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+	struct virtio_pmem *vpmem = vdev->priv;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_mark_broken(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	virtio_reset_device(vdev);
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_drain(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 	nvdimm_bus_unregister(nvdimm_bus);
 	vdev->config->del_vqs(vdev);
-	virtio_reset_device(vdev);
 }
 
 static int virtio_pmem_freeze(struct virtio_device *vdev)
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index 23bff40249c1b..bc7de2b328985 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -52,6 +52,12 @@ struct virtio_pmem {
 	/* List to store deferred work if virtqueue is full */
 	struct list_head req_list;
 
+	/* Request currently owned by the virtqueue. */
+	struct virtio_pmem_request *req_inflight;
+
+	/* Fail fast and wake waiters if the request virtqueue is broken. */
+	bool broken;
+
 	/* Synchronize virtqueue data */
 	spinlock_t pmem_lock;
 
@@ -61,5 +67,7 @@ struct virtio_pmem {
 };
 
 void virtio_pmem_host_ack(struct virtqueue *vq);
+void virtio_pmem_mark_broken(struct virtio_pmem *vpmem);
+void virtio_pmem_drain(struct virtio_pmem *vpmem);
 int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);
 #endif
-- 
2.52.0

^ permalink raw reply related

* [PATCH v6 10/12] nvdimm: virtio_pmem: isolate DMA request buffers
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

The virtio-pmem request object stores wait queues, flags, and list
pointers next to buffers mapped for virtqueue DMA. The response buffer is
mapped DMA_FROM_DEVICE, so non-coherent DMA invalidation must not share a
cache line with CPU-owned fields.

Keep the request buffer outside the DMA-from-device group and wrap only
the response buffer with __dma_from_device_group_begin/end.
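
A user-space sketch of the layout idea above, using C11 alignas in place
of the kernel macros. The 64-byte line size, the struct names and the
field set are assumptions made for illustration; the kernel macros may
pick a different alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LINE 64	/* assumed cache-line size for this sketch */

/* The device-written response, padded out to whole cache lines. */
struct demo_resp_group {
	alignas(DEMO_LINE) uint32_t ret;
};

/* Hypothetical request layout: CPU-owned fields first, response isolated. */
struct demo_request {
	int refs;
	bool done;
	uint32_t req_type;		/* device only reads this */
	struct demo_resp_group resp;	/* device writes this */
	bool after;			/* CPU-owned field following the group */
};

/* True if byte ranges [a, a+alen) and [b, b+blen) touch a common line. */
bool demo_shares_line(size_t a, size_t alen, size_t b, size_t blen)
{
	size_t a_first = a / DEMO_LINE, a_last = (a + alen - 1) / DEMO_LINE;
	size_t b_first = b / DEMO_LINE, b_last = (b + blen - 1) / DEMO_LINE;

	return a_first <= b_last && b_first <= a_last;
}

/* True if the response shares no cache line with its CPU-owned neighbours. */
bool demo_resp_isolated(void)
{
	size_t resp = offsetof(struct demo_request, resp);
	size_t rlen = sizeof(uint32_t);

	return !demo_shares_line(resp, rlen,
				 offsetof(struct demo_request, done), sizeof(bool)) &&
	       !demo_shares_line(resp, rlen,
				 offsetof(struct demo_request, after), sizeof(bool));
}
```

The alignment pushes the response to the start of a line, and the padded
size of the wrapper keeps the following field off that line.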

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch.

 drivers/nvdimm/virtio_pmem.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index 1017e498c9b4c..23bff40249c1b 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -10,6 +10,7 @@
 #ifndef _LINUX_VIRTIO_PMEM_H
 #define _LINUX_VIRTIO_PMEM_H
 
+#include <linux/dma-mapping.h>
 #include <linux/module.h>
 #include <uapi/linux/virtio_pmem.h>
 #include <linux/kref.h>
@@ -19,8 +20,6 @@
 
 struct virtio_pmem_request {
 	struct kref kref;
-	struct virtio_pmem_req req;
-	struct virtio_pmem_resp resp;
 
 	/* Wait queue to process deferred work after ack from host */
 	wait_queue_head_t host_acked;
@@ -30,6 +29,11 @@ struct virtio_pmem_request {
 	wait_queue_head_t wq_buf;
 	bool wq_buf_avail;
 	struct list_head list;
+
+	struct virtio_pmem_req req;
+	__dma_from_device_group_begin(resp);
+	struct virtio_pmem_resp resp;
+	__dma_from_device_group_end(resp);
 };
 
 struct virtio_pmem {
-- 
2.52.0

^ permalink raw reply related

* [PATCH v6 09/12] nvdimm: virtio_pmem: publish done with release/acquire
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

virtio_pmem_host_ack() publishes the device response by setting done and
waking the submitter. The submitter reads resp.ret after wait_event()
observes done.

Use smp_store_release() on done and smp_load_acquire() in the wait
condition so the response read is ordered after completion.
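
The pairing described above can be sketched in user space with C11
atomics standing in for smp_store_release() and smp_load_acquire(). The
demo_* names are hypothetical. The sketch runs the completer and the
waiter back to back in one thread, so it shows the shape of the helpers
only; the ordering guarantee matters when they run on different CPUs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request: a plain payload plus a "done" flag. */
struct demo_req {
	uint32_t ret;		/* written by the completer, plain store */
	atomic_bool done;	/* published with release, read with acquire */
};

/* Completer side: write the payload, then publish it. */
void demo_signal_done(struct demo_req *req, uint32_t ret)
{
	req->ret = ret;
	/* Release: the store to ret is ordered before done becomes true. */
	atomic_store_explicit(&req->done, true, memory_order_release);
}

/* Waiter side: pairs with the release store in demo_signal_done(). */
bool demo_req_done(struct demo_req *req)
{
	return atomic_load_explicit(&req->done, memory_order_acquire);
}

/* Returns the response, or -1 if the request has not completed yet. */
int64_t demo_read_response(struct demo_req *req)
{
	if (!demo_req_done(req))
		return -1;
	return req->ret;	/* safe: ordered after the acquire load */
}

/* Checks "not done yet" first, then completes and reads the payload. */
int64_t demo_roundtrip(uint32_t host_ret)
{
	struct demo_req req = { .ret = 0 };

	atomic_init(&req.done, false);
	if (demo_read_response(&req) != -1)
		return -2;
	demo_signal_done(&req, host_ret);
	return demo_read_response(&req);
}
```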

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch.

 drivers/nvdimm/nd_virtio.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 7b6761adf28bc..35d36bd36a526 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -17,6 +17,19 @@ static void virtio_pmem_req_release(struct kref *kref)
 	kfree(req);
 }
 
+static void virtio_pmem_signal_done(struct virtio_pmem_request *req)
+{
+	/* Pairs with smp_load_acquire() in virtio_pmem_req_done(). */
+	smp_store_release(&req->done, true);
+	wake_up(&req->host_acked);
+}
+
+static bool virtio_pmem_req_done(struct virtio_pmem_request *req)
+{
+	/* Pairs with smp_store_release() in virtio_pmem_signal_done(). */
+	return smp_load_acquire(&req->done);
+}
+
 static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 {
 	struct virtio_pmem_request *req_buf;
@@ -42,8 +55,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
 		virtio_pmem_wake_one_waiter(vpmem);
-		WRITE_ONCE(req_data->done, true);
-		wake_up(&req_data->host_acked);
+		virtio_pmem_signal_done(req_data);
 		kref_put(&req_data->kref, virtio_pmem_req_release);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -130,7 +142,8 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		err = -EIO;
 	} else {
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->host_acked, READ_ONCE(req_data->done));
+		wait_event(req_data->host_acked,
+			   virtio_pmem_req_done(req_data));
 		err = le32_to_cpu(req_data->resp.ret);
 	}
 
-- 
2.52.0

^ permalink raw reply related

* [PATCH v6 08/12] nvdimm: virtio_pmem: refcount requests for token lifetime
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, stable, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

KASAN reports slab-use-after-free in __wake_up_common():
BUG: KASAN: slab-use-after-free in __wake_up_common+0x114/0x160
Read of size 8 at addr ffff88810fdcb710 by task swapper/0/0

CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
6.19.0-next-20260220-00006-g1eae5f204ec3 #4 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux
1.17.0-2-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x6d/0xb0
 print_report+0x170/0x4e2
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? __virt_addr_valid+0x1dc/0x380
 kasan_report+0xbc/0xf0
 ? __wake_up_common+0x114/0x160
 ? __wake_up_common+0x114/0x160
 __wake_up_common+0x114/0x160
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 __wake_up+0x36/0x60
 virtio_pmem_host_ack+0x11d/0x3b0
 ? sched_balance_domains+0x29f/0xb00
 ? __pfx_virtio_pmem_host_ack+0x10/0x10
 ? _raw_spin_lock_irqsave+0x98/0x100
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 vring_interrupt+0x1c9/0x5e0
 ? __pfx_vp_interrupt+0x10/0x10
 vp_vring_interrupt+0x87/0x100
 ? __pfx_vp_interrupt+0x10/0x10
 __handle_irq_event_percpu+0x17f/0x550
 ? __pfx__raw_spin_lock+0x10/0x10
 handle_irq_event+0xab/0x1c0
 handle_fasteoi_irq+0x276/0xae0
 __common_interrupt+0x65/0x130
 common_interrupt+0x78/0xa0
 </IRQ>

virtio_pmem_host_ack() wakes a request that has already been freed by the
submitter.

This happens when the request token is still reachable via the virtqueue,
but virtio_pmem_flush() returns and frees it.

Fix the token lifetime by refcounting struct virtio_pmem_request.
virtio_pmem_flush() holds a submitter reference, and the virtqueue holds an
extra reference once the request is queued. The completion path drops the
virtqueue reference, and the submitter drops its reference before
returning.
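
A user-space sketch of the two-reference scheme described above, with an
atomic counter standing in for struct kref. The demo_* names are
hypothetical and the sketch is single-threaded. It models the failing
case from the report: the submitter returns before the completion
arrives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the request object; not the driver's type. */
struct demo_req {
	atomic_int refs;	/* plays the role of struct kref */
	bool done;
};

int demo_live;	/* number of live objects, for observation only */

static struct demo_req *demo_req_alloc(void)
{
	struct demo_req *req = calloc(1, sizeof(*req));

	if (!req)
		return NULL;
	atomic_init(&req->refs, 1);	/* submitter reference */
	demo_live++;
	return req;
}

static void demo_req_get(struct demo_req *req)
{
	atomic_fetch_add(&req->refs, 1);	/* queue reference */
}

/* Returns true when this put dropped the last reference and freed it. */
static bool demo_req_put(struct demo_req *req)
{
	if (atomic_fetch_sub(&req->refs, 1) != 1)
		return false;
	demo_live--;
	free(req);
	return true;
}

/* Bit 0: freed by the submitter's put. Bit 1: freed by the completion's. */
int demo_late_completion(void)
{
	struct demo_req *req = demo_req_alloc();
	bool by_submitter, by_completion;

	assert(req);
	demo_req_get(req);			/* request queued */
	by_submitter = demo_req_put(req);	/* submitter returns early */
	req->done = true;			/* late completion: still valid */
	by_completion = demo_req_put(req);
	return (by_submitter ? 1 : 0) | (by_completion ? 2 : 0);
}
```

The object is freed by whichever side drops the last reference, so the
late completion never touches freed memory.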

Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Add raw KASAN report to the patch description.
- Drop timestamps from the embedded report.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c   | 34 +++++++++++++++++++++++++++++-----
 drivers/nvdimm/virtio_pmem.h |  2 ++
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index da829e9f4bdff..7b6761adf28bc 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -9,6 +9,14 @@
 #include "virtio_pmem.h"
 #include "nd.h"
 
+static void virtio_pmem_req_release(struct kref *kref)
+{
+	struct virtio_pmem_request *req;
+
+	req = container_of(kref, struct virtio_pmem_request, kref);
+	kfree(req);
+}
+
 static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 {
 	struct virtio_pmem_request *req_buf;
@@ -36,6 +44,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 		virtio_pmem_wake_one_waiter(vpmem);
 		WRITE_ONCE(req_data->done, true);
 		wake_up(&req_data->host_acked);
+		kref_put(&req_data->kref, virtio_pmem_req_release);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 }
@@ -66,6 +75,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	if (!req_data)
 		return -ENOMEM;
 
+	kref_init(&req_data->kref);
 	WRITE_ONCE(req_data->done, false);
 	init_waitqueue_head(&req_data->host_acked);
 	init_waitqueue_head(&req_data->wq_buf);
@@ -83,10 +93,23 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	  * to req_list and wait for host_ack to wake us up when free
 	  * slots are available.
 	  */
-	while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
-					GFP_ATOMIC)) == -ENOSPC) {
-
-		dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
+	for (;;) {
+		err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
+					GFP_ATOMIC);
+		if (!err) {
+			/*
+			 * Take the virtqueue reference while @pmem_lock is
+			 * held so completion cannot run concurrently.
+			 */
+			kref_get(&req_data->kref);
+			break;
+		}
+
+		if (err != -ENOSPC)
+			break;
+
+		dev_info_ratelimited(&vdev->dev,
+				     "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
 		WRITE_ONCE(req_data->wq_buf_avail, false);
 		list_add_tail(&req_data->list, &vpmem->req_list);
 		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -95,6 +118,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
 		spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	}
+
 	err1 = virtqueue_kick(vpmem->req_vq);
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 	/*
@@ -110,7 +134,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		err = le32_to_cpu(req_data->resp.ret);
 	}
 
-	kfree(req_data);
+	kref_put(&req_data->kref, virtio_pmem_req_release);
 	return err;
 };
 
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index f72cf17f9518f..1017e498c9b4c 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -12,11 +12,13 @@
 
 #include <linux/module.h>
 #include <uapi/linux/virtio_pmem.h>
+#include <linux/kref.h>
 #include <linux/libnvdimm.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 
 struct virtio_pmem_request {
+	struct kref kref;
 	struct virtio_pmem_req req;
 	struct virtio_pmem_resp resp;
 
-- 
2.52.0

^ permalink raw reply related

* [PATCH v6 07/12] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
wq_buf_avail). They are observed by waiters without pmem_lock, so make
the accesses explicit single loads/stores and avoid compiler
reordering/caching across the wait/wake paths.

Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Split out READ_ONCE()/WRITE_ONCE() updates from patch 3/7 (no functional
  change intended).
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 8ed4d6b3a9284..da829e9f4bdff 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -18,9 +18,9 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 
 	req_buf = list_first_entry(&vpmem->req_list,
 				   struct virtio_pmem_request, list);
-	req_buf->wq_buf_avail = true;
+	list_del_init(&req_buf->list);
+	WRITE_ONCE(req_buf->wq_buf_avail, true);
 	wake_up(&req_buf->wq_buf);
-	list_del(&req_buf->list);
 }
 
  /* The interrupt handler */
@@ -34,7 +34,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
 		virtio_pmem_wake_one_waiter(vpmem);
-		req_data->done = true;
+		WRITE_ONCE(req_data->done, true);
 		wake_up(&req_data->host_acked);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -66,7 +66,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	if (!req_data)
 		return -ENOMEM;
 
-	req_data->done = false;
+	WRITE_ONCE(req_data->done, false);
 	init_waitqueue_head(&req_data->host_acked);
 	init_waitqueue_head(&req_data->wq_buf);
 	INIT_LIST_HEAD(&req_data->list);
@@ -87,12 +87,12 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 					GFP_ATOMIC)) == -ENOSPC) {
 
 		dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
-		req_data->wq_buf_avail = false;
+		WRITE_ONCE(req_data->wq_buf_avail, false);
 		list_add_tail(&req_data->list, &vpmem->req_list);
 		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->wq_buf, req_data->wq_buf_avail);
+		wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
 		spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	}
 	err1 = virtqueue_kick(vpmem->req_vq);
@@ -106,7 +106,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		err = -EIO;
 	} else {
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->host_acked, req_data->done);
+		wait_event(req_data->host_acked, READ_ONCE(req_data->done));
 		err = le32_to_cpu(req_data->resp.ret);
 	}
 
-- 
2.52.0

^ permalink raw reply related

* [PATCH v6 06/12] nvdimm: virtio_pmem: always wake -ENOSPC waiters
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

virtio_pmem_host_ack() reclaims virtqueue descriptors with
virtqueue_get_buf(). The -ENOSPC waiter wakeup is tied to completing the
returned token. If token completion is skipped for any reason, reclaimed
descriptors may not wake a waiter and the submitter may sleep forever
waiting for a free slot. Always wake one -ENOSPC waiter for each virtqueue
completion before touching the returned token.
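
A small user-space model of the ordering described above: each reclaimed
descriptor wakes one parked waiter before the token is looked at, so a
skipped completion cannot strand a waiter. The demo_* names are
hypothetical, and waiters and completions are plain counters here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping; not the driver's state. */
struct demo_state {
	int waiters;	/* submitters parked on -ENOSPC */
	int woken;	/* waiters released so far */
	int completed;	/* tokens whose completion was signalled */
};

static void demo_wake_one_waiter(struct demo_state *s)
{
	if (s->waiters == 0)
		return;
	s->waiters--;
	s->woken++;
}

/* token_ok[i] == false models a token whose completion is skipped. */
void demo_host_ack(struct demo_state *s, const bool *token_ok, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		demo_wake_one_waiter(s);	/* a descriptor is free again */
		if (!token_ok[i])
			continue;
		s->completed++;
	}
}

/* Three reclaimed descriptors, the middle completion skipped.
 * Returns woken * 10 + completed. */
int demo_three_acks(int waiters)
{
	const bool token_ok[] = { true, false, true };
	struct demo_state s = { .waiters = waiters };

	demo_host_ack(&s, token_ok, 3);
	return s.woken * 10 + s.completed;
}
```

With three parked waiters all three are released even though only two
tokens complete.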

Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Split out the waiter wakeup ordering change from READ_ONCE()/WRITE_ONCE()
  updates (now patch 4/7), per Pankaj's suggestion.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 91ca144607531..8ed4d6b3a9284 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -9,26 +9,33 @@
 #include "virtio_pmem.h"
 #include "nd.h"
 
+static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req_buf;
+
+	if (list_empty(&vpmem->req_list))
+		return;
+
+	req_buf = list_first_entry(&vpmem->req_list,
+				   struct virtio_pmem_request, list);
+	req_buf->wq_buf_avail = true;
+	wake_up(&req_buf->wq_buf);
+	list_del(&req_buf->list);
+}
+
  /* The interrupt handler */
 void virtio_pmem_host_ack(struct virtqueue *vq)
 {
 	struct virtio_pmem *vpmem = vq->vdev->priv;
-	struct virtio_pmem_request *req_data, *req_buf;
+	struct virtio_pmem_request *req_data;
 	unsigned long flags;
 	unsigned int len;
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
+		virtio_pmem_wake_one_waiter(vpmem);
 		req_data->done = true;
 		wake_up(&req_data->host_acked);
-
-		if (!list_empty(&vpmem->req_list)) {
-			req_buf = list_first_entry(&vpmem->req_list,
-					struct virtio_pmem_request, list);
-			req_buf->wq_buf_avail = true;
-			wake_up(&req_buf->wq_buf);
-			list_del(&req_buf->list);
-		}
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 }
-- 
2.52.0
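
For readers following the ordering change, here is a minimal userspace
sketch of "wake one waiter per completion, before touching the token".
Everything named model_* is hypothetical and only stands in for the driver
structures; a NULL entry in the done array is an assumption used to model a
completion whose token handling is skipped. It is not the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a submitter sleeping on -ENOSPC. */
struct model_waiter {
	bool buf_avail;              /* models wq_buf_avail */
	struct model_waiter *next;   /* models the req_list linkage */
};

/* Hypothetical stand-in for the device; head models vpmem->req_list. */
struct model_dev {
	struct model_waiter *head;
};

/* Mark and unlink at most one queued waiter; report whether one existed. */
static bool model_wake_one_waiter(struct model_dev *dev)
{
	struct model_waiter *w;

	assert(dev != NULL);
	w = dev->head;
	if (!w)
		return false;

	w->buf_avail = true;
	dev->head = w->next;
	w->next = NULL;
	return true;
}

/*
 * Handle n completions. The waiter wakeup runs first for every completion,
 * so it no longer depends on what happens to the returned token.
 * Returns the number of waiters woken.
 */
static size_t model_host_ack(struct model_dev *dev, bool **done, size_t n)
{
	size_t woken = 0;

	for (size_t i = 0; i < n; i++) {
		if (model_wake_one_waiter(dev))
			woken++;
		if (done[i])         /* NULL models a skipped token completion */
			*done[i] = true;
	}
	return woken;
}
```

With two queued waiters and two completions, one of which has its token
handling skipped, both waiters are still marked runnable in this model.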


* [PATCH v6 05/12] nvdimm: virtio_pmem: use GFP_NOIO for flush requests
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

virtio_pmem_flush() can run from pmem_submit_bio() while filesystem IO
is waiting on the flush completion. The request object allocation can
sleep, but it should not enter filesystem or block IO reclaim from this
flush path.

Use GFP_NOIO for the request allocation. The virtqueue descriptor
allocation still uses GFP_ATOMIC because it runs under pmem_lock.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch; keep GFP_NOIO only for the virtio-pmem request allocation.

 drivers/nvdimm/nd_virtio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 4b2e9c47af0f5..91ca144607531 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -55,7 +55,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		return -EIO;
 	}
 
-	req_data = kmalloc_obj(*req_data);
+	req_data = kmalloc_obj(*req_data, GFP_NOIO);
 	if (!req_data)
 		return -ENOMEM;
 
-- 
2.52.0


* [PATCH v6 04/12] nvdimm: virtio_pmem: stop allocating child flush bio
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

pmem_submit_bio() passes the parent bio to nvdimm_flush() for
REQ_FUA. For virtio-pmem this makes async_pmem_flush() allocate
and submit a child PREFLUSH bio chained to the parent.

That child allocation is in the block submit path. Making it
blocking with GFP_NOIO can consume the same global bio mempool that
submit_bio() uses, while making it GFP_ATOMIC can fail under
pressure. A forced failure of the child allocation produced:

virtio_pmem: forcing child bio allocation failure for test
Buffer I/O error on dev pmem0, logical block 0, lost sync page write
EXT4-fs (pmem0): I/O error while writing superblock
EXT4-fs (pmem0): mount failed

Avoid the child bio completely. Flush FUA synchronously, like
REQ_PREFLUSH, then complete the parent after the flush. Since no
child bio can be created, async_pmem_flush() now only issues the
virtio flush and preserves negative errno values.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Replace the child bio allocation fix with synchronous FUA flushing.

 drivers/nvdimm/nd_virtio.c | 22 ++++------------------
 drivers/nvdimm/pmem.c      |  2 +-
 2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 4176046627beb..4b2e9c47af0f5 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -110,27 +110,13 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 /* The asynchronous flush callback function */
 int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 {
-	/*
-	 * Create child bio for asynchronous flush and chain with
-	 * parent bio. Otherwise directly call nd_region flush.
-	 */
-	if (bio && bio->bi_iter.bi_sector != -1) {
-		struct bio *child = bio_alloc(bio->bi_bdev, 0,
-					      REQ_OP_WRITE | REQ_PREFLUSH,
-					      GFP_ATOMIC);
+	int err;
 
-		if (!child)
-			return -ENOMEM;
-		bio_clone_blkg_association(child, bio);
-		child->bi_iter.bi_sector = -1;
-		bio_chain(child, bio);
-		submit_bio(child);
-		return 0;
-	}
-	if (virtio_pmem_flush(nd_region))
+	err = virtio_pmem_flush(nd_region);
+	if (err > 0)
 		return -EIO;
 
-	return 0;
+	return err;
 };
 EXPORT_SYMBOL_GPL(async_pmem_flush);
 MODULE_DESCRIPTION("Virtio Persistent Memory Driver");
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 82ee1ddb3a445..058d2739c95a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -242,7 +242,7 @@ static void pmem_submit_bio(struct bio *bio)
 	}
 
 	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
-		ret = nvdimm_flush(nd_region, bio);
+		ret = nvdimm_flush(nd_region, NULL);
 
 	if (ret)
 		bio->bi_status = errno_to_blk_status(ret);
-- 
2.52.0
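
The return-value handling in the new async_pmem_flush() can be sketched in
isolation. The helper name model_normalize_flush_err() is hypothetical. The
reading that a positive value comes from the host response
(le32_to_cpu(req_data->resp.ret) in virtio_pmem_flush()) is an assumption
based on the context lines earlier in the series.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper mirroring the patch: a positive value is folded
 * into -EIO, while 0 and negative errno values pass through unchanged.
 */
static int model_normalize_flush_err(int err)
{
	if (err > 0)
		return -EIO;
	return err;
}

/* Small self-check of the three cases. */
static void model_normalize_selfcheck(void)
{
	assert(model_normalize_flush_err(0) == 0);
	assert(model_normalize_flush_err(1) == -EIO);
	assert(model_normalize_flush_err(-ENOMEM) == -ENOMEM);
}
```

The point of the split is that callers always see either 0 or a negative
errno, so the errno-to-blk_status conversion in pmem_submit_bio() never
receives a positive value.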


* [PATCH v6 03/12] nvdimm: pmem: guard data loop for dataless bios
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

pmem_submit_bio() handles flush-only bios before and after the data
loop. Keep dataless bios out of bio_for_each_segment() so the data path
only walks bios that actually carry bvec data.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch.

 drivers/nvdimm/pmem.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 05d3de33e2706..82ee1ddb3a445 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -217,23 +217,29 @@ static void pmem_submit_bio(struct bio *bio)
 		}
 	}
 
-	do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
-	if (do_acct)
-		start = bio_start_io_acct(bio);
-	bio_for_each_segment(bvec, bio, iter) {
-		if (op_is_write(bio_op(bio)))
-			rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
-				iter.bi_sector, bvec.bv_len);
-		else
-			rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
-				iter.bi_sector, bvec.bv_len);
-		if (rc) {
-			bio->bi_status = rc;
-			break;
+	if (bio_has_data(bio)) {
+		do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
+		if (do_acct)
+			start = bio_start_io_acct(bio);
+		bio_for_each_segment(bvec, bio, iter) {
+			if (op_is_write(bio_op(bio)))
+				rc = pmem_do_write(pmem, bvec.bv_page,
+						   bvec.bv_offset,
+						   iter.bi_sector,
+						   bvec.bv_len);
+			else
+				rc = pmem_do_read(pmem, bvec.bv_page,
+						  bvec.bv_offset,
+						  iter.bi_sector,
+						  bvec.bv_len);
+			if (rc) {
+				bio->bi_status = rc;
+				break;
+			}
 		}
+		if (do_acct)
+			bio_end_io_acct(bio, start);
 	}
-	if (do_acct)
-		bio_end_io_acct(bio, start);
 
 	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
 		ret = nvdimm_flush(nd_region, bio);
-- 
2.52.0


* [PATCH v6 02/12] nvdimm: pmem: keep PREFLUSH before data writes
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

pmem_submit_bio() records a REQ_PREFLUSH error, but continues to copy the
bio data and can later overwrite the error with a successful REQ_FUA flush.
That lets data writes run after a failed preflush and can complete the bio
successfully despite the failed ordering barrier.

Run the REQ_PREFLUSH flush synchronously before touching the bio data and
complete the bio with the flush error if it fails. Keep asynchronous flush
chaining for REQ_FUA. At that point, data copy has completed and the parent
bio can wait for the chained flush bio.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v5:
- New patch.

 drivers/nvdimm/pmem.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 92c67fbbc1c85..05d3de33e2706 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -208,8 +208,14 @@ static void pmem_submit_bio(struct bio *bio)
 	struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
 	struct nd_region *nd_region = to_region(pmem);
 
-	if (bio->bi_opf & REQ_PREFLUSH)
-		ret = nvdimm_flush(nd_region, bio);
+	if (bio->bi_opf & REQ_PREFLUSH) {
+		ret = nvdimm_flush(nd_region, NULL);
+		if (ret) {
+			bio->bi_status = errno_to_blk_status(ret);
+			bio_endio(bio);
+			return;
+		}
+	}
 
 	do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
 	if (do_acct)
@@ -229,7 +235,7 @@ static void pmem_submit_bio(struct bio *bio)
 	if (do_acct)
 		bio_end_io_acct(bio, start);
 
-	if (bio->bi_opf & REQ_FUA)
+	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
 		ret = nvdimm_flush(nd_region, bio);
 
 	if (ret)
-- 
2.52.0
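
A rough userspace model of the control flow after this patch, for anyone
checking the ordering argument. All model_* names are hypothetical, and the
model stores plain errno values in status, whereas the driver converts them
with errno_to_blk_status(). It only illustrates the two rules: a failed
PREFLUSH ends the bio before any data copy, and the FUA flush is skipped
once the data copy has recorded an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bio: flags, final status, and a copy marker. */
struct model_bio {
	bool preflush;
	bool fua;
	int status;          /* 0 or negative errno (simplification) */
	bool data_copied;
};

/* Hypothetical injected results plus a counter of flushes issued. */
struct model_env {
	int preflush_err;
	int data_err;
	int fua_err;
	int flushes;
};

static void model_submit(struct model_bio *bio, struct model_env *env)
{
	int ret = 0;

	assert(bio != NULL && env != NULL);

	if (bio->preflush) {
		env->flushes++;
		ret = env->preflush_err;
		if (ret) {
			/* Failed barrier: complete now, never touch data. */
			bio->status = ret;
			return;
		}
	}

	bio->data_copied = true;
	bio->status = env->data_err;

	/* Only flush for FUA when the data copy succeeded. */
	if (bio->fua && !bio->status) {
		env->flushes++;
		ret = env->fua_err;
	}

	if (ret)
		bio->status = ret;
}
```

In this model a data error can no longer be replaced by the result of a
later flush, because the flush is not issued in that case.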


* [PATCH v6 01/12] nvdimm: preserve flush callback errors
From: Li Chen @ 2026-06-21 13:02 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260621130246.2973254-1-me@linux.beauty>

nvdimm_flush() currently converts any non-zero provider flush error to
-EIO. That loses useful errno values from provider callbacks.

A local virtio-pmem mkfs sanity test showed the masking clearly:

  wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
  mkfs.ext4: Input/output error while writing out and closing file system
  nd_region region0: dbg: nvdimm_flush rc=-5

The virtio-pmem callback can return -ENOMEM when async_pmem_flush() fails
to allocate a child flush bio, but nvdimm_flush() hides that as -EIO before
pmem_submit_bio() converts it to a block status.

Return the provider callback error directly. The generic flush path still
returns 0, and pmem_submit_bio() already handles errno-to-blk_status
conversion for bio completion.

Signed-off-by: Li Chen <me@linux.beauty>
---
v3->v4:
- New patch.

 drivers/nvdimm/region_devs.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e35c2e18518f0..0cd96503c0596 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1114,10 +1114,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
 
 	if (!nd_region->flush)
 		rc = generic_nvdimm_flush(nd_region);
-	else {
-		if (nd_region->flush(nd_region, bio))
-			rc = -EIO;
-	}
+	else
+		rc = nd_region->flush(nd_region, bio);
 
 	return rc;
 }
-- 
2.52.0
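
As a sketch of the dispatch after this change, assuming a provider callback
that fails with -ENOMEM: the model_* names below are hypothetical, and the
callback signature is simplified (the real one also takes a bio).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical region with an optional provider flush callback. */
struct model_region {
	int (*flush)(struct model_region *region);
};

/* Models the generic flush path, which returns 0. */
static int model_generic_flush(struct model_region *region)
{
	(void)region;
	return 0;
}

/* Models a provider callback that fails an allocation. */
static int model_enomem_flush(struct model_region *region)
{
	(void)region;
	return -ENOMEM;
}

/* The callback result is returned as-is instead of being masked as -EIO. */
static int model_nvdimm_flush(struct model_region *region)
{
	assert(region != NULL);

	if (!region->flush)
		return model_generic_flush(region);
	return region->flush(region);
}
```

A caller of this model sees -ENOMEM from the provider rather than a blanket
-EIO, which is the behaviour the commit message describes.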
