Linux cgroups development

Linux cgroups development
 help / color / mirror / Atom feed

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Joshua Hahn @ 2026-06-22 22:10 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <CAO9r8zNjyW1rh26vv2vavCM_2-r70EuynU+-7XdEmrBdLL=TkQ@mail.gmail.com>

On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@kernel.org> wrote:

> On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
> >
> > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > (allowed, the default) or "0" (disabled).  A tier is one bit in the
> > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > clears that bit.
> >
> > Since the current use case lacks amount control, it only supports
> > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > usage, relying instead on a fast runtime bitmask check.
> >
> > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > effective state is "0" even if its `mask` is "max"). Maintaining this
> > separately avoids costly cgroup tree traversals to check ancestors at
> > runtime.
> >
> > Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Suggested-by: Yosry Ahmed <yosry@kernel.org>
> > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  20 +++++
> >  Documentation/mm/swap-tier.rst          |   9 +++
> >  include/linux/memcontrol.h              |   5 ++
> >  mm/memcontrol.c                         |  67 ++++++++++++++++
> >  mm/swap_state.c                         |   5 +-
> >  mm/swap_tier.c                          | 102 +++++++++++++++++++++++-
> >  mm/swap_tier.h                          |  57 +++++++++++--
> >  7 files changed, 255 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 6efd0095ed99..4843ffcfd110 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> >         Swap usage hard limit.  If a cgroup's swap usage reaches this
> >         limit, anonymous memory of the cgroup will not be swapped out.
> >
> > +  memory.swap.tiers.max
> > +       A read-write flat-keyed file which exists on non-root
> > +       cgroups.  The default is "max" for every tier.

Hi Yosry,

Sorry, I feel like I'm joining the party late. Apologies if I'm missing
some context or repeating a discussion that's already been had.
Please let me know if that is the case.

One quick tangent:
I was chatting with Nhat last week about swap tiers and its relation to
memory tiering. Nhat brought up a good point, which is that while both
swap tiers and memory tiers provide a clear hierarchy of performance,
only memory tiering allows for movement between the tiers.
AFAICT, swap tiering does not allow for direct migration from a higher
tier swap backend to a lower tier swap backend if the higher tier
backend runs out of memory.

In that sense, I'm not entirely sure if we need to enforce similar
semantics across swap tiering and memory tiering; it seems like there
are some fundamental differences anyways to how we treat these tiers.

> I wonder what should the default behavior be if memory.swap.max is set
> to a value other than "max". Should the limits in
> memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> to keep the behavior consistent with memory tiering.
> 
> Shakeel/Joshua, WDYT?

I think that the motivation behind these tiers is different for swap
and memory. Tiered memory limits is motivated by preventing one
workload from conusming all of a valuable resource, while swap tiers
seems more to do with excluding certain workloads from using performant
tiers and ensuring other workloads stay on those performant tiers.

IOW memory tiers exist for fairness, but it seems like swap tiers exist
for workload performance tiering. But maybe there's a usecase out there
that would want fairness to apply in the swap tiers as well that I am
not seeing.

If that is the case, I think auto-scaling makes sense but can be a bit
tricky, since there is no universal tiered ratio; each workload will
have different tiers it can swap to, so they will all have to calculate
their own ratios. Tiered memory limits escapes this difficulty since we
assume all memory can be placed on all tiers, so we have a system-wide
ratio : -)

Let me know what you think! Have a great day :D
Joshua

^ permalink raw reply

* Re: [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Jens Axboe @ 2026-06-22 22:01 UTC (permalink / raw)
  To: tj, josef, linux-block, Zizhi Wo
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260616011746.2451461-1-wozizhi@huaweicloud.com>


On Tue, 16 Jun 2026 09:17:46 +0800, Zizhi Wo wrote:
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
> 
>   BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>   Call Trace:
>   ...
>   blkcg_deactivate_policy+0x244/0x4d0
>   ioc_rqos_exit+0x44/0xe0
>   rq_qos_exit+0xba/0x120
>   __del_gendisk+0x50b/0x800
>   del_gendisk+0xff/0x190
>   ...
> 
> [...]

Applied, thanks!

[1/1] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
      commit: 3ed9b4779a4aa3f44cd9f78627498d7adac40daa

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH] block, bfq: protect async queue reset with blkcg locks
From: Jens Axboe @ 2026-06-22 22:00 UTC (permalink / raw)
  To: Yu Kuai, Tejun Heo, Josef Bacik, Arianna Avanzini, Paolo Valente,
	Cen Zhang
  Cc: linux-block, cgroups, linux-kernel, baijiaju1990
In-Reply-To: <20260621135930.2657810-1-zzzccc427@gmail.com>


On Sun, 21 Jun 2026 21:59:30 +0800, Cen Zhang wrote:
> Writing 0 to BFQ's low_latency attribute ends weight raising for active,
> idle and async queues. The async cgroup path walks q->blkg_list, converts
> each blkg to BFQ policy data and then reads bfqg->async_bfqq and
> bfqg->async_idle_bfqq.
> 
> That walk was protected only by bfqd->lock. blkcg release work is
> serialized by q->blkcg_mutex and q->queue_lock instead, and
> blkg_free_workfn() can call BFQ's pd_free_fn before it removes
> blkg->q_node from q->blkg_list. A low_latency reset can therefore still
> find the blkg on the queue list after the BFQ policy data has been freed.
> 
> [...]

Applied, thanks!

[1/1] block, bfq: protect async queue reset with blkcg locks
      commit: 17b2d950a3c0328ed749476e6118ca869b3ca8b5

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH v9 0/6] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
From: Yosry Ahmed @ 2026-06-22 21:23 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, youngjun.park, linux-mm, cgroups, linux-kernel,
	kasong, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com>

On Sat, Jun 20, 2026 at 11:16 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
>
> This is the v9 series of the swap tier patchset.
>
> The main change in this version is the addition of selftests for the tier
> interfaces, requested by Nhat; see the changelog below for the other changes.
> I designed the test cases and wrote the selftests with some AI assistance.
>
> For context, the bulk of the series is unchanged since v8, with great thanks
> to Shakeel Butt and Yosry for the reviews and discussions [1] that shaped it.
> The main change in v8 was the interface change to use memory.swap.tiers.max
> with '0' (disable) and 'max' (enable) values. This mechanism was suggested
> by Shakeel and Yosry.
>
> This change allows for future extensions to control swap between tiers and
> aligns better with existing memcg interfaces. It is confined to patch #3's
> user-facing interface; internally, patch #3 still uses the existing mask
> processing method, which is implementation-efficient.
>
> We also discussed tier extensions. Thanks to Yosry, Nhat and Shakeel for their
> valuable feedback.
>
> Here is a brief summary of our tentative conclusions. Please correct me
> if anything is misrepresented (details in references):
>
> * Zswap tiering [2]:
>   Tiering applies only to the vswap + zswap combo. Zswap itself will
>   not be tiered, as the current architecture requires a physical device
>   for zswap allocation.

I thought we agreed that zswap should be a tier, so that proactive
zswap writeback can be implemented as proactive swap demotion?

The only restriction we talked about is that zswap cannot be the only
allowed tier as long as vswap isn't supported. We can lift the
restriction when vswap support is added.

^ permalink raw reply

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Yosry Ahmed @ 2026-06-22 21:21 UTC (permalink / raw)
  To: Youngjun Park, Shakeel Butt, Joshua Hahn
  Cc: akpm, chrisl, youngjun.park, linux-mm, cgroups, linux-kernel,
	kasong, hannes, mhocko, roman.gushchin, muchun.song, shikemeng,
	nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260620181635.299364-4-youngjun.park@lge.com>

On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
>
> Introduce memory.swap.tiers.max, a flat-keyed file listing each
> tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> (allowed, the default) or "0" (disabled).  A tier is one bit in the
> cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> clears that bit.
>
> Since the current use case lacks amount control, it only supports
> "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> usage, relying instead on a fast runtime bitmask check.
>
> We maintain both `mask` and `effective_mask`. The `effective_mask` is
> strictly bounded by the parent (e.g., if a parent is "0", the child's
> effective state is "0" even if its `mask` is "max"). Maintaining this
> separately avoids costly cgroup tree traversals to check ancestors at
> runtime.
>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Suggested-by: Yosry Ahmed <yosry@kernel.org>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  20 +++++
>  Documentation/mm/swap-tier.rst          |   9 +++
>  include/linux/memcontrol.h              |   5 ++
>  mm/memcontrol.c                         |  67 ++++++++++++++++
>  mm/swap_state.c                         |   5 +-
>  mm/swap_tier.c                          | 102 +++++++++++++++++++++++-
>  mm/swap_tier.h                          |  57 +++++++++++--
>  7 files changed, 255 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..4843ffcfd110 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1850,6 +1850,26 @@ The following nested keys are defined.
>         Swap usage hard limit.  If a cgroup's swap usage reaches this
>         limit, anonymous memory of the cgroup will not be swapped out.
>
> +  memory.swap.tiers.max
> +       A read-write flat-keyed file which exists on non-root
> +       cgroups.  The default is "max" for every tier.

I wonder what should the default behavior be if memory.swap.max is set
to a value other than "max". Should the limits in
memory.swap.tiers.max auto-scale or remain as "max"? We probably want
to keep the behavior consistent with memory tiering.

Shakeel/Joshua, WDYT?

^ permalink raw reply

* [PATCH v2] selftests/cgroup: Adjust cpu.max quota based on HZ
From: Joe Simmons-Talbott @ 2026-06-22 19:43 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Shuah Khan
  Cc: Joe Simmons-Talbott, cgroups, linux-kselftest, linux-kernel

For lower HZ values a quota of 1000us is much lower than the amount
of microseconds per tick which makes the tests test_cpucg_max and
test_cpugc_max_nested fail. Use the amount of microseconds per tick
as the quota value.

Signed-off-by: Joe Simmons-Talbott <joest@redhat.com>
---
changes since v1:
- Try checking /proc/config.gz to get the actual kernel HZ value and
  fallback to 1000 if the value cannot be determined.

 tools/testing/selftests/cgroup/test_cpu.c | 33 +++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index 7a40d76b9548..65e09555309f 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -639,6 +639,29 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 	return run_cpucg_nested_weight_test(root, false);
 }
 
+/*
+ * Best effort attempt to get the kernel's HZ value from the config.
+ * Return the HZ value if found otherwise return -1 to indicate failure.
+ */
+static long
+_get_config_hz(void)
+{
+	long hz = -1;
+	FILE *f;
+	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
+
+	f = popen(cmd, "r");
+
+	if (!f)
+		goto out;
+
+	fscanf(f, "CONFIG_HZ=%ld", &hz);
+
+out:
+	pclose(f);
+	return hz;
+}
+
 /*
  * This test creates a cgroup with some maximum value within a period, and
  * verifies that a process in the cgroup is not overscheduled.
@@ -646,7 +669,8 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 static int test_cpucg_max(const char *root)
 {
 	int ret = KSFT_FAIL;
-	long quota_usec = 1000;
+	long hz = _get_config_hz();
+	long quota_usec;
 	long default_period_usec = 100000; /* cpu.max's default period */
 	long duration_seconds = 1;
 
@@ -655,6 +679,8 @@ static int test_cpucg_max(const char *root)
 	char *cpucg;
 	char quota_buf[32];
 
+	quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
+
 	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
 
 	cpucg = cg_name(root, "cpucg_test");
@@ -710,7 +736,8 @@ static int test_cpucg_max(const char *root)
 static int test_cpucg_max_nested(const char *root)
 {
 	int ret = KSFT_FAIL;
-	long quota_usec = 1000;
+	long quota_usec;
+	long hz = _get_config_hz();
 	long default_period_usec = 100000; /* cpu.max's default period */
 	long duration_seconds = 1;
 
@@ -719,6 +746,8 @@ static int test_cpucg_max_nested(const char *root)
 	char *parent, *child;
 	char quota_buf[32];
 
+	quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
+
 	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
 
 	parent = cg_name(root, "cpucg_parent");
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v3 5/7] kernel: Use mutable list iterators
From: Eduard Zingerman @ 2026-06-22 19:03 UTC (permalink / raw)
  To: Kaitao Cheng, Paul Moore, Eric Paris, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Tejun Heo, Johannes Weiner, Michal Koutný,
	Maarten Lankhorst, Maxime Ripard, Natalie Vock, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Peter Oberparleiter,
	Andrew Morton, Baoquan He, Mike Rapoport, Pasha Tatashin,
	Pratyush Yadav, Naveen N Rao, Josh Poimboeuf, Jiri Kosina,
	Miroslav Benes, Petr Mladek, Will Deacon, Boqun Feng,
	Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen,
	Steffen Klassert, Daniel Jordan, Rafael J. Wysocki,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Uladzislau Rezki, Juri Lelli, Vincent Guittot, Kees Cook,
	Balbir Singh, Anna-Maria Behnsen, Thomas Gleixner, John Stultz,
	KP Singh, Matt Bobrowski, Nathan Chancellor, Martin KaFai Lau,
	Song Liu, Mark Rutland, Mathieu Desnoyers, Dietmar Eggemann,
	David Vernet, Steven Rostedt
  Cc: audit, linux-kernel, bpf, netdev, cgroups, dri-devel,
	linux-perf-users, linux-trace-kernel, kexec, live-patching,
	linux-modules, linux-crypto, linux-pm, rcu, sched-ext, llvm,
	Kaitao Cheng
In-Reply-To: <20260622042811.31684-1-kaitao.cheng@linux.dev>

On Mon, 2026-06-22 at 12:28 +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The safe list iteration helpers require callers to provide a temporary
> cursor even when the cursor is only used internally by the loop. This
> leaves many functions with otherwise unused variables whose only purpose
> is to satisfy the old iterator interface.
> 
> Use the mutable list iteration helpers for those cases. The mutable
> helpers keep the same removal-safe traversal semantics, while allowing
> the temporary cursor to be internal to the macro when the caller does
> not need to observe it.
> 
> Convert list, hlist and llist users under kernel/ where the temporary
> cursor is not used outside the iteration. Keep the explicit cursor form
> where the next entry is still needed by the surrounding code.
> 
> No functional change intended.
> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---

Beside the fact that this does not apply,
I don't see a reason why is this needed for BPF sub-tree.

[...]

^ permalink raw reply

* Re: [PATCH] Docs/admin-guide/cgroup-v2: fix memory.stat doc details
From: Nhat Pham @ 2026-06-22 16:41 UTC (permalink / raw)
  To: Doehyun Baek
  Cc: Tejun Heo, Jonathan Corbet, Johannes Weiner, Michal Koutný,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Yosry Ahmed, cgroups,
	linux-doc, linux-kernel
In-Reply-To: <20260620122751.388770-1-doehyunbaek@gmail.com>

On Sat, Jun 20, 2026 at 5:28 AM Doehyun Baek <doehyunbaek@gmail.com> wrote:
>
> Fix minor cgroup v2 memory.stat documentation issues.  Correct the
> vmalloc per-node marker now that vmalloc uses the native NR_VMALLOC node
> stat, and document zswap_incomp as a byte-valued memory amount instead
> of as a page counter.
>
> Fixes: c466412c73c3 ("mm: memcontrol: switch to native NR_VMALLOC vmstat counter")
> Fixes: 5ad41a38c364 ("mm: zswap: add per-memcg stat for incompressible pages")
> Signed-off-by: Doehyun Baek <doehyunbaek@gmail.com>
> -               Number of incompressible pages currently stored in zswap
> +               Amount of memory used by incompressible pages currently stored in zswap
>                 without compression. These pages could not be compressed to
>                 a size smaller than PAGE_SIZE, so they are stored as-is.
>

Good catch :)

Reviewed-by: Nhat Pham <nphamcs@gmail.com>

^ permalink raw reply

* Re: [PATCH] Docs/admin-guide/cgroup-v2: fix memory.stat doc details
From: Michal Koutný @ 2026-06-22 16:25 UTC (permalink / raw)
  To: Doehyun Baek
  Cc: Tejun Heo, Jonathan Corbet, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Roman Gushchin, Yosry Ahmed, Nhat Pham, cgroups,
	linux-doc, linux-kernel
In-Reply-To: <CAN-j9Upy=thswORWaU+QxuO2i8uJKrZxcLpt5umP5QGRhpwqaQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1585 bytes --]

On Mon, Jun 22, 2026 at 05:26:53PM +0200, Doehyun Baek <doehyunbaek@gmail.com> wrote:
> However, both zswapped and zswap_incomp are memory_stats[] entries, so
> memory.stat prints them through memcg_page_state_output(). Since
> MEMCG_ZSWAP_INCOMP is not special-cased as a raw count, the stored page
> count is multiplied by the default PAGE_SIZE unit and exported as bytes.
> 
>     unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item)
>     {
>         return memcg_page_state(memcg, item) *
>         memcg_page_state_output_unit(item);
>     }

Ah, I messed up how memcg_page_state_output_unit() is used. The printed
values are amounts (in bytes).

> Separately, this matches the existing documentation style for zswapped,
> whose exported value is described as a memory amount:
> 
>     zswapped
>         Amount of application memory swapped out to zswap.
> 
> Since zswap_incomp follows the same memory.stat output path, I think its
> documentation should describe the exported value as a memory amount too.
> 
> I also boot-tested this in QEMU with the current tree and zswap enabled.
> With incompressible pages pushed into zswap, memory.stat showed:
> 
>     zswap 87822336
>     zswapped 87822336
>     zswap_incomp 87822336

Thanks for the test and for the fix!

> 
> The zswap_incomp value there is byte-valued; it is not a plain page
> count. It also matches zswapped in this all-incompressible case, which
> is consistent with both being exported as memory amounts.

Acked-by: Michal Koutný <mkoutny@suse.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH] Docs/admin-guide/cgroup-v2: fix memory.stat doc details
From: Doehyun Baek @ 2026-06-22 15:26 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Jonathan Corbet, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Roman Gushchin, Yosry Ahmed, Nhat Pham, cgroups,
	linux-doc, linux-kernel
In-Reply-To: <ajlLhFnMZGoVxLE6@localhost.localdomain>

> ...but what do you mean by this?
> As I'm looking at the code in obj_cgroup_charge_zswap() and
> memcg_page_state_output_unit(), I'd say those are pages and the docs is
> thus alright.
>
> Thanks,
> Michal

Thanks for taking a look.

I agree that the counters are pages internally. I was talking about what
gets printed in memory.stat.

The internal updates are page-count based:

    mod_memcg_state(memcg, MEMCG_ZSWAPPED, 1);
    if (size == PAGE_SIZE)
        mod_memcg_state(memcg, MEMCG_ZSWAP_INCOMP, 1);

However, both zswapped and zswap_incomp are memory_stats[] entries, so
memory.stat prints them through memcg_page_state_output(). Since
MEMCG_ZSWAP_INCOMP is not special-cased as a raw count, the stored page
count is multiplied by the default PAGE_SIZE unit and exported as bytes.

    unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item)
    {
        return memcg_page_state(memcg, item) *
        memcg_page_state_output_unit(item);
    }

Separately, this matches the existing documentation style for zswapped,
whose exported value is described as a memory amount:

    zswapped
        Amount of application memory swapped out to zswap.

Since zswap_incomp follows the same memory.stat output path, I think its
documentation should describe the exported value as a memory amount too.

I also boot-tested this in QEMU with the current tree and zswap enabled.
With incompressible pages pushed into zswap, memory.stat showed:

    zswap 87822336
    zswapped 87822336
    zswap_incomp 87822336

The zswap_incomp value there is byte-valued; it is not a plain page
count. It also matches zswapped in this all-incompressible case, which
is consistent with both being exported as memory amounts.

Best,
Doehyun Baek

^ permalink raw reply

* Re: [PATCH v6 0/6] [PATCH v6 0/6] Add reclaim to the dmem cgroup controller
From: Thadeu Lima de Souza Cascardo @ 2026-06-22 15:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Hellström, intel-xe, Natalie Vock, Johannes Weiner,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <ajBLAsNoKesXmFcs@slm.duckdns.org>

On Mon, Jun 15, 2026 at 08:57:06AM -1000, Tejun Heo wrote:
> On Mon, Jun 15, 2026 at 08:49:55AM -1000, Tejun Heo wrote:
> > The canonical behavior for cgroup2 would be not failing the write at all
> > even when the usage can't be brought down below the new max. Updating the
> > target configuration and tracking the current usage are separate operations.
> > The former should just set max and trigger reclaim and a writer should not
> > assume that a successful write indicates that the usage is below the written
> > max value.
> 
> Sent too early. One of the reasons is that cgroup is hierarchical and there
> can be multiple delegation layers and if you tie application of configuration
> to immediate enforcement, some hierarchical control actions become racy and
> awkward.
> 
> Here's an example: Imagine a system agent trying to lower usage in a subtree
> which contains multiple delegated containers. If max can be set below what
> reclaim can achieve immediately, it can just set the max and if the usage is
> still too high, can go around and e.g. kill some of the containers. If max
> write fails, it'd have to kill and then try again and inbetween someone else
> might push up the usage.
> 
> Thanks.
> 
> -- 
> tejun

Hi, Tejun.

As far as I understood the patchset, it doesn't fail the write if it fails
to reclaim. It sets the new max, then, if the write is blocking, starts
reclaim and eventually returns after multiple attempts. But it still
returns success.

So I believe this is behaving as you would expect.

Regards.
Cascardo.

^ permalink raw reply

* Re: [PATCH v2] cgroup/dmem: accept only one region per limit write
From: Thadeu Lima de Souza Cascardo @ 2026-06-22 15:14 UTC (permalink / raw)
  To: Eric Chanudet
  Cc: Maarten Lankhorst, Maxime Ripard, Natalie Vock, Tejun Heo,
	Johannes Weiner, Michal Koutný, cgroups, dri-devel,
	linux-kernel, Albert Esteve
In-Reply-To: <20260608-cgroup-dmem-write-single-region-v2-1-b0cd6c4ccf1b@redhat.com>

On Mon, Jun 08, 2026 at 11:53:51AM -0400, Eric Chanudet wrote:
> Accept only one "region value" pair entry for the dmem.max, dmem.min,
> dmem.low files.
> 
> This changes the UAPI that otherwise accepted multiple lines for setting
> multiple entries in one write. No existing user is known to rely on
> writing multiple regions in a single write.
> 
> Processing multiple regions in dmemcg_limit_write() could quietly change
> first limits before failing on a later one and returning an error to the
> writer, with no indication some changes occurred.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Eric Chanudet <echanude@redhat.com>

I did some review over any potential NULL derefs and tested with different
corner cases. I could not find any issues.

Thanks.
Cascardo.

Reviewed-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Tested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>

> ---
> Follow up from discussions on a previous thread[1].
> If Albert's series[2] lands, I can cleanup and prepare some kunits for
> these as well.
> [1] https://lore.kernel.org/all/158bc103-7f99-4df4-8d3b-2da9b04ac0ed@lankhorst.se/
> [2] https://lore.kernel.org/all/20260519-kunit_cgroups-v4-1-f6c2f498fae4@redhat.com/
> ---
> Changes in v2:
> - Handle buf == NULL by testing !buf first after strsep (Natalie)
> - Don't allow extra spaces to separate key and value (Natalie)
>   Other cgroup files don't (rdma, misc), so stay consistent.
> - Link to v1: https://lore.kernel.org/r/20260605-cgroup-dmem-write-single-region-v1-1-9137f296579c@redhat.com
> ---
>  kernel/cgroup/dmem.c | 69 +++++++++++++++++++---------------------------------
>  1 file changed, 25 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
> index 6430c7ce1e0372f59f1313163fb7630ce49ac1ef..39930c59cb769a505a5852a5644a371fd5596f59 100644
> --- a/kernel/cgroup/dmem.c
> +++ b/kernel/cgroup/dmem.c
> @@ -734,57 +734,38 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
>  				 void (*apply)(struct dmem_cgroup_pool_state *, u64))
>  {
>  	struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
> -	int err = 0;
> -
> -	while (buf && !err) {
> -		struct dmem_cgroup_pool_state *pool = NULL;
> -		char *options, *region_name;
> -		struct dmem_cgroup_region *region;
> -		u64 new_limit;
> -
> -		options = buf;
> -		buf = strchr(buf, '\n');
> -		if (buf)
> -			*buf++ = '\0';
> -
> -		options = strstrip(options);
> -
> -		/* eat empty lines */
> -		if (!options[0])
> -			continue;
> -
> -		region_name = strsep(&options, " \t");
> -		if (!region_name[0])
> -			continue;
> -
> -		if (!options || !*options)
> -			return -EINVAL;
> +	struct dmem_cgroup_pool_state *pool;
> +	struct dmem_cgroup_region *region;
> +	char *region_name;
> +	u64 new_limit;
> +	int err;
>  
> -		rcu_read_lock();
> -		region = dmemcg_get_region_by_name(region_name);
> -		rcu_read_unlock();
> +	buf = strstrip(buf);
> +	region_name = strsep(&buf, " \t");
> +	if (!buf || !region_name[0])
> +		return -EINVAL;
>  
> -		if (!region)
> -			return -EINVAL;
> +	rcu_read_lock();
> +	region = dmemcg_get_region_by_name(region_name);
> +	rcu_read_unlock();
> +	if (!region)
> +		return -EINVAL;
>  
> -		err = dmemcg_parse_limit(options, &new_limit);
> -		if (err < 0)
> -			goto out_put;
> +	err = dmemcg_parse_limit(buf, &new_limit);
> +	if (err < 0)
> +		goto out_put;
>  
> -		pool = get_cg_pool_unlocked(dmemcs, region);
> -		if (IS_ERR(pool)) {
> -			err = PTR_ERR(pool);
> -			goto out_put;
> -		}
> +	pool = get_cg_pool_unlocked(dmemcs, region);
> +	if (IS_ERR(pool)) {
> +		err = PTR_ERR(pool);
> +		goto out_put;
> +	}
>  
> -		/* And commit */
> -		apply(pool, new_limit);
> -		dmemcg_pool_put(pool);
> +	apply(pool, new_limit);
> +	dmemcg_pool_put(pool);
>  
>  out_put:
> -		kref_put(&region->ref, dmemcg_free_region);
> -	}
> -
> +	kref_put(&region->ref, dmemcg_free_region);
>  
>  	return err ?: nbytes;
>  }
> 
> ---
> base-commit: 640c57d6ca1346a1c2363a3f473b405af979e046
> change-id: 20260605-cgroup-dmem-write-single-region-9bf05b6d995d
> 
> Best regards,
> -- 
> Eric Chanudet <echanude@redhat.com>
> 

^ permalink raw reply

* Re: [PATCH] Docs/admin-guide/cgroup-v2: fix memory.stat doc details
From: Michal Koutný @ 2026-06-22 14:59 UTC (permalink / raw)
  To: Doehyun Baek
  Cc: Tejun Heo, Jonathan Corbet, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Roman Gushchin, Yosry Ahmed, Nhat Pham, cgroups,
	linux-doc, linux-kernel
In-Reply-To: <20260620122751.388770-1-doehyunbaek@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1816 bytes --]

On Sat, Jun 20, 2026 at 12:27:51PM +0000, Doehyun Baek <doehyunbaek@gmail.com> wrote:
> Fix minor cgroup v2 memory.stat documentation issues.  Correct the
> vmalloc per-node marker now that vmalloc uses the native NR_VMALLOC node
> stat, and document zswap_incomp as a byte-valued memory amount instead
> of as a page counter.
> 
> Fixes: c466412c73c3 ("mm: memcontrol: switch to native NR_VMALLOC vmstat counter")
> Fixes: 5ad41a38c364 ("mm: zswap: add per-memcg stat for incompressible pages")
> Signed-off-by: Doehyun Baek <doehyunbaek@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 993446ab66d0..ce6741f78f4f 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1570,7 +1570,7 @@ The following nested keys are defined.
>  	  sock (npn)
>  		Amount of memory used in network transmission buffers
>  
> -	  vmalloc (npn)
> +	  vmalloc
>  		Amount of memory used for vmap backed memory.

The vmalloc change looks OK...

>  
>  	  shmem
> @@ -1735,7 +1735,7 @@ The following nested keys are defined.
>  		Number of pages written from zswap to swap.
>  
>  	  zswap_incomp
> -		Number of incompressible pages currently stored in zswap
> +		Amount of memory used by incompressible pages currently stored in zswap
>  		without compression. These pages could not be compressed to
>  		a size smaller than PAGE_SIZE, so they are stored as-is.

...but what do you mean by this?
As I'm looking at the code in obj_cgroup_charge_zswap() and
memcg_page_state_output_unit(), I'd say those are pages and the docs is
thus alright.

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [Kernel Bug] INFO: task hung in cgroup_drain_dying
From: Longxing Li @ 2026-06-22 13:57 UTC (permalink / raw)
  To: Michal Koutný; +Cc: syzkaller, tj, hannes, cgroups, linux-kernel
In-Reply-To: <ailmCVEqnnQZ7ClA@localhost.localdomain>

Hi Michal,
  Thank you for the reply.
  We will check this problem on the latest kernel to see if it's been
fixed, and will report back if we find any more valuable things to
discuss.

  Best regards,
  Longxing Li

Michal Koutný <mkoutny@suse.com> 于2026年6月10日周三 21:28写道：
>
> On Wed, Jun 10, 2026 at 03:11:41PM +0800, Longxing Li <coregee2000@gmail.com> wrote:
> > sorry for not containing full information in last email. the config[1]
> > and report[2] are as follows. CONFIG_PROVE_LOCKING is not enabled in
> > our config.
>
> Thanks.
>
> > INFO: task systemd:1 blocked for more than 143 seconds.
> >       Not tainted 7.0.6 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:systemd         state:D stack:20616 pid:1     tgid:1     ppid:0
> >    task_flags:0x400100 flags:0x00080001
> > Call Trace:
> >  <TASK>
> >  context_switch kernel/sched/core.c:5298 [inline]
> >  __schedule+0x1006/0x5f00 kernel/sched/core.c:6911
> >  __schedule_loop kernel/sched/core.c:6993 [inline]
> >  schedule+0xe7/0x3a0 kernel/sched/core.c:7008
> >  cgroup_drain_dying+0x1ed/0x360 kernel/cgroup/cgroup.c:6294
> >  cgroup_rmdir+0x38/0x300 kernel/cgroup/cgroup.c:6309
> >  kernfs_iop_rmdir+0x10a/0x180 fs/kernfs/dir.c:1311
> >  vfs_rmdir fs/namei.c:5344 [inline]
> >  vfs_rmdir+0x340/0x860 fs/namei.c:5317
> >  filename_rmdir+0x3be/0x510 fs/namei.c:5399
> >  __do_sys_rmdir fs/namei.c:5422 [inline]
> >  __se_sys_rmdir fs/namei.c:5419 [inline]
> >  __x64_sys_rmdir+0x47/0x90 fs/namei.c:5419
> >  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> >  do_syscall_64+0x11b/0xf80 arch/x86/entry/syscall_64.c:94
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Hm, hm, this kinds fits 93618edf75383 ("cgroup: Defer css percpu_ref
> kill on rmdir until cgroup is depopulated")
> which got into stable 7.0.9.
> Can you reproduce even with that (or newer) kernel?
>
> Michal

^ permalink raw reply

* Re: [Kernel Bug] INFO: rcu detected stall in count_memcg_event_mm
From: Longxing Li @ 2026-06-22 13:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: syzkaller, hannes, mhocko, roman.gushchin, muchun.song, cgroups,
	linux-mm, linux-kernel
In-Reply-To: <airuQ36pKQ8x8CZk@linux.dev>

  Hi Shakeel,

  Thank you for the reply.

  This seems like a CPU starvation scenario. The reproduction triggers
extreme CPU contention where rcu_preempt kthread cannot get sufficient
CPU time to complete grace periods.
  1. The stall occurs in count_memcg_event_mm / get_obj_cgroup_from_current
  2. Both paths correctly hold RCU read-side locks
  3. The deadlock is: page fault → dentry allocation → memcg kmem
tracking → RCU lookup → need RCU grace period → rcu_preempt starved
  4. This may require extreme stress to reproduce, and can be random

 We will try to reproduce with the latest kernel and report back if we
find any specific deficiency.

  Best regards,
  Longxing Li

Shakeel Butt <shakeel.butt@linux.dev> 于2026年6月12日周五 01:21写道：
>
> Hi Longxing,
>
> Thanks for the report.
>
> On Tue, Jun 09, 2026 at 07:57:56PM +0800, Longxing Li wrote:
> > Dear Linux kernel developers and maintainers,
> >
> > We would like to report a new kernel bug found by our tool. INFO: rcu
> > detected stall in count_memcg_event_mm. Details are as follows.
> >
> > Kernel commit: v5.15.189
>
> This is an old kernel, can you reproduce with the latest kernel? Also at the
> high level this seems like CPU starvation. Can you also describe your system and
> the workload/test which is trigerring this issue?
>

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-22 12:31 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <ajPS3AKrZEbZbXBw@gourry-fedora-PF4VCD3F>

On 6/18/26 13:13, Gregory Price wrote:
> On Thu, Jun 18, 2026 at 10:21:30AM +0200, Vlastimil Babka (SUSE) wrote:
>> On 6/15/26 17:37, Gregory Price wrote:
>> > 
>> > One thought would be a way to switch what fallback list is used, and
>> > then have specific fallback lists for certain contexts.
>> > 
>> > Right now there is a single example of this: __GFP_THISNODE
>> >   |= __GFP_THISNODE   =>  NOFALLBACK
>> >   &= ~__GFP_THISNODE  =>  FALLBACK
>> > 
>> > We could add an interface with the desired fallback list based as an
>> > argument, and let get_page_from_freelist to prefer that over the default
>> > global lists.
>> 
>> Does it mean a new argument in a number of functions in the page allocator,
>> or can it be mapped to alloc_flags (at least internally?), because the
>> number of possible fallback lists is small enough?
>>
> 
> What I ended up with was adding a single page_alloc.c external interface
> that allows you define the zonelist via an enum, and then an internal
> selector resolution in prepare_alloc_pages() stored in alloc_context

OK. Since it's in alloc_context then there should be no parameter bloat
inside page allocator. And for the single external entry point it's better
to be explicit.

> 
> eg:
> 
> static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask,
>                 struct alloc_context *ac, gfp_t *alloc_gfp,
>                 unsigned int *alloc_flags)
> {       
>         ac->highest_zoneidx = gfp_zone(gfp_mask);
>         ac->zonelist = select_zonelist(preferred_nid, gfp_mask, ac->zlsel);
> 	... snip ...
> }
> 
> struct folio *__folio_alloc_zonelist_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask,
>                 enum alloc_zonelist zlsel);
> 
> 
> The original __folio_alloc* functions just add a DEFAULT - which tells
> select_zonelist() to base the decision on __GFP_THISNODE.
> 
> 
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>                 nodemask_t *nodemask)
> {
>         return __folio_alloc_core(gfp, order, preferred_nid, nodemask,
>                                   ALLOC_ZONELIST_DEFAULT);
> }
> EXPORT_SYMBOL(__folio_alloc_noprof);
> 
> 
> This does a few things
>   - The isolation is structural, there is no way to accidentally
>     allocate private memory without passing ALLOC_ZONELIST_PRIVATE
> 
>   - The isolation forces folios - there are no non-folio interfaces
>     which allow zonelist selection
> 
>   - The zonelist selection is confined to this allocation context,
>     so no inheritence is possible.
> 

Ack.

> 
> I tried to avoid using an ALLOC_ flag so we can avoid yet another flag
> crunch, but there certainly are few enough zonelists that we could
> encode it there and expose it.  I know Brendan was looking at plumbing
> alloc flags out to an interface, so i'm open to that.
> 
> Externally the way I determine what zonelist to use is a lookup based on
> reason - letting the node filter.  This is really only needed in a
> couple spots:
> 
> mm/khugepaged.c:  enum alloc_zonelist zlsel = alloc_zonelist_for_node(node, NODE_ALLOC_RECLAIM);
> mm/vmscan.c:      mtc->zlsel = alloc_zonelist_for_nodemask(mtc->nmask, NODE_ALLOC_TIERING);
> mm/migrate.c:     .zlsel = alloc_zonelist_for_node(node, NODE_ALLOC_USER_MIGRATE),
> 
> static inline enum alloc_zonelist
> alloc_zonelist_for_node(int nid, enum node_alloc_reason reason)
> {
>         bool ok;
> 
>         if (!node_state(nid, N_MEMORY_PRIVATE))
>                 return ALLOC_ZONELIST_DEFAULT;
>         switch (reason) {
>         case NODE_ALLOC_RECLAIM:
>                 ok = node_is_reclaimable(nid);
>                 break;
>         case NODE_ALLOC_TIERING:
>                 ok = node_allows_tiering(nid);
>                 break;
>         case NODE_ALLOC_USER_MIGRATE:
>                 ok = node_allows_user_migrate(nid);
>                 break;
>         default:
>                 ok = false;
>         }
>         return ok ? ALLOC_ZONELIST_PRIVATE : ALLOC_ZONELIST_DEFAULT;
> }
> 
> Otherwise... everything is now a mempolicy w/ MPOL_F_BIND and all the
> handling goes through the normal fault-paths :]
> 
> static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
>                 struct mempolicy *pol, pgoff_t ilx, int nid)
> {
>         nodemask_t *nodemask;
>         struct page *page;
>         enum alloc_zonelist zlsel = (pol->flags & MPOL_F_PRIVATE) ?
>                 ALLOC_ZONELIST_PRIVATE : ALLOC_ZONELIST_DEFAULT;
> ...
>         if (pol->mode == MPOL_PREFERRED_MANY)
>                 return alloc_pages_preferred_many(gfp, order, nid, nodemask,
>                                                   zlsel);
> ...
> }
> 
> 
> Switching to an alloc_flag would probably be trivially if that's really
> wanted

I guess not. Thanks for the explanation!

> ~Gregory


^ permalink raw reply

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: David Hildenbrand (Arm) @ 2026-06-22 11:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Kaitao Cheng
  Cc: Andrew Morton, Jens Axboe, Tejun Heo, Alexander Viro,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, Christian König,
	David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao
In-Reply-To: <CAADnVQJmPWFT01b7DuLdtafv=8FyB84GYHNZ8zSTck+9Aw0JpA@mail.gmail.com>

On 6/22/26 07:28, Alexei Starovoitov wrote:
> On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>
>> From: chengkaitao <chengkaitao@kylinos.cn>
>>
>> The list_for_each*_safe() helpers are used when the loop body may remove
>> the current entry.  Their current interface, however, forces every caller
>> to define a temporary cursor outside the macro and pass it in, even when
>> the caller never uses that cursor directly.  For most call sites this
>> extra cursor is just boilerplate required by the macro implementation.
>>
>> This is awkward because the saved next pointer is an internal detail of
>> the iteration.  Callers that only remove or move the current entry do not
>> need to spell it out.
>>
>> The _safe() suffix has also caused confusion.  Christian Koenig pointed
>> out that the name is easy to read as a thread-safe variant, especially
>> for beginners, even though it only means that the iterator keeps enough
>> state to tolerate removal of the current entry.  He suggested _mutable()
>> as a clearer description of what the loop permits.
>>
>> Add *_mutable() iterator variants for list, hlist and llist.  The new
>> helpers are variadic and support both forms.  In the common case, the
>> caller omits the temporary cursor and the macro creates a unique internal
>> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
>> explicit temporary cursor, the caller can still pass it and the helper
>> keeps the existing *_safe() behaviour.
>>
>> For example, a call site may use the shorter form:
>>
>>   list_for_each_entry_mutable(pos, head, member)
>>
>> or keep the explicit temporary cursor form:
>>
>>   list_for_each_entry_mutable(pos, tmp, head, member)
>>
>> The existing *_safe() helpers remain available for compatibility.  This
>> series only converts users in mm, block, kernel, init and io_uring.  If
>> this approach looks acceptable, the remaining users can be converted in
>> follow-up series.
>>
>> Changes in v3 (Christian König, Andy Shevchenko):
>> - Convert safe list walks to mutable iterators
>>
>> Changes in v2 (Muchun Song, Andy Shevchenko):
>> - Drop the list_for_each_entry_mutable*() helpers from v1 and make the
>>   cursor change directly in the existing list_for_each_entry*() helpers.
>> - Open-code special list walks that rely on updating the loop cursor in
>>   the body, preserving their existing traversal semantics.
>>
>> Link to v2:
>> https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/
>>
>> Link to v1:
>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>
>> Kaitao Cheng (7):
>>   list: Add mutable iterator variants
>>   llist: Add mutable iterator variants
>>   mm: Use mutable list iterators
>>   block: Use mutable list iterators
>>   kernel: Use mutable list iterators
>>   initramfs: Use mutable list iterator
>>   io_uring: Use mutable list iterators
>>
>>  block/bfq-iosched.c                 |  17 +-
>>  block/blk-cgroup.c                  |  12 +-
>>  block/blk-flush.c                   |   4 +-
>>  block/blk-iocost.c                  |  18 +-
>>  block/blk-mq.c                      |   8 +-
>>  block/blk-throttle.c                |   4 +-
>>  block/kyber-iosched.c               |   4 +-
>>  block/partitions/ldm.c              |   8 +-
>>  block/sed-opal.c                    |   4 +-
>>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
>>  include/linux/llist.h               |  81 +++++++--
>>  init/initramfs.c                    |   5 +-
>>  io_uring/cancel.c                   |   6 +-
>>  io_uring/poll.c                     |   3 +-
>>  io_uring/rw.c                       |   4 +-
>>  io_uring/timeout.c                  |   8 +-
>>  io_uring/uring_cmd.c                |   3 +-
>>  kernel/audit_tree.c                 |   4 +-
>>  kernel/audit_watch.c                |  16 +-
>>  kernel/auditfilter.c                |   4 +-
>>  kernel/auditsc.c                    |   4 +-
>>  kernel/bpf/arena.c                  |  10 +-
>>  kernel/bpf/arraymap.c               |   8 +-
>>  kernel/bpf/bpf_local_storage.c      |   3 +-
>>  kernel/bpf/bpf_lru_list.c           |  25 ++-
>>  kernel/bpf/btf.c                    |  18 +-
>>  kernel/bpf/cgroup.c                 |   7 +-
>>  kernel/bpf/cpumap.c                 |   4 +-
>>  kernel/bpf/devmap.c                 |  10 +-
>>  kernel/bpf/helpers.c                |   8 +-
>>  kernel/bpf/local_storage.c          |   4 +-
>>  kernel/bpf/memalloc.c               |  16 +-
>>  kernel/bpf/offload.c                |   8 +-
>>  kernel/bpf/states.c                 |   4 +-
>>  kernel/bpf/stream.c                 |   4 +-
>>  kernel/bpf/verifier.c               |   6 +-
>>  kernel/cgroup/cgroup-v1.c           |   4 +-
>>  kernel/cgroup/cgroup.c              |  54 +++---
>>  kernel/cgroup/dmem.c                |  12 +-
>>  kernel/cgroup/rdma.c                |   8 +-
>>  kernel/events/core.c                |  44 +++--
>>  kernel/events/uprobes.c             |  12 +-
>>  kernel/exit.c                       |   8 +-
>>  kernel/fail_function.c              |   4 +-
>>  kernel/gcov/clang.c                 |   4 +-
>>  kernel/irq_work.c                   |   4 +-
>>  kernel/kexec_core.c                 |   4 +-
>>  kernel/kprobes.c                    |  16 +-
>>  kernel/livepatch/core.c             |   4 +-
>>  kernel/livepatch/core.h             |   4 +-
>>  kernel/liveupdate/kho_block.c       |   4 +-
>>  kernel/liveupdate/luo_flb.c         |   4 +-
>>  kernel/locking/rwsem.c              |   2 +-
>>  kernel/locking/test-ww_mutex.c      |   2 +-
>>  kernel/module/main.c                |  11 +-
>>  kernel/padata.c                     |   4 +-
>>  kernel/power/snapshot.c             |   8 +-
>>  kernel/power/wakelock.c             |   4 +-
>>  kernel/printk/printk.c              |  11 +-
>>  kernel/ptrace.c                     |   4 +-
>>  kernel/rcu/rcutorture.c             |   3 +-
>>  kernel/rcu/tasks.h                  |   9 +-
>>  kernel/rcu/tree.c                   |   6 +-
>>  kernel/resource.c                   |   4 +-
>>  kernel/sched/core.c                 |   4 +-
>>  kernel/sched/ext.c                  |  22 +--
>>  kernel/sched/fair.c                 |  28 +--
>>  kernel/sched/topology.c             |   4 +-
>>  kernel/sched/wait.c                 |   4 +-
>>  kernel/seccomp.c                    |   4 +-
>>  kernel/signal.c                     |  11 +-
>>  kernel/smp.c                        |   4 +-
>>  kernel/taskstats.c                  |   8 +-
>>  kernel/time/clockevents.c           |   6 +-
>>  kernel/time/clocksource.c           |   4 +-
>>  kernel/time/posix-cpu-timers.c      |   4 +-
>>  kernel/time/posix-timers.c          |   3 +-
>>  kernel/torture.c                    |   3 +-
>>  kernel/trace/bpf_trace.c            |   4 +-
>>  kernel/trace/ftrace.c               |  49 +++--
>>  kernel/trace/ring_buffer.c          |  25 ++-
>>  kernel/trace/trace.c                |  12 +-
>>  kernel/trace/trace_dynevent.c       |   6 +-
>>  kernel/trace/trace_dynevent.h       |   5 +-
>>  kernel/trace/trace_events.c         |  35 ++--
>>  kernel/trace/trace_events_filter.c  |   4 +-
>>  kernel/trace/trace_events_hist.c    |   8 +-
>>  kernel/trace/trace_events_trigger.c |  17 +-
>>  kernel/trace/trace_events_user.c    |  16 +-
>>  kernel/trace/trace_stat.c           |   4 +-
>>  kernel/user-return-notifier.c       |   3 +-
>>  kernel/workqueue.c                  |  16 +-
>>  mm/backing-dev.c                    |   8 +-
>>  mm/balloon.c                        |   8 +-
>>  mm/cma.c                            |   4 +-
>>  mm/compaction.c                     |   4 +-
>>  mm/damon/core.c                     |   4 +-
>>  mm/damon/sysfs-schemes.c            |   4 +-
>>  mm/dmapool.c                        |   4 +-
>>  mm/huge_memory.c                    |   8 +-
>>  mm/hugetlb.c                        |  56 +++---
>>  mm/hugetlb_vmemmap.c                |  16 +-
>>  mm/khugepaged.c                     |  14 +-
>>  mm/kmemleak.c                       |   7 +-
>>  mm/ksm.c                            |  25 +--
>>  mm/list_lru.c                       |   4 +-
>>  mm/memcontrol-v1.c                  |   8 +-
>>  mm/memory-failure.c                 |  12 +-
>>  mm/memory-tiers.c                   |   4 +-
>>  mm/migrate.c                        |  23 ++-
>>  mm/mmu_notifier.c                   |   9 +-
>>  mm/page_alloc.c                     |   8 +-
>>  mm/page_reporting.c                 |   2 +-
>>  mm/percpu.c                         |  11 +-
>>  mm/pgtable-generic.c                |   4 +-
>>  mm/rmap.c                           |  10 +-
>>  mm/shmem.c                          |   9 +-
>>  mm/slab_common.c                    |  14 +-
>>  mm/slub.c                           |  33 ++--
>>  mm/swapfile.c                       |   4 +-
>>  mm/userfaultfd.c                    |  12 +-
>>  mm/vmalloc.c                        |  24 +--
>>  mm/vmscan.c                         |   7 +-
>>  mm/zsmalloc.c                       |   4 +-
>>  124 files changed, 875 insertions(+), 681 deletions(-)
> 
> Not sure what you were thinking, but this diff stat
> is not landable.

Agreed. If we decide we want this, I guess we should target per-subsystem
conversions.

If this goes through the MM tree, I would even appreciate doing this on a per-MM
component granularity.

(unless we have some magic "Linus converts all of them" script, which I doubt we
will have)

Is there a way forward to replace list_for_each_*_safe entirely, possibly just
reusing the old name but simply the parameter?

-- 
Cheers,

David

^ permalink raw reply

* [PATCH] cgroup: Fix a typo of the function name in comment
From: Zenghui Yu @ 2026-06-22 11:07 UTC (permalink / raw)
  To: cgroups, linux-kernel; +Cc: tj, hannes, mkoutny, Zenghui Yu (Huawei)

From: "Zenghui Yu (Huawei)" <zenghui.yu@linux.dev>

... which was wrongly written as cgroup_threadcgroup_change_begin().

Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
---
 include/linux/cgroup-defs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index de2cd6238c2a..7a631a257613 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -896,7 +896,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups
  * @tsk: target task
  *
- * Counterpart of cgroup_threadcgroup_change_begin().
+ * Counterpart of cgroup_threadgroup_change_begin().
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-22 10:46 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Alexei Starovoitov, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Paul E. McKenney, Shakeel Butt, Christian König,
	David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao, Muchun Song
In-Reply-To: <8c8f1849-86d3-4c69-be27-30bbdffdf616@linux.dev>

On Mon, Jun 22, 2026 at 02:15:01PM +0800, Kaitao Cheng wrote:
> 在 2026/6/22 13:28, Alexei Starovoitov 写道:
> > On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

...

> >>  block/bfq-iosched.c                 |  17 +-
> >>  block/blk-cgroup.c                  |  12 +-
> >>  block/blk-flush.c                   |   4 +-
> >>  block/blk-iocost.c                  |  18 +-
> >>  block/blk-mq.c                      |   8 +-
> >>  block/blk-throttle.c                |   4 +-
> >>  block/kyber-iosched.c               |   4 +-
> >>  block/partitions/ldm.c              |   8 +-
> >>  block/sed-opal.c                    |   4 +-
> >>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
> >>  include/linux/llist.h               |  81 +++++++--
> >>  init/initramfs.c                    |   5 +-
> >>  io_uring/cancel.c                   |   6 +-
> >>  io_uring/poll.c                     |   3 +-
> >>  io_uring/rw.c                       |   4 +-
> >>  io_uring/timeout.c                  |   8 +-
> >>  io_uring/uring_cmd.c                |   3 +-
> >>  kernel/audit_tree.c                 |   4 +-
> >>  kernel/audit_watch.c                |  16 +-
> >>  kernel/auditfilter.c                |   4 +-
> >>  kernel/auditsc.c                    |   4 +-
> >>  kernel/bpf/arena.c                  |  10 +-
> >>  kernel/bpf/arraymap.c               |   8 +-
> >>  kernel/bpf/bpf_local_storage.c      |   3 +-
> >>  kernel/bpf/bpf_lru_list.c           |  25 ++-
> >>  kernel/bpf/btf.c                    |  18 +-
> >>  kernel/bpf/cgroup.c                 |   7 +-
> >>  kernel/bpf/cpumap.c                 |   4 +-
> >>  kernel/bpf/devmap.c                 |  10 +-
> >>  kernel/bpf/helpers.c                |   8 +-
> >>  kernel/bpf/local_storage.c          |   4 +-
> >>  kernel/bpf/memalloc.c               |  16 +-
> >>  kernel/bpf/offload.c                |   8 +-
> >>  kernel/bpf/states.c                 |   4 +-
> >>  kernel/bpf/stream.c                 |   4 +-
> >>  kernel/bpf/verifier.c               |   6 +-
> >>  kernel/cgroup/cgroup-v1.c           |   4 +-
> >>  kernel/cgroup/cgroup.c              |  54 +++---
> >>  kernel/cgroup/dmem.c                |  12 +-
> >>  kernel/cgroup/rdma.c                |   8 +-
> >>  kernel/events/core.c                |  44 +++--
> >>  kernel/events/uprobes.c             |  12 +-
> >>  kernel/exit.c                       |   8 +-
> >>  kernel/fail_function.c              |   4 +-
> >>  kernel/gcov/clang.c                 |   4 +-
> >>  kernel/irq_work.c                   |   4 +-
> >>  kernel/kexec_core.c                 |   4 +-
> >>  kernel/kprobes.c                    |  16 +-
> >>  kernel/livepatch/core.c             |   4 +-
> >>  kernel/livepatch/core.h             |   4 +-
> >>  kernel/liveupdate/kho_block.c       |   4 +-
> >>  kernel/liveupdate/luo_flb.c         |   4 +-
> >>  kernel/locking/rwsem.c              |   2 +-
> >>  kernel/locking/test-ww_mutex.c      |   2 +-
> >>  kernel/module/main.c                |  11 +-
> >>  kernel/padata.c                     |   4 +-
> >>  kernel/power/snapshot.c             |   8 +-
> >>  kernel/power/wakelock.c             |   4 +-
> >>  kernel/printk/printk.c              |  11 +-
> >>  kernel/ptrace.c                     |   4 +-
> >>  kernel/rcu/rcutorture.c             |   3 +-
> >>  kernel/rcu/tasks.h                  |   9 +-
> >>  kernel/rcu/tree.c                   |   6 +-
> >>  kernel/resource.c                   |   4 +-
> >>  kernel/sched/core.c                 |   4 +-
> >>  kernel/sched/ext.c                  |  22 +--
> >>  kernel/sched/fair.c                 |  28 +--
> >>  kernel/sched/topology.c             |   4 +-
> >>  kernel/sched/wait.c                 |   4 +-
> >>  kernel/seccomp.c                    |   4 +-
> >>  kernel/signal.c                     |  11 +-
> >>  kernel/smp.c                        |   4 +-
> >>  kernel/taskstats.c                  |   8 +-
> >>  kernel/time/clockevents.c           |   6 +-
> >>  kernel/time/clocksource.c           |   4 +-
> >>  kernel/time/posix-cpu-timers.c      |   4 +-
> >>  kernel/time/posix-timers.c          |   3 +-
> >>  kernel/torture.c                    |   3 +-
> >>  kernel/trace/bpf_trace.c            |   4 +-
> >>  kernel/trace/ftrace.c               |  49 +++--
> >>  kernel/trace/ring_buffer.c          |  25 ++-
> >>  kernel/trace/trace.c                |  12 +-
> >>  kernel/trace/trace_dynevent.c       |   6 +-
> >>  kernel/trace/trace_dynevent.h       |   5 +-
> >>  kernel/trace/trace_events.c         |  35 ++--
> >>  kernel/trace/trace_events_filter.c  |   4 +-
> >>  kernel/trace/trace_events_hist.c    |   8 +-
> >>  kernel/trace/trace_events_trigger.c |  17 +-
> >>  kernel/trace/trace_events_user.c    |  16 +-
> >>  kernel/trace/trace_stat.c           |   4 +-
> >>  kernel/user-return-notifier.c       |   3 +-
> >>  kernel/workqueue.c                  |  16 +-
> >>  mm/backing-dev.c                    |   8 +-
> >>  mm/balloon.c                        |   8 +-
> >>  mm/cma.c                            |   4 +-
> >>  mm/compaction.c                     |   4 +-
> >>  mm/damon/core.c                     |   4 +-
> >>  mm/damon/sysfs-schemes.c            |   4 +-
> >>  mm/dmapool.c                        |   4 +-
> >>  mm/huge_memory.c                    |   8 +-
> >>  mm/hugetlb.c                        |  56 +++---
> >>  mm/hugetlb_vmemmap.c                |  16 +-
> >>  mm/khugepaged.c                     |  14 +-
> >>  mm/kmemleak.c                       |   7 +-
> >>  mm/ksm.c                            |  25 +--
> >>  mm/list_lru.c                       |   4 +-
> >>  mm/memcontrol-v1.c                  |   8 +-
> >>  mm/memory-failure.c                 |  12 +-
> >>  mm/memory-tiers.c                   |   4 +-
> >>  mm/migrate.c                        |  23 ++-
> >>  mm/mmu_notifier.c                   |   9 +-
> >>  mm/page_alloc.c                     |   8 +-
> >>  mm/page_reporting.c                 |   2 +-
> >>  mm/percpu.c                         |  11 +-
> >>  mm/pgtable-generic.c                |   4 +-
> >>  mm/rmap.c                           |  10 +-
> >>  mm/shmem.c                          |   9 +-
> >>  mm/slab_common.c                    |  14 +-
> >>  mm/slub.c                           |  33 ++--
> >>  mm/swapfile.c                       |   4 +-
> >>  mm/userfaultfd.c                    |  12 +-
> >>  mm/vmalloc.c                        |  24 +--
> >>  mm/vmscan.c                         |   7 +-
> >>  mm/zsmalloc.c                       |   4 +-
> >>  124 files changed, 875 insertions(+), 681 deletions(-)
> > 
> > Not sure what you were thinking, but this diff stat
> > is not landable.
> 
> [PATCH v3 1/7] and [PATCH v3 2/7] contain the main logic and can
> be merged directly. They are also compatible with the old API.
> [PATCH v3 3/7] through [PATCH v3 7/7] are just simple interface
> replacements and do not change any functional logic. They can be
> left unmerged for now; individual modules can pick them up later
> if needed.
> 
> In v2, Andy Shevchenko mentioned: "If it's done by Linus himself
> during the day when he prepares -rc1, it's fine."

Yes, but you need to get his blessing first to go with this.
Have you communicated with him on this?

> Even so, the
> changes in this patch series are indeed quite large and touch
> almost every subsystem. I have only converted part of them for
> now, so I wanted to send this out first and see what people think.

That's why it's better to provide a script to convert (e.g., coccinelle)
instead of tons of patches.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH] block/cgroup: Drop stale -EBUSY retry from blkg_conf_prep()
From: Yang Xiuwei @ 2026-06-22  8:56 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe; +Cc: cgroups, linux-block, Yang Xiuwei

Since commit 8f4236d9008b ("block: remove QUEUE_FLAG_BYPASS and
->bypass") nothing in the blkcg blkg lookup/creation path
returns -EBUSY anymore. blkg_conf_prep() nevertheless still
retries at fail_exit with msleep(10) and restart_syscall()
— logic added in 2012 when blk_queue_bypass() could
cause blkg lookup/creation to fail with -EBUSY while the queue was
temporarily bypassed during elevator changes.

Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
---
 block/blk-cgroup.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3093c1c03902..259f2240e7df 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -919,16 +919,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	spin_unlock_irq(&q->queue_lock);
 fail_exit:
 	mutex_unlock(&q->blkcg_mutex);
-	/*
-	 * If queue was bypassing, we should retry.  Do so after a
-	 * short msleep().  It isn't strictly necessary but queue
-	 * can be bypassing for some time and it's always nice to
-	 * avoid busy looping.
-	 */
-	if (ret == -EBUSY) {
-		msleep(10);
-		ret = restart_syscall();
-	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkg_conf_prep);
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Christian König @ 2026-06-22  8:51 UTC (permalink / raw)
  To: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On 6/22/26 06:05, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

That sounds like a bad idea to me. The macro should really be doing one job and that as best as it can.

> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.

This can be perfectly used for code that which really needs the separate variable for the next entry.

Regards,
Christian.


> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_prev_safe is an old interface, use list_for_each_prev_mutable instead.
>   */
>  #define list_for_each_prev_safe(pos, n, head) \
>  	for (pos = (head)->prev, n = pos->prev; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->prev)
>  
> +#define __list_for_each_prev_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->prev)->prev;		\
> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->prev)
> +
> +#define __list_for_each_prev_mutable1(pos, head)			\
> +	__list_for_each_prev_mutable_internal(pos, __UNIQUE_ID(prev), head)
> +
> +#define __list_for_each_prev_mutable2(pos, prev, head)			\
> +	list_for_each_prev_safe(pos, prev, head)
> +
> +/**
> + * list_for_each_prev_mutable - iterate over a list backwards safe against entry removal
> + * @pos:	the &struct list_head to use as a loop cursor.
> + * @...:	either (head) or (prev, head)
> + *
> + * prev:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_prev_mutable(pos, ...)				\
> +	CONCATENATE(__list_for_each_prev_mutable, COUNT_ARGS(__VA_ARGS__)) \
> +		(pos, __VA_ARGS__)
> +
>  /**
>   * list_count_nodes - count nodes in the list
>   * @head:	the head for your list.
> @@ -895,12 +940,8 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	for (; !list_entry_is_head(pos, head, member);			\
>  	     pos = list_prev_entry(pos, member))
>  
> -/**
> - * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
> - * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> +/*
> + * list_for_each_entry_safe is an old interface, use list_for_each_entry_mutable instead.
>   */
>  #define list_for_each_entry_safe(pos, n, head, member)			\
>  	for (pos = list_first_entry(head, typeof(*pos), member),	\
> @@ -908,15 +949,36 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member); 			\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_internal(pos, tmp, head, member)	\
> +	for (typeof(pos) tmp = list_next_entry(pos =			\
> +		list_first_entry(head, typeof(*pos), member), member);	\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable2(pos, head, member)		\
> +	__list_for_each_entry_mutable_internal(pos, __UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable3(pos, next, head, member)		\
> +	list_for_each_entry_safe(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_continue - continue list iteration safe against removal
> + * list_for_each_entry_mutable - iterate over a list safe against entry removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate over list of given type, continuing after current point,
> - * safe against removal of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + */
> +#define list_for_each_entry_mutable(pos, ...)				\
> +	CONCATENATE(__list_for_each_entry_mutable, COUNT_ARGS(__VA_ARGS__)) \
> +		(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_continue is an old interface,
> + * use list_for_each_entry_mutable_continue instead.
>   */
>  #define list_for_each_entry_safe_continue(pos, n, head, member) 		\
>  	for (pos = list_next_entry(pos, member), 				\
> @@ -924,30 +986,79 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member);				\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_continue_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_next_entry(pos =			\
> +		list_next_entry(pos, member), member);			\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_continue2(pos, head, member)	\
> +	__list_for_each_entry_mutable_continue_internal(pos,		\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable_continue3(pos, next, head, member) \
> +	list_for_each_entry_safe_continue(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_from - iterate over list from current point safe against removal
> + * list_for_each_entry_mutable_continue - continue list iteration safe against removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate over list of given type from current point, safe against
> - * removal of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate over list of given type, continuing after current point,
> + * safe against removal of list entry.
> + */
> +#define list_for_each_entry_mutable_continue(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_continue,		\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_from is an old interface,
> + * use list_for_each_entry_mutable_from instead.
>   */
>  #define list_for_each_entry_safe_from(pos, n, head, member) 			\
>  	for (n = list_next_entry(pos, member);					\
>  	     !list_entry_is_head(pos, head, member);				\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_from_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_next_entry(pos, member);		\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_from2(pos, head, member)		\
> +	__list_for_each_entry_mutable_from_internal(pos,		\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable_from3(pos, next, head, member)	\
> +	list_for_each_entry_safe_from(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
> + * list_for_each_entry_mutable_from - iterate over list from current point safe against removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate backwards over list of given type, safe against removal
> - * of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate over list of given type from current point, safe against
> + * removal of list entry.
> + */
> +#define list_for_each_entry_mutable_from(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_from,			\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_reverse is an old interface,
> + * use list_for_each_entry_mutable_reverse instead.
>   */
>  #define list_for_each_entry_safe_reverse(pos, n, head, member)		\
>  	for (pos = list_last_entry(head, typeof(*pos), member),		\
> @@ -955,6 +1066,37 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member); 			\
>  	     pos = n, n = list_prev_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_reverse_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_prev_entry(pos =			\
> +		list_last_entry(head, typeof(*pos), member), member);	\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_prev_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_reverse2(pos, head, member)	\
> +	__list_for_each_entry_mutable_reverse_internal(pos,		\
> +		__UNIQUE_ID(prev), head, member)
> +
> +#define __list_for_each_entry_mutable_reverse3(pos, prev, head, member)	\
> +	list_for_each_entry_safe_reverse(pos, prev, head, member)
> +
> +/**
> + * list_for_each_entry_mutable_reverse - iterate backwards over list safe against removal
> + * @pos:	the type * to use as a loop cursor.
> + * @...:	either (head, member) or (prev, head, member)
> + *
> + * prev:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate backwards over list of given type, safe against removal
> + * of list entry.
> + */
> +#define list_for_each_entry_mutable_reverse(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_reverse,		\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
>  /**
>   * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
>   * @pos:	the loop cursor used in the list_for_each_entry_safe loop
> @@ -1189,6 +1331,31 @@ static inline void hlist_splice_init(struct hlist_head *from,
>  	for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
>  	     pos = n)
>  
> +#define __hlist_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->first) ? pos->next : NULL; \
> +	     pos;							\
> +	     pos = tmp, tmp = pos ? pos->next : NULL)
> +
> +#define __hlist_for_each_mutable1(pos, head)				\
> +	__hlist_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __hlist_for_each_mutable2(pos, next, head)			\
> +	hlist_for_each_safe(pos, next, head)
> +
> +/**
> + * hlist_for_each_mutable - iterate over a hlist safe against entry removal
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct hlist_node to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your hlist.
> + */
> +#define hlist_for_each_mutable(pos, ...)				\
> +	CONCATENATE(__hlist_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)
> +
>  #define hlist_entry_safe(ptr, type, member) \
>  	({ typeof(ptr) ____ptr = (ptr); \
>  	   ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
> @@ -1224,18 +1391,44 @@ static inline void hlist_splice_init(struct hlist_head *from,
>  	for (; pos;							\
>  	     pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
>  
> -/**
> - * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
> - * @pos:	the type * to use as a loop cursor.
> - * @n:		a &struct hlist_node to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the hlist_node within the struct.
> +/*
> + * hlist_for_each_entry_safe is an old interface, use hlist_for_each_entry_mutable instead.
>   */
>  #define hlist_for_each_entry_safe(pos, n, head, member) 		\
>  	for (pos = hlist_entry_safe((head)->first, typeof(*pos), member);\
>  	     pos && ({ n = pos->member.next; 1; });			\
>  	     pos = hlist_entry_safe(n, typeof(*pos), member))
>  
> +#define __hlist_for_each_entry_mutable_internal(pos, tmp, head, member)	\
> +	for (struct hlist_node *tmp = (pos =				\
> +		hlist_entry_safe((head)->first, typeof(*pos), member)) ? \
> +		pos->member.next : NULL;				\
> +	     pos;							\
> +	     pos = hlist_entry_safe((tmp), typeof(*pos), member),	\
> +		tmp = pos ? pos->member.next : NULL)
> +
> +#define __hlist_for_each_entry_mutable2(pos, head, member)		\
> +	__hlist_for_each_entry_mutable_internal(pos,			\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __hlist_for_each_entry_mutable3(pos, next, head, member)	\
> +	hlist_for_each_entry_safe(pos, next, head, member)
> +
> +/**
> + * hlist_for_each_entry_mutable - iterate over hlist safe against entry removal
> + * @pos:	the type * to use as a loop cursor.
> + * @...:	either (head, member) or (next, head, member)
> + *
> + * next:	a &struct hlist_node to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your hlist.
> + * member:	the name of the hlist_node within the struct.
> + */
> +#define hlist_for_each_entry_mutable(pos, ...)				\
> +	CONCATENATE(__hlist_for_each_entry_mutable,			\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
>  /**
>   * hlist_count_nodes - count nodes in the hlist
>   * @head:	the head for your hlist.


^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: David Laight @ 2026-06-22  8:42 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, linux-kernel,
	cgroups, linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf,
	netdev, dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026 12:05:31 +0800
Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

I'm not really sure 'mutable' means anything either.
It is possible to make it valid for the loop body (or even other threads)
to delete arbitrary list items - but that needs significant extra overheads.

It might be worth doing something that doesn't need the extra variable,
but there is little point doing all the churn just to rename things.

> 
> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.
> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\

Use auto

> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)

The variable argument count logic really just slows down compilation.
Maybe there aren't enough copies of this code to make that significant.
But just because you can do it doesn't mean it is a gooD idea.
I'm also not sure it really adds anything to the readability.

And, it you are going to make the middle argument optional there is
no need to change the macro name.

	David



^ permalink raw reply

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Jani Nikula @ 2026-06-22  8:37 UTC (permalink / raw)
  To: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, chengkaitao
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> Add *_mutable() iterator variants for list, hlist and llist.  The new
> helpers are variadic and support both forms.  In the common case, the
> caller omits the temporary cursor and the macro creates a unique internal
> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
> explicit temporary cursor, the caller can still pass it and the helper
> keeps the existing *_safe() behaviour.
>
> For example, a call site may use the shorter form:
>
>   list_for_each_entry_mutable(pos, head, member)
>
> or keep the explicit temporary cursor form:
>
>   list_for_each_entry_mutable(pos, tmp, head, member)

I'm unconvinced it's a good idea to allow two forms with macro trickery,
*especially* when it's not the last argument you can omit. I think it's
a footgun.

IMO stick with the first form only, and there'll always be the _safe
variant that can be used when the temp pointer is needed.


BR,
Jani.


-- 
Jani Nikula, Intel

^ permalink raw reply

* [PATCH 2/2] blk-cgroup: fix null-ptr-deref by freeing blkg pd on blkg_create error path
From: Zizhi Wo @ 2026-06-22  7:07 UTC (permalink / raw)
  To: axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai, wozizhi
In-Reply-To: <20260622070714.1158886-1-wozizhi@huaweicloud.com>

From: Zizhi Wo <wozizhi@huawei.com>

When blkg_create() fails before the blkg is linked onto blkcg->blkg_list
and q->blkg_list (e.g. radix_tree_insert() fails or the blkg_lookup()
returns NULL), the blkg is freed asynchronously via blkg_free_workfn().

Since such a blkg was never linked, it is invisible to
blkcg_deactivate_policy(), so its blkg->pd[] entries can not be cleared in
it. blkg_free_workfn() then calls blkcg_policy->pd_free_fn() on them, which
can race with bfq module exit (bfq_exit() -> blkcg_policy_unregister())
clearing the blkcg_policy[] slot, leading to a NULL pointer dereference:

[   72.597786] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[   72.598690] CPU: 35 UID: 0 PID: 458 Comm: kworker/35:1 Not tainted 7.1.0+ #33 PREEMPT(full)
[   72.599518] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014
[   72.600342] Workqueue: events blkg_free_workfn
[   72.600991] RIP: 0010:blkg_free_workfn+0x115/0x3d0
......
[   72.613278] Call Trace:
[   72.613988]  <TASK>
[   72.614357]  process_one_work+0x6b4/0xff0
[   72.615251]  ? __pfx_blkg_free_workfn+0x10/0x10
[   72.616041]  ? assign_work+0x131/0x3f0
[   72.616962]  worker_thread+0x4eb/0xd50
[   72.617599]  ? __kthread_parkme+0x8d/0x170
[   72.618565]  ? __pfx_worker_thread+0x10/0x10
[   72.619566]  ? __pfx_worker_thread+0x10/0x10
[   72.620213]  kthread+0x327/0x410
......

Fix this by introducing blkg_free_pd() to synchronously free the pd and
clear blkg->pd[] in the blkg_create() error path, while the blkcg_policy
is still valid.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-cgroup.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6386fe413994..673886d71c26 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -109,28 +109,37 @@ static struct cgroup_subsys_state *blkcg_css(void)
 	if (css)
 		return css;
 	return task_css(current, io_cgrp_id);
 }
 
+static void blkg_free_pd(struct blkcg_gq *blkg)
+{
+	int i;
+
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		if (blkg->pd[i]) {
+			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
+			blkg->pd[i] = NULL;
+		}
+	}
+}
+
 static void blkg_free_workfn(struct work_struct *work)
 {
 	struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
 					     free_work);
 	struct request_queue *q = blkg->q;
-	int i;
 
 	/*
 	 * pd_free_fn() can also be called from blkcg_deactivate_policy(),
 	 * in order to make sure pd_free_fn() is called in order, the deletion
 	 * of the list blkg->q_node is delayed to here from blkg_destroy(), and
 	 * blkcg_mutex is used to synchronize blkg_free_workfn() and
 	 * blkcg_deactivate_policy().
 	 */
 	mutex_lock(&q->blkcg_mutex);
-	for (i = 0; i < BLKCG_MAX_POLS; i++)
-		if (blkg->pd[i])
-			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
+	blkg_free_pd(blkg);
 	if (blkg->parent)
 		blkg_put(blkg->parent);
 	spin_lock_irq(&q->queue_lock);
 	list_del_init(&blkg->q_node);
 	spin_unlock_irq(&q->queue_lock);
@@ -436,19 +445,28 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	spin_unlock(&blkcg->lock);
 
 	if (!ret)
 		return blkg;
 
-	/* @blkg failed fully initialized, use the usual release path */
+	/*
+	 * @blkg failed fully initialized and never linked, so its pd[] is
+	 * invisible to blkcg_deactivate_policy(). Free pd[] synchronously
+	 * while blkcg_policy[] is still valid, otherwise the async free path
+	 * may call pd_free_fn() after the policy is unregistered (e.g. rmmod bfq).
+	 * The err_free_blkg path below frees pd[] for the same reason.
+	 */
+	blkg_free_pd(blkg);
 	percpu_ref_kill(&blkg->refcnt);
 	return ERR_PTR(ret);
 
 err_put_css:
 	css_put(&blkcg->css);
 err_free_blkg:
-	if (new_blkg)
+	if (new_blkg) {
+		blkg_free_pd(new_blkg);
 		blkg_free(new_blkg);
+	}
 	return ERR_PTR(ret);
 }
 
 /**
  * blkg_lookup_create - lookup blkg, try to create one if not there
-- 
2.52.0


^ permalink raw reply related

* [PATCH 0/2] fix two issues in blkg_create() error path
From: Zizhi Wo @ 2026-06-22  7:07 UTC (permalink / raw)
  To: axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai, wozizhi

From: Zizhi Wo <wozizhi@huawei.com>

This series fixes two issues on the blkg_create() error path.

Patch 1 fixes a blkg leak when blkg_create() fails.

Patch 2 fixes a null-ptr-deref by freeing blkg->pd synchronously on the
error path, otherwise the async free path may dereference an already
unregistered blkcg_policy.

Zizhi Wo (2):
  blk-cgroup: fix blkg leak in blkg_create() error path
  blk-cgroup: fix Null-ptr-deref by freeing blkg pd on blkg_create error
    path

 block/blk-cgroup.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

-- 
2.52.0

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox