Linux cgroups development

Linux cgroups development
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] cgroup/dmem: add per-region event counters
From: Hongfu Li @ 2026-06-25  2:10 UTC (permalink / raw)
  To: tj
  Cc: cgroups, corbet, dev, dri-devel, hannes, lihongfu, linux-doc,
	linux-kernel, mkoutny, mripard, natalie.vock, skhan, hongfu.li
In-Reply-To: <ajwnf0uzT4PMHYZx@slm.duckdns.org>

Hi, Tejun
Thanks for the review comments.

> > Add dmem.events to report hierarchical low/max event counts per DMEM
> > region.  Increment counters on dmem.max allocation failures and
> > dmem.low protection events.  The file is available for non-root cgroups
> > only.
> 
> Please don't double space in descs or comments. Also, maybe it's obvious but
> it'd help if you list why and how this is useful. Why do we want to add
> this?

I'll fix the double spacing in the commit message and comments.

As for the motivation: dmem already exposes per-region limits and current
usage, but not how often those limits actually matter at runtime. Without
event counters, it's hard to tell whether allocation failures come from
this cgroup, a parent limit, or pressure elsewhere in the hierarchy.
dmem.events provides that visibility for tuning dmem.low/dmem.max and
diagnosing recurring device memory pressure.

I'll expand the commit message to cover this.

> > +  dmem.events
> > +	A read-only file that reports the number of times each cgroup
> > +	has hit its configured memory limits.  The format lists each
> > +	region on a single line, followed by the event counters::
> > +
> > +	  drm/0000:03:00.0/vram0 low 0 max 3
> > +	  drm/0000:03:00.0/stolen low 0 max 0
> 
> This isn't a supported file format. Please read the documentation on allowed
> formats.

Thanks for catching this. I'll switch dmem.events to nested-keyed format (region low=N max=M).

Thanks again for the valuable feedback.

Best regards,
Hongfu

^ permalink raw reply

* [PATCH v3 0/4] blk-cgroup: fix blkg list and policy data races
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel

From: Yu Kuai <yukuai@fygo.io>

Hi,

This series fixes races around q->blkg_list and blkg policy data
lifetime.

Patch 1 protects blkg_destroy_all()'s q->blkg_list walk with
blkcg_mutex.

Patches 2-3 fix races between blkcg_activate_policy() and concurrent
blkg destruction.

Patch 4 factors the policy data teardown loop into a helper after the
race fixes.

Changes since v2:
- Rebase on the latest block-7.2 branch.

Changes since v1:
- Drop the BFQ q->blkg_list patch because the current block tree already
  has a stronger fix in commit 17b2d950a3c0 ("block, bfq: protect async
  queue reset with blkcg locks").
- Add Reviewed-by tags from Tang Yizhou.

Yu Kuai (1):
  blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with
    blkcg_mutex

Zheng Qixing (3):
  blk-cgroup: fix race between policy activation and blkg destruction
  blk-cgroup: skip dying blkg in blkcg_activate_policy()
  blk-cgroup: factor policy pd teardown loop into helper

 block/blk-cgroup.c | 65 +++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

-- 
2.51.0

^ permalink raw reply

* [PATCH v3 1/4] blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with blkcg_mutex
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() that removes blkgs from the list
while holding blkcg_mutex.

Add blkcg_mutex protection around the q->blkg_list iteration to prevent
potential list corruption or use-after-free issues.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d2a1f5903f24..d22a43c545b6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -567,10 +567,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 	struct blkcg_gq *blkg;
 	int count = BLKG_DESTROY_BATCH_SIZE;
 	int i;
 
 restart:
+	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct blkcg *blkcg = blkg->blkcg;
 
 		if (hlist_unhashed(&blkg->blkcg_node))
@@ -585,10 +586,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 		 * it when a batch of blkgs are destroyed.
 		 */
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
 			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 			cond_resched();
 			goto restart;
 		}
 	}
 
@@ -604,10 +606,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 			__clear_bit(pol->plid, q->blkcg_pols);
 	}
 
 	q->root_blkg = NULL;
 	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	wake_up_var(&q->root_blkg);
 }
 
 static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
-- 
2.51.0


^ permalink raw reply related

* [PATCH v3 2/4] blk-cgroup: fix race between policy activation and blkg destruction
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
  - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
  - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
  - Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
  - blkcgA is deleted concurrently
  - blkg1 is freed via blkg_free_workfn()
  - blkg1->pd is freed

T1 (continued):
  - Rollback path accesses blkg1->pd->online after pd is freed
  - Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d22a43c545b6..fd1eed67924b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1564,10 +1564,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn))
 		return -EINVAL;
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
+	mutex_lock(&q->blkcg_mutex);
 retry:
 	spin_lock_irq(&q->queue_lock);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1626,10 +1628,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	__set_bit(pol->plid, q->blkcg_pols);
 	ret = 0;
 
 	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
 		blkg_put(pinned_blkg);
 	if (pd_prealloc)
-- 
2.51.0


^ permalink raw reply related

* [PATCH v3 3/4] blk-cgroup: skip dying blkg in blkcg_activate_policy()
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs.

T1:                               T2:
                                  blkg_destroy
                                  kill(&blkg->refcnt) // blkg->refcnt=1->0
                                  blkg_release // call_rcu(__blkg_release)
                                  ...
                                  blkg_free_workfn
                                  ->pd_free_fn(pd)
elv_iosched_store
elevator_switch
...
iterate blkg list
blkg_get(blkg) // blkg->refcnt=0->1
                                  list_del_init(&blkg->q_node)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
rcu_accelerate_cbs // uaf

Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fd1eed67924b..d06915045bc4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1575,10 +1575,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
 		struct blkg_policy_data *pd;
 
 		if (blkg->pd[pol->plid])
 			continue;
+		if (hlist_unhashed(&blkg->blkcg_node))
+			continue;
 
 		/* If prealloc matches, use it; otherwise try GFP_NOWAIT */
 		if (blkg == pinned_blkg) {
 			pd = pd_prealloc;
 			pd_prealloc = NULL;
-- 
2.51.0


^ permalink raw reply related

* [PATCH v3 4/4] blk-cgroup: factor policy pd teardown loop into helper
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

Move the teardown sequence which offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability.

No functional change intended.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 57 ++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d06915045bc4..0b28420c108f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1527,10 +1527,35 @@ struct cgroup_subsys io_cgrp_subsys = {
 	.depends_on = 1 << memory_cgrp_id,
 #endif
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+/*
+ * Tear down per-blkg policy data for @pol on @q.
+ */
+static void blkcg_policy_teardown_pds(struct request_queue *q,
+				      const struct blkcg_policy *pol)
+{
+	struct blkcg_gq *blkg;
+
+	list_for_each_entry(blkg, &q->blkg_list, q_node) {
+		struct blkcg *blkcg = blkg->blkcg;
+		struct blkg_policy_data *pd;
+
+		spin_lock(&blkcg->lock);
+		pd = blkg->pd[pol->plid];
+		if (pd) {
+			if (pd->online && pol->pd_offline_fn)
+				pol->pd_offline_fn(pd);
+			pd->online = false;
+			pol->pd_free_fn(pd);
+			WRITE_ONCE(blkg->pd[pol->plid], NULL);
+		}
+		spin_unlock(&blkcg->lock);
+	}
+}
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a gendisk
  * @disk: gendisk of interest
  * @pol: blkcg policy to activate
  *
@@ -1642,25 +1667,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	return ret;
 
 enomem:
 	/* alloc failed, take down everything */
 	spin_lock_irq(&q->queue_lock);
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-		struct blkg_policy_data *pd;
-
-		spin_lock(&blkcg->lock);
-		pd = blkg->pd[pol->plid];
-		if (pd) {
-			if (pd->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(pd);
-			pd->online = false;
-			pol->pd_free_fn(pd);
-			WRITE_ONCE(blkg->pd[pol->plid], NULL);
-		}
-		spin_unlock(&blkcg->lock);
-	}
+	blkcg_policy_teardown_pds(q, pol);
 	spin_unlock_irq(&q->queue_lock);
 	ret = -ENOMEM;
 	goto out;
 }
 EXPORT_SYMBOL_GPL(blkcg_activate_policy);
@@ -1675,11 +1686,10 @@ EXPORT_SYMBOL_GPL(blkcg_activate_policy);
  */
 void blkcg_deactivate_policy(struct gendisk *disk,
 			     const struct blkcg_policy *pol)
 {
 	struct request_queue *q = disk->queue;
-	struct blkcg_gq *blkg;
 	unsigned int memflags;
 
 	if (!blkcg_policy_enabled(q, pol))
 		return;
 
@@ -1688,24 +1698,11 @@ void blkcg_deactivate_policy(struct gendisk *disk,
 
 	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 
 	__clear_bit(pol->plid, q->blkcg_pols);
-
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-
-		spin_lock(&blkcg->lock);
-		if (blkg->pd[pol->plid]) {
-			if (blkg->pd[pol->plid]->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(blkg->pd[pol->plid]);
-			pol->pd_free_fn(blkg->pd[pol->plid]);
-			blkg->pd[pol->plid] = NULL;
-		}
-		spin_unlock(&blkcg->lock);
-	}
-
+	blkcg_policy_teardown_pds(q, pol);
 	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-25  3:01 UTC (permalink / raw)
  To: David Laight, Christian König, Jani Nikula,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <20260624152324.3def88ce@pumpkin>

在 2026/6/24 22:23, David Laight 写道:
> On Wed, 24 Jun 2026 15:23:47 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>> 在 2026/6/22 16:42, David Laight 写道:  
>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>  
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>> every call site, even though most users only need it for the iterator
>>>>> implementation and never reference it in the loop body.
>>>>>
>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>> a unique internal cursor.  
>>>>
>>>> I'm not really sure 'mutable' means anything either.
>>>> It is possible to make it valid for the loop body (or even other threads)
>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>
>>>> It might be worth doing something that doesn't need the extra variable,
>>>> but there is little point doing all the churn just to rename things.
>>>>  
>>>>>
>>>>> This makes call sites that only mutate the list through the current entry
>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>> compatibility.
>>>>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>> --- a/include/linux/list.h
>>>>> +++ b/include/linux/list.h
>>>>> @@ -7,6 +7,7 @@
>>>>>  #include <linux/stddef.h>
>>>>>  #include <linux/poison.h>
>>>>>  #include <linux/const.h>
>>>>> +#include <linux/args.h>
>>>>>  
>>>>>  #include <asm/barrier.h>
>>>>>  
>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>  #define list_for_each_prev(pos, head) \
>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>  
>>>>> -/**
>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> +/*
>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>   */
>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>  	     !list_is_head(pos, (head)); \
>>>>>  	     pos = n, n = pos->next)
>>>>>  
>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>
>>>> Use auto
>>>>  
>>>>> +	     !list_is_head(pos, (head));				\
>>>>> +	     pos = tmp, tmp = pos->next)
>>>>> +
>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>> +
>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>> +	list_for_each_safe(pos, next, head)
>>>>> +
>>>>>  /**
>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> + * @...:	either (head) or (next, head)
>>>>> + *
>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>> + *		the caller.
>>>>> + * head:	the head for your list.
>>>>> + */
>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>> +		(pos, __VA_ARGS__)  
>>>>
>>>> The variable argument count logic really just slows down compilation.
>>>> Maybe there aren't enough copies of this code to make that significant.
>>>> But just because you can do it doesn't mean it is a gooD idea.
>>>> I'm also not sure it really adds anything to the readability.
>>>>
>>>> And, it you are going to make the middle argument optional there is
>>>> no need to change the macro name.  
>>>
>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>> implementation approach. If we abandon that method, it means we will
>>> inevitably need to add some new macros. If mutable is not a good name,
>>> suggestions for better alternatives would be welcome; coming up with a
>>> suitable name is indeed rather tricky.  
>>
>> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
>>
>> If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.
> 
> IIRC currently you have a choice of either:
> 	define               Item that can't be deleted
> 	list_for_each()	     The current item.
> 	list_for_each_safe() The next item.
> There is also likely to be code that updates the variables to allow
> for other scenarios.
> 
> Note that if increase a reference count and release a lock then list_for_each()
> is likely safer than list_for_each_safe() :-)
> 
> list.h has 9 variants of the 'safe' loop.
> The bloat of another 9 is getting excessive.
> 
> It has to be said that this is one of my least favourite type of list...

Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
Andy Shevchenko, Alexei Starovoitov

For ease of discussion, I need to summarize the currently possible
approaches and briefly describe their respective pros and cons,
using the list_for_each_entry* interfaces as examples.

1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
would be used specifically for safe deletion scenarios that do not
need to expose the temporary cursor externally. The code can refer to
the v1 version.

Pros: Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: Requires adding a whole set of mutable interfaces, which makes the
      code somewhat redundant.

2. Directly optimize away the temporary cursor in list_for_each_entry_safe
and define it inside the loop instead, changing the interface from four
arguments to three.

Pros: Does not add redundant interfaces.
Cons: (1) Users need to manually update special cases that use the
      traversal variable of list_for_each_entry_safe, the new
      list_for_each_entry_safe would no longer apply there and would
      need to be open-coded.
      (2) Because the macro arguments changes, all list_for_each_entry_safe
      callers would need to be modified and merged together, making it
      difficult to merge such a large amount of code at once.

3. Use a variadic macro approach to optimize list_for_each_entry_safe,
so that it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can
      be merged directly.
Cons: (1) Increases compile time.
      (2) Makes the interface harder for users to use.

4. Optimize list_for_each_entry by defining the temporary cursor internally,
making it compatible with the functionality of list_for_each_entry_safe.
The code can refer to the v2 version.

Pros: (1) Does not add redundant interfaces.
      (2) The number of externally visible arguments of list_for_each_entry
      remains unchanged, still three.
Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.
      (2) Users need to manually update special cases that use the traversal
      variable of list_for_each_entry, the new list_for_each_entry would no
      longer apply there and would need to be open-coded. There are 15 such
      cases in total.

5. Use a variadic macro approach to optimize list_for_each_entry, so that
it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: (1) Increases compile time.
      (2) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.

6. Make no changes, keep the current logic unchanged, and close the current
email discussion.


Which of the six solutions above do people prefer?

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* [ANNOUNCE/CFP] Linux Plumbers 2026 Containers and Checkpoint/Restore Microconference
From: Kamalesh Babulal @ 2026-06-25  3:55 UTC (permalink / raw)
  To: cgroups, containers, bpf, linux-fsdevel, linux-api,
	linux-integrity, criu, lxc-devel, fuse-devel
  Cc: Stéphane Graber, Mike Rapoport, Christian Brauner,
	Michal Koutný, Adrian Reber, Kamalesh Babulal

Hello,

We are pleased to announce the Call for Proposals for the Containers and
Checkpoint/Restore Microconference[0] at Linux Plumbers Conference 2026,
taking place in Prague, Czechia, from October 5 to 7, 2026.

This microconference will focus on current work and open problems in
containers, checkpoint/restore, kernel interfaces, and related userspace
tooling. We hope to bring together people working on container
runtimes, CRIU, init systems, distributions, orchestration systems, and
the kernel interfaces that make these pieces work together.

Topics of interest include, but are not limited to:

  - New VFS and syscall interfaces relevant to containers, including
    work around idmapped mounts

  - Closing remaining gaps between cgroup v1 and cgroup v2, and making
    migration easier

  - The growing role of eBPF in container runtimes, observability,
    policy enforcement, and checkpoint/restore

  - Mechanisms for mediating and intercepting increasingly complex
    system calls

  - Lowering the barriers to practical use of user namespaces

  - Attestation, measurement, and other approaches to establishing
    container integrity

  - Better resource-control interfaces and limits for containerized
    workloads

  - Keeping CRIU working smoothly on modern Linux distributions

  - Checkpoint/restore support for GPUs and similar accelerators

  - Restoring FUSE daemons and related userspace services

  - Handling restartable sequences correctly during checkpoint and
    restore

  - Support for newly added kernel features and interfaces

  - Shadow stack support on x86 and arm64

  - Support for madvise(MADV_GUARD_INSTALL) and mseal()

  - pidfd-based checkpoint/restore, including process-exit information

We are also interested in additional topics that may emerge as work
evolves over the coming months. Ongoing development work, operational
experience, unresolved kernel API questions, and cross-project
coordination topics are all welcome.

We encourage you to bring open questions, unresolved issues, or problems
that would benefit from input from others. In your proposal, please
include a short description of the topic, what you would like to
discuss, and what kind of feedback or collaboration would help move the
work forward.

Allocated time per session is expected to be between 15 and 30 minutes.

Please submit proposals through the LPC 2026 abstracts page by August 7:

        https://lpc.events/event/20/abstracts/

Linux Plumbers Conference 2026 will be a hybrid event. While in-person
presentation is preferred to help keep the sessions smooth and
interactive, remote presentation will also be available.

We are looking forward to your proposals and to seeing you in Prague.

[0] https://lpc.events/event/20/contributions/2332/

Thanks,
Containers & Checkpoint/Restart Microconference Team

^ permalink raw reply

* Re: [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs
From: Waiman Long @ 2026-06-25  5:27 UTC (permalink / raw)
  To: Jing Wu; +Cc: Thomas Gleixner, linux-kernel, rcu, cgroups, Qiliang Yuan
In-Reply-To: <20260624063404.2106807-1-realwujing@gmail.com>

On 6/24/26 2:34 AM, Jing Wu wrote:
> Hi Waiman,
>
> Thomas Gleixner suggested we coordinate, so reaching out directly.
>
> We have been working on a similar feature called Dynamic Housekeeping
> Management (DHM) [1][2][3][4]. The RFC was posted on 2026-02-06, v1 on
> 2026-03-25, and v2 on 2026-04-13 — a week before your series appeared.
> It seems we developed these independently in parallel.
>
> After Thomas's review of DHM v3, we are rebuilding v4 around the
> CPU-by-CPU offline/online hotplug mechanism, which aligns with the
> direction of your series.
>
> There is one key difference in scope worth discussing:
>
>    Your series requires "nohz_full=" to be present at boot (even with
>    an empty CPU list) to opt into runtime updates. DHM targets systems
>    where nohz_full= was never configured at boot — enabling CPU noise
>    isolation purely at runtime without any boot-time setup.
>
>    This requires making the nohz_full infrastructure activatable at
>    runtime for the first time, rather than just extending an already-
>    initialized boot configuration.
>
> Before we start coding v4, a few questions:
>
>    1. Are you planning a v2 of your series? If so, what is your
>       timeline? We want to avoid duplicating effort on the subsystem
>       patches (tick, RCU, genirq).

Yes, I am planning to send out a v2 in a few weeks depending on whether 
I can finish the other works that I am doing right now.


>
>    2. Would you be open to extending your series to cover the
>       "no boot parameter" use case, or do you think it is better kept
>       as a separate series?
The reason to make the v1 series depending on the nohz_full parameter is 
basically a short cut as some code will change its behavior slightly 
depending on if the nohz_full parameter is set. By making it optional, 
we just have to add more code to enable them. It is more work, but 
doable. I will make that optional in the next version, but I probably 
won't have all the needed code other than the essential ones and the 
rests will be handled in a followup patch series.
>
>    3. Are there specific patches in your series where you would welcome
>       our contribution directly?

I have broken down the shutdown callback into separate portions as 
suggested by Thomas. The other major change that I am working on is to 
try to shutdown to only CPUHP_AP_OFFLINE state instead of all the way 
down to CPUHP_OFFLINE. That will require some adjustments to the 
nohz_full related hotplug functions. I have some ideas of what needs to 
be done. However, I haven't looked into RCU yet. I know RCU support 
changing the nocb mask for fully offline CPUs, I will need to find out 
if it possible to do that for partially offline CPUs.

The work has been suspended for a while as I have other works to do. 
Hopefully I can restart it soon to further refresh my memory and we can 
discuss collaboration at that point.

Cheers,
Longman


>
> Happy to collaborate on a unified approach.
>
> [1] DHM RFC (2026-02-06): https://lore.kernel.org/r/20260206-feature-dynamic_isolcpus_dhei-v1-0-00a711eb0c74@gmail.com
> [2] DHM v1  (2026-03-25): https://lore.kernel.org/r/20260325-dhei-v12-final-v1-0-919cca23cadf@gmail.com
> [3] DHM v2  (2026-04-13): https://lore.kernel.org/r/20260413-wujing-dhm-v2-0-06df21caba5d@gmail.com
> [4] DHM v3  (2026-06-18): https://lore.kernel.org/r/20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com
> [5] Your series v1 (2026-04-20): https://lore.kernel.org/r/20260421030351.281436-1-longman@redhat.com
>
> Jing Wu <realwujing@gmail.com>
> Qiliang Yuan <yuanql9@chinatelecom.cn>
>


^ permalink raw reply

* Re: [PATCH v3] selftests/cgroup: Adjust cpu test duration based on HZ
From: Michal Koutný @ 2026-06-25  8:23 UTC (permalink / raw)
  To: Joe Simmons-Talbott
  Cc: Tejun Heo, Johannes Weiner, Shuah Khan, cgroups, linux-kselftest,
	linux-kernel
In-Reply-To: <20260624160358.430354-1-joest@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1898 bytes --]

Hi.

On Wed, Jun 24, 2026 at 12:03:57PM -0400, Joe Simmons-Talbott <joest@redhat.com> wrote:
> +/*
> + * Best effort attempt to get the kernel's HZ value from the config.
> + * Return the HZ value if found otherwise return -1 to indicate failure.
> + */
> +static long
> +_get_config_hz(void)

drop underscore from the static function

> +{
> +	long hz = -1;

use the default 1000 here to simplify the callers

> +	FILE *f;
> +	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
> +
> +	f = popen(cmd, "r");
> +
> +	if (!f)
> +		return hz;
> +
> +	if (fscanf(f, "CONFIG_HZ=%ld", &hz) == EOF)
> +		goto out;
> +
> +out:
> +	pclose(f);
> +	return hz;
> +}
> +
>  /*
>   * This test creates a cgroup with some maximum value within a period, and
>   * verifies that a process in the cgroup is not overscheduled.
> @@ -646,15 +670,21 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
>  static int test_cpucg_max(const char *root)
>  {
>  	int ret = KSFT_FAIL;
> +	long hz = _get_config_hz();
>  	long quota_usec = 1000;
>  	long default_period_usec = 100000; /* cpu.max's default period */
> -	long duration_seconds = 1;
> +	long duration_seconds;
>  
> -	long duration_usec = duration_seconds * USEC_PER_SEC;
> +	long duration_usec;
>  	long usage_usec, n_periods, remainder_usec, expected_usage_usec;
>  	char *cpucg;
>  	char quota_buf[32];
>  
> +	if (hz == -1)
> +		hz = 1000;
> +	duration_seconds = 1000 / hz;
> +	duration_usec = duration_seconds * USEC_PER_SEC;

I'd do the calculation in usecs

	duration_usec = duration_seconds * USEC_PER_SEC * 1000 / hz;

so that actual duration is more precise (for hz=300 which is the only
that doesn't divide 1000)

All in all, make the adjustments for HZ with less code (since I expect
this will need adjustments for SMPs in future).

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

page:              | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox