Linux cgroups development

Linux cgroups development
 help / color / mirror / Atom feed

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-06-16 20:24 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAO9r8zMHGFG_jcVeDPgowaQ2RNntp3KankwzQdgrJb9PrWu8_w@mail.gmail.com>

On Tue, Jun 16, 2026 at 4:10 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Tue, Jun 16, 2026 at 1:09 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Jun 16, 2026 at 3:54 PM Yosry Ahmed <yosry@kernel.org> wrote:
> > >
> > > On Tue, Jun 16, 2026 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > TBH, without vswap, we should not allow setting zswap as its own tier.
> > > > It's meaningless. Maybe makes it a no-op, and warn users what they're
> > > > setting is gibberish?
> > >
> > > Why? vswap is transparent to the user. Why can't zswap be its own tier?
> >
> > Without vswap, if you set zswap as its own tier, which phys swap
> > device should we allocate from for the backing slot? :)
>
> Today we just allocate a swap slot in a swapfile during reclaim,
> before swapout, and zswap will just writeback to that one. I assume
> the same will work with swap tiering, except that maybe the way that
> swap slot will respect the allowed swap tiers?

Yep! So if we set zswap as the only tier, then it wouldn't be able to
allocate a swap slot in swapfile right?

Or are you suggesting that if we set zswap as the only tier then we
can allocate from any swapfile (since we're not doing any IO anyway)?

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-06-16 20:10 UTC (permalink / raw)
  To: Nhat Pham
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAKEwX=N=Umi94wdKcLxEWOqUwhz6=Lj909pc1Pr_5ivVnZmdPQ@mail.gmail.com>

On Tue, Jun 16, 2026 at 1:09 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jun 16, 2026 at 3:54 PM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Tue, Jun 16, 2026 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > TBH, without vswap, we should not allow setting zswap as its own tier.
> > > It's meaningless. Maybe makes it a no-op, and warn users what they're
> > > setting is gibberish?
> >
> > Why? vswap is transparent to the user. Why can't zswap be its own tier?
>
> Without vswap, if you set zswap as its own tier, which phys swap
> device should we allocate from for the backing slot? :)

Today we just allocate a swap slot in a swapfile during reclaim,
before swapout, and zswap will just writeback to that one. I assume
the same will work with swap tiering, except that maybe the way that
swap slot will respect the allowed swap tiers?

>
> With vswap then it makes sense (and would probably be the "default"
> for zswap setup until we enable zswap writeback).

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-06-16 20:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAO9r8zOD7XaJ0Uo_LLLDTRKbeTOmAwmM3q8q6rUyH3oS-X3Csw@mail.gmail.com>

On Tue, Jun 16, 2026 at 3:54 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Tue, Jun 16, 2026 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > TBH, without vswap, we should not allow setting zswap as its own tier.
> > It's meaningless. Maybe makes it a no-op, and warn users what they're
> > setting is gibberish?
>
> Why? vswap is transparent to the user. Why can't zswap be its own tier?

Without vswap, if you set zswap as its own tier, which phys swap
device should we allocate from for the backing slot? :)

With vswap then it makes sense (and would probably be the "default"
for zswap setup until we enable zswap writeback).

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-06-16 19:54 UTC (permalink / raw)
  To: Nhat Pham
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAKEwX=Nz9SWcEVQGQjHN8P8OANJY4BG0w+iQOzoNOWuteoVjAg@mail.gmail.com>

On Tue, Jun 16, 2026 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jun 16, 2026 at 1:31 PM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Mon, Jun 15, 2026 at 8:08 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > ...
> > > > - "zswap tier only": Only zswap is allowed. Fallback to other swap is
> > > >   blocked.
> > > > - "zswap writeback disabled": zswap is allowed, but if zswap_store()
> > > >   fails, pages can still fall back to other swap devices.
> > >
> > > Upon double-checking the code, my previous clarification was wrong.
> > > You are right. Sorry for the confusion. "zswap tier only" is indeed
> > > equivalent to "zswap writeback disabled".
> > > (I'm not sure why I read the code that way...)
> > >
> > > As I initially thought, it might be possible to replace the zswap writeback
> > > control with the tiering mechanism.
> > >
> > > If we need to keep the existing interface, we can integrate or share the
> > > underlying logic (though the specific details need more thought anyway).
> > >
> > > It can be summarized as follows:
> > >
> > > - "zswap tier only" + "zswap writeback disable" -> meaningless (noop)
> > > - "zswap tier only" + "zswap writeback enable" -> meaningless (no writabck backend exist)
> > > - "zswap tier with other tiers" + "zswap writeback disable" -> uses only zswap
> > >   (can be replaced by "zswap tier only". This code could be intergrated, modified or something.)
> > > - "zswap tier with other tiers" + "zswap writeback enable" -> works as is
>
> TBH, without vswap, we should not allow setting zswap as its own tier.
> It's meaningless. Maybe makes it a no-op, and warn users what they're
> setting is gibberish?

Why? vswap is transparent to the user. Why can't zswap be its own tier?

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-06-16 18:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAO9r8zMimM8n54BL1viuX3pYzO=wzQU89LhCF1HW0bAv97ZQtg@mail.gmail.com>

On Tue, Jun 16, 2026 at 1:31 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Mon, Jun 15, 2026 at 8:08 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > ...
> > > - "zswap tier only": Only zswap is allowed. Fallback to other swap is
> > >   blocked.
> > > - "zswap writeback disabled": zswap is allowed, but if zswap_store()
> > >   fails, pages can still fall back to other swap devices.
> >
> > Upon double-checking the code, my previous clarification was wrong.
> > You are right. Sorry for the confusion. "zswap tier only" is indeed
> > equivalent to "zswap writeback disabled".
> > (I'm not sure why I read the code that way...)
> >
> > As I initially thought, it might be possible to replace the zswap writeback
> > control with the tiering mechanism.
> >
> > If we need to keep the existing interface, we can integrate or share the
> > underlying logic (though the specific details need more thought anyway).
> >
> > It can be summarized as follows:
> >
> > - "zswap tier only" + "zswap writeback disable" -> meaningless (noop)
> > - "zswap tier only" + "zswap writeback enable" -> meaningless (no writabck backend exist)
> > - "zswap tier with other tiers" + "zswap writeback disable" -> uses only zswap
> >   (can be replaced by "zswap tier only". This code could be intergrated, modified or something.)
> > - "zswap tier with other tiers" + "zswap writeback enable" -> works as is

TBH, without vswap, we should not allow setting zswap as its own tier.
It's meaningless. Maybe makes it a no-op, and warn users what they're
setting is gibberish?


>
> Hmm we might want to somehow disable memory.zswap.writeback if tiering
> is enabled, to avoid having to deal with this. But I am not sure how
> possible this is.

With tiering and without vswap, you still need an interface to
prescribe that a cgroup can:

1. Allocate slots from a certain swap device.

2. Use zswap, backed by those slots.

3. But no IO is allowed to those slots (no writeback or fallback on
zswap failure).

So we still need memory.zswap.writeback, until we get rid of non-vswap
case for zswap.

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-06-16 17:30 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <ajC+FNpkVpI4pbBz@yjaykim-PowerEdge-T330>

On Mon, Jun 15, 2026 at 8:08 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> ...
> > - "zswap tier only": Only zswap is allowed. Fallback to other swap is
> >   blocked.
> > - "zswap writeback disabled": zswap is allowed, but if zswap_store()
> >   fails, pages can still fall back to other swap devices.
>
> Upon double-checking the code, my previous clarification was wrong.
> You are right. Sorry for the confusion. "zswap tier only" is indeed
> equivalent to "zswap writeback disabled".
> (I'm not sure why I read the code that way...)
>
> As I initially thought, it might be possible to replace the zswap writeback
> control with the tiering mechanism.
>
> If we need to keep the existing interface, we can integrate or share the
> underlying logic (though the specific details need more thought anyway).
>
> It can be summarized as follows:
>
> - "zswap tier only" + "zswap writeback disable" -> meaningless (noop)
> - "zswap tier only" + "zswap writeback enable" -> meaningless (no writabck backend exist)
> - "zswap tier with other tiers" + "zswap writeback disable" -> uses only zswap
>   (can be replaced by "zswap tier only". This code could be intergrated, modified or something.)
> - "zswap tier with other tiers" + "zswap writeback enable" -> works as is

Hmm we might want to somehow disable memory.zswap.writeback if tiering
is enabled, to avoid having to deal with this. But I am not sure how
possible this is.

If swap tiering is behind a config option, maybe we can disable
memory.zswap.writeback under that config option? But then if distros
start enabling the config option it might break some users. Not sure
what's the right way to do this, but having both interfaces active at
the same time is annoying.

>
> As mentioned in the previous email, the zswap tier on/off control comes as a
> bonus (though, as you pointed out, we may need to discuss if it's actually
> needed).
>
> BR,
> Youngjun

^ permalink raw reply

* Re: [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Tang Yizhou @ 2026-06-16 16:50 UTC (permalink / raw)
  To: Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260616011746.2451461-1-wozizhi@huaweicloud.com>

On 16/6/26 9:17 am, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
> 
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
> 
>   BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>   Call Trace:
>   ...
>   blkcg_deactivate_policy+0x244/0x4d0
>   ioc_rqos_exit+0x44/0xe0
>   rq_qos_exit+0xba/0x120
>   __del_gendisk+0x50b/0x800
>   del_gendisk+0xff/0x190
>   ...
> 
> [CAUSE]
> process1						process2
> cgroup_rmdir
> ...
>   css_killed_work_fn
>     offline_css
>     ...
>       blkcg_destroy_blkgs
>       ...
>         __blkg_release
> 	  css_put(&blkg->blkcg->css)
>           blkg_free
> 	    INIT_WORK(xxx, blkg_free_workfn)
> 	    schedule_work
>     css_put
>     ...
>       blkcg_css_free
>         kfree(blkcg)--------blkcg has been freed!!!
> ====================================schedule_work
>               blkg_free_workfn
> 							__del_gendisk
> 							  rq_qos_exit
> 							    ioc_rqos_exit
> 							      blkcg_deactivate_policy
> 							        mutex_lock(&q->blkcg_mutex)
> 								spin_lock_irq(&q->queue_lock)
> 							        list_for_each_entry(blkg, xxx)
> 								  blkcg = blkg->blkcg
> 								  spin_lock(&blkcg->lock)-------UAF!!!
> 	        mutex_lock(&q->blkcg_mutex)
> 	        spin_lock_irq(&q->queue_lock)
> 	        /* Only then is the blkg removed from the list */
> 	        list_del_init(&blkg->q_node)
> 
> As a result, a blkg can still be reachable through q->blkg_list while
> its ->blkcg has already been freed.
> 
> [Fix]
> Fix this by deferring the blkcg css_put() until after the blkg has been
> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
> blkcg outlives every blkg still reachable through q->blkg_list, so any
> iterator holding q->queue_lock is guaranteed to observe a valid
> blkg->blkcg.
> 
> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
> so that the css reference is owned by the alloc/free pair rather than
> straddling layers:
> blkg_alloc()  <-> blkg_free()
> blkg_create() <-> blkg_destroy()
> 
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> Reviewed-by: Yu Kuai <yukuai@fygo.io>
> ---
> v3:
>  - move css_put() after mutex_unlock() in blkg_free_workfn().
> 
> v2:
>  - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>    css reference follows the blkg's own lifetime, making the put in
>    blkg_free_workfn() symmetric with the get in blkg_alloc().
> 
> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>  block/blk-cgroup.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bc63bd220865..3ac41f766caf 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -136,6 +136,11 @@ static void blkg_free_workfn(struct work_struct *work)
>  	spin_unlock_irq(&q->queue_lock);
>  	mutex_unlock(&q->blkcg_mutex);
>  
> +	/*
> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
> +	 */
> +	css_put(&blkg->blkcg->css);
>  	blk_put_queue(q);
>  	free_percpu(blkg->iostat_cpu);
>  	percpu_ref_exit(&blkg->refcnt);
> @@ -179,8 +184,6 @@ static void __blkg_release(struct rcu_head *rcu)
>  	for_each_possible_cpu(cpu)
>  		__blkcg_rstat_flush(blkcg, cpu);
>  
> -	/* release the blkcg and parent blkg refs this blkg has been holding */
> -	css_put(&blkg->blkcg->css);
>  	blkg_free(blkg);
>  }
>  
> @@ -313,6 +316,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>  		goto out_exit_refcnt;
>  	if (!blk_get_queue(disk->queue))
>  		goto out_free_iostat;
> +	/* blkg holds a reference to blkcg */
> +	if (!css_tryget_online(&blkcg->css))
> +		goto out_put_queue;
>  
>  	blkg->q = disk->queue;
>  	INIT_LIST_HEAD(&blkg->q_node);
> @@ -353,6 +359,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>  	while (--i >= 0)
>  		if (blkg->pd[i])
>  			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
> +	css_put(&blkcg->css);
> +out_put_queue:
>  	blk_put_queue(disk->queue);
>  out_free_iostat:
>  	free_percpu(blkg->iostat_cpu);
> @@ -381,18 +389,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  		goto err_free_blkg;
>  	}
>  
> -	/* blkg holds a reference to blkcg */
> -	if (!css_tryget_online(&blkcg->css)) {
> -		ret = -ENODEV;
> -		goto err_free_blkg;
> -	}
> -
>  	/* allocate */
>  	if (!new_blkg) {
>  		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>  		if (unlikely(!new_blkg)) {
>  			ret = -ENOMEM;
> -			goto err_put_css;
> +			goto err_free_blkg;
>  		}
>  	}
>  	blkg = new_blkg;
> @@ -402,7 +404,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>  		if (WARN_ON_ONCE(!blkg->parent)) {
>  			ret = -ENODEV;
> -			goto err_put_css;
> +			goto err_free_blkg;
>  		}
>  		blkg_get(blkg->parent);
>  	}
> @@ -442,8 +444,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  	blkg_put(blkg);
>  	return ERR_PTR(ret);
>  
> -err_put_css:
> -	css_put(&blkcg->css);
>  err_free_blkg:
>  	if (new_blkg)
>  		blkg_free(new_blkg);

LGTM.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>

-- 
Best Regards,
Yi


^ permalink raw reply

* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Tang Yizhou @ 2026-06-16 16:44 UTC (permalink / raw)
  To: Hou Tao, yukuai, Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1
In-Reply-To: <8bdf88b3-0879-e3ec-a52d-3e7559bfddbb@huaweicloud.com>

On 16/6/26 9:23 am, Hou Tao wrote:
> Hi,
> 
> On 6/16/2026 12:16 AM, Yu Kuai wrote:
>> Hi，
>>
>> 在 2026/6/15 19:55, Zizhi Wo 写道:
>>> From: Zizhi Wo <wozizhi@huawei.com>
>>>
>>> [BUG]
>>> Our fuzz testing triggered a blkcg use-after-free issue:
>>>
>>>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>>>    Call Trace:
>>>    ...
>>>    blkcg_deactivate_policy+0x244/0x4d0
>>>    ioc_rqos_exit+0x44/0xe0
>>>    rq_qos_exit+0xba/0x120
>>>    __del_gendisk+0x50b/0x800
>>>    del_gendisk+0xff/0x190
>>>    ...
>>>
>>> [CAUSE]
>>> process1						process2
>>> cgroup_rmdir
>>> ...
>>>    css_killed_work_fn
>>>      offline_css
>>>      ...
>>>        blkcg_destroy_blkgs
>>>        ...
>>>          __blkg_release
>>> 	  css_put(&blkg->blkcg->css)
>>>            blkg_free
>>> 	    INIT_WORK(xxx, blkg_free_workfn)
>>> 	    schedule_work
>>>      css_put
>>>      ...
>>>        blkcg_css_free
>>>          kfree(blkcg)--------blkcg has been freed!!!
>>> ====================================schedule_work
>>>                blkg_free_workfn
>>> 							__del_gendisk
>>> 							  rq_qos_exit
>>> 							    ioc_rqos_exit
>>> 							      blkcg_deactivate_policy
>>> 							        mutex_lock(&q->blkcg_mutex)
>>> 								spin_lock_irq(&q->queue_lock)
>>> 							        list_for_each_entry(blkg, xxx)
>>> 								  blkcg = blkg->blkcg
>>> 								  spin_lock(&blkcg->lock)-------UAF!!!
>>> 	        mutex_lock(&q->blkcg_mutex)
>>> 	        spin_lock_irq(&q->queue_lock)
>>> 	        /* Only then is the blkg removed from the list */
>>> 	        list_del_init(&blkg->q_node)
>>>
>>> As a result, a blkg can still be reachable through q->blkg_list while
>>> its ->blkcg has already been freed.
>>>
>>> [Fix]
>>> Fix this by deferring the blkcg css_put() until after the blkg has been
>>> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
>>> blkcg outlives every blkg still reachable through q->blkg_list, so any
>>> iterator holding q->queue_lock is guaranteed to observe a valid
>>> blkg->blkcg.
>>>
>>> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
>>> so that the css reference is owned by the alloc/free pair rather than
>>> straddling layers:
>>> blkg_alloc()  <-> blkg_free()
>>> blkg_create() <-> blkg_destroy()
>>>
>>> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
>>> Suggested-by: Hou Tao <houtao1@huawei.com>
>>> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
>>> ---
>>> v2:
>>>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>>>     css reference follows the blkg's own lifetime, making the put in
>>>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>>>
>>> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>>>
>>>   block/blk-cgroup.c | 24 ++++++++++++------------
>>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>> index bc63bd220865..27414c291e49 100644
>>> --- a/block/blk-cgroup.c
>>> +++ b/block/blk-cgroup.c
>>> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>>>   	if (blkg->parent)
>>>   		blkg_put(blkg->parent);
>>>   	spin_lock_irq(&q->queue_lock);
>>>   	list_del_init(&blkg->q_node);
>>>   	spin_unlock_irq(&q->queue_lock);
>>> +	/*
>>> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
>>> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
>>> +	 */
>>> +	css_put(&blkg->blkcg->css);
>>>   	mutex_unlock(&q->blkcg_mutex);
>> Please move css_put after mutex_unlock, unless there is a strong reason.
> 
> I think blkcg_mutex is used here to serialize the access of blkg->q_node
> and blkg->blkcg. We could move the css_put after the mutex_unlock(),
> however it stills depends on the mutex_lock and mutex_unlock pair on
> blkcg_mutex implicitly. Instead of such implicit dependency, we move the
> css_put inside the lock to make it be explicit.

Hi, I think I understand your point. Keeping css_put() inside blkcg_mutex makes the dependency explicit, since the same mutex serializes both the removal of blkg->q_node and the access to blkg->blkcg.

Placing css_put() after mutex_unlock(&q->blkcg_mutex) is still functionally correct. The blkg has already been removed from q->blkg_list under the mutex, so once we drop the mutex no iterator can reach this blkg anymore.

The benefit of moving it out is a smaller critical section.

-- 
Best Regards,
Yi

>>
>> With above change, feel free to add:
>>
>> Reviewed-by: Yu Kuai <yukuai@fygo.io>
>>
>>>   
>>>   	blk_put_queue(q);
>>>   	free_percpu(blkg->iostat_cpu);
>>>   	percpu_ref_exit(&blkg->refcnt);
>>> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>>>   	 * blkg_stat_lock is for serializing blkg stat update
>>>   	 */
>>>   	for_each_possible_cpu(cpu)
>>>   		__blkcg_rstat_flush(blkcg, cpu);
>>>   
>>> -	/* release the blkcg and parent blkg refs this blkg has been holding */
>>> -	css_put(&blkg->blkcg->css);
>>>   	blkg_free(blkg);
>>>   }
>>>   
>>>   /*
>>>    * A group is RCU protected, but having an rcu lock does not mean that one
>>> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>>>   	if (!blkg->iostat_cpu)
>>>   		goto out_exit_refcnt;
>>>   	if (!blk_get_queue(disk->queue))
>>>   		goto out_free_iostat;
>>> +	/* blkg holds a reference to blkcg */
>>> +	if (!css_tryget_online(&blkcg->css))
>>> +		goto out_put_queue;
>>>   
>>>   	blkg->q = disk->queue;
>>>   	INIT_LIST_HEAD(&blkg->q_node);
>>>   	blkg->blkcg = blkcg;
>>>   	blkg->iostat.blkg = blkg;
>>> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>>   
>>>   out_free_pds:
>>>   	while (--i >= 0)
>>>   		if (blkg->pd[i])
>>>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>>> +	css_put(&blkcg->css);
>>> +out_put_queue:
>>>   	blk_put_queue(disk->queue);
>>>   out_free_iostat:
>>>   	free_percpu(blkg->iostat_cpu);
>>>   out_exit_refcnt:
>>>   	percpu_ref_exit(&blkg->refcnt);
>>> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>>   	if (blk_queue_dying(disk->queue)) {
>>>   		ret = -ENODEV;
>>>   		goto err_free_blkg;
>>>   	}
>>>   
>>> -	/* blkg holds a reference to blkcg */
>>> -	if (!css_tryget_online(&blkcg->css)) {
>>> -		ret = -ENODEV;
>>> -		goto err_free_blkg;
>>> -	}
>>> -
>>>   	/* allocate */
>>>   	if (!new_blkg) {
>>>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>>>   		if (unlikely(!new_blkg)) {
>>>   			ret = -ENOMEM;
>>> -			goto err_put_css;
>>> +			goto err_free_blkg;
>>>   		}
>>>   	}
>>>   	blkg = new_blkg;
>>>   
>>>   	/* link parent */
>>>   	if (blkcg_parent(blkcg)) {
>>>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>>>   		if (WARN_ON_ONCE(!blkg->parent)) {
>>>   			ret = -ENODEV;
>>> -			goto err_put_css;
>>> +			goto err_free_blkg;
>>>   		}
>>>   		blkg_get(blkg->parent);
>>>   	}
>>>   
>>>   	/* invoke per-policy init */
>>> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>>   
>>>   	/* @blkg failed fully initialized, use the usual release path */
>>>   	blkg_put(blkg);
>>>   	return ERR_PTR(ret);
>>>   
>>> -err_put_css:
>>> -	css_put(&blkcg->css);
>>>   err_free_blkg:
>>>   	if (new_blkg)
>>>   		blkg_free(new_blkg);
>>>   	return ERR_PTR(ret);
>>>   }
> 
> 



^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-16 15:27 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Farhad Alemi, Andrew Morton, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <ajFTVaMBeu-ViGIC@gourry-fedora-PF4VCD3F>

On 6/16/26 9:44 AM, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 08:59:07AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/16/26 05:43, Waiman Long wrote:
>>> On 6/15/26 10:26 PM, Waiman Long wrote:
>>>>
>>>> The reason why I am suggesting to use cs->effective_mems to keep the old
>>>> cgroup v1 behavior. If the consensus is to use the output of
>>>> guarantee_online_mems() for mpol_rebind_mm(), I will not be against that but
>>>> it will be a slight change in user-visible behavior.
> I'm not grok'ing what is user-visible here.
>
> The two values should effectively be equivalent because we're
> using this value to constrain mpol's during a hotplug event.
>
> If the values differed, you would be saying there's a race condition
> that could affect correctness of the rebind (which can't happen,
> because this whole thing is done under the hotplug lock btw).
>
> Can you help me understand?

The effective_mems and guarantee_online_mems() should be more or less 
equivalent for v2, but not for v1. In my response to David, I gave an 
example of how the behavior will differ.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-16 15:23 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Gregory Price
  Cc: Farhad Alemi, Andrew Morton, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <51eafe6c-6622-479b-b391-6d3ff9350e75@kernel.org>

On 6/16/26 2:59 AM, David Hildenbrand (Arm) wrote:
> On 6/16/26 05:43, Waiman Long wrote:
>> On 6/15/26 10:26 PM, Waiman Long wrote:
>>> On 6/15/26 5:38 AM, Gregory Price wrote:
>>>> All interactions between mempolicy and cpuset are horrible and
>>>> confusing.  Much like Lorenzo's anon_vma work, I have to keep
>>>> notes on how this whole thing doesn't just spew SIGBUS constantly.
>>>>
>>>> The short answer is: mempolicy is advisory and cpuset is strictly
>>>> followed - in a dispute cpuset wins... except for file backed memory,
>>>> then everyon loses and nothing is consistent.
>>> That is what I believe why mpol_rebind_mm() a bit differently from the others
>>> and it is historically done this way a long time ago before cgroup v2.
>>>
>>> For cgroup v1, mems_allowed can't be empty or you can't put any task into the
>>> cpuset. Also effective_mems is the same as mems_allowed. cgroup v2 is quite
>>> different in how it handles memory nodes and CPUs. Users can isn't forced to
>>> set mems_allowed and cpus_allowed as effective_mems and effective_cpus will
>>> inherit parent version if mems_allowed and cpus_allowed are not set. IOW,
>>> effective_mems will never be empty. Yes, it is a bug with the introduction of
>>> cpuset v2 that we should have replaced mems_allowed by effective_mems at that
>>> time. With v2, effective_mems should contain only online nodes. The only
>>> exception is during the short transition period when a memory node hotunplug
>>> operation is in progress when a write to cpuset.mems is happening at the same
>>> time. With v1, it is theoretically possible that none of the nodes in
>>> mems_allowed is online.
>>>
>>> The reason why I am suggesting to use cs->effective_mems to keep the old
>>> cgroup v1 behavior. If the consensus is to use the output of
>>> guarantee_online_mems() for mpol_rebind_mm(), I will not be against that but
>>> it will be a slight change in user-visible behavior.
>> BTW, I still prefer the v2 patch. If it is decided we should use the
>> guarantee_online_mems() value instead, it will have to be a separate patch with
>> changes in the relevant documentation like Documentation/admin-guide/cgroup-v1/
>> cpuset.rst.
> newmems is "obviously" correct, so I really don't see why we should add
> something that needs half a page of text to explain why it is fine -- if newmems
> just does the trick?
>
> Please enlighten me.

Yes, taking newmems is a reasonable choice and there are pros and cons 
with each options. My focus is more on not changing how v1 cpuset 
behaves as it is well defined in the v1 cpusets.rst file:

     Requests by a task, using the sched_setaffinity(2) system call to
     include CPUs in its CPU affinity mask, and using the mbind(2) and
     set_mempolicy(2) system calls to include Memory Nodes in its memory
     policy, are both filtered through that task's cpuset, filtering out any
     CPUs or Memory Nodes not in that cpuset.  The scheduler will not
     schedule a task on a CPU that is not allowed in its cpus_allowed
     vector, and the kernel page allocator will not allocate a page on a
     node that is not allowed in the requesting task's mems_allowed vector.

v2, OTOH, is more vague as to what setting cpuset.mems will mean and we 
generally follow what v1 is doing, but we have more leeway of what we 
can do.

Using newmems will make the above text not totally correct. At least the 
offline memory nodes will be filtered out which will not be utilized by 
the task when the offline node becomes online. That is why I am saying 
that we will have to correct the documentation if we want to make this 
change.

Cheers,
Longman


^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-16 13:47 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Vlastimil Babka (SUSE), David Hildenbrand (Arm), Balbir Singh,
	lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <DJAGEUY8S09F.3V3HF570G85OF@linux.dev>

On Tue, Jun 16, 2026 at 11:57:42AM +0000, Brendan Jackman wrote:
> On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
> > On 6/12/26 17:29, Gregory Price wrote:
> >> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
> >>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> >>> > > 
> >
> > I think the memalloc approach is dangerous due to unexpected nesting. There
> > might be nested page allocations in page allocation itself (due to some
> > debugging option). But also interrupts do not change what "current" points
> > to. Suddenly those could start requesting folios and/or private nodes and be
> > surprised, I'm afraid.
> 
> Minor side-note: couldn't we just define it such that the allocator
> ignores the context when not in_task() (and warn if you try to enter the
> context while not currently in_task())?
> 
> (Don't think this would change the conclusion very much, e.g. doesn't
> help with the nesting issues. Mostly curious in case I'm missing a
> detail here).
>

I looked at this - only solves one issue and oh boy is that an obtuse
confusing condition to understand.  We still suffer from recursion in
reclaim.

~Gregory

^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Gregory Price @ 2026-06-16 13:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Waiman Long, Farhad Alemi, Andrew Morton, Farhad Alemi,
	Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim,
	Byungchul Park, Ying Huang, Alistair Popple, Rasmus Villemoes,
	linux-mm, linux-kernel, cgroups, stable
In-Reply-To: <51eafe6c-6622-479b-b391-6d3ff9350e75@kernel.org>

On Tue, Jun 16, 2026 at 08:59:07AM +0200, David Hildenbrand (Arm) wrote:
> On 6/16/26 05:43, Waiman Long wrote:
> > On 6/15/26 10:26 PM, Waiman Long wrote:
> >>
> >>
> >> The reason why I am suggesting to use cs->effective_mems to keep the old
> >> cgroup v1 behavior. If the consensus is to use the output of
> >> guarantee_online_mems() for mpol_rebind_mm(), I will not be against that but
> >> it will be a slight change in user-visible behavior.

I'm not grok'ing what is user-visible here.

The two values should effectively be equivalent because we're
using this value to constrain mpol's during a hotplug event.

If the values differed, you would be saying there's a race condition
that could affect correctness of the rebind (which can't happen,
because this whole thing is done under the hotplug lock btw).

Can you help me understand?

~Gregory

^ permalink raw reply

* Re: [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition)
From: Nhat Pham @ 2026-06-16 12:15 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Yosry Ahmed, akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	chengming.zhou, ljs, liam, vbabka, rppt, surenb, qi.zheng,
	axelrasmussen, yuanchu, weixugc, riel, gourry, haowenchao22,
	kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ajCm44rYpLOKCQ43@yjaykim-PowerEdge-T330>

On Mon, Jun 15, 2026 at 9:29 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Mon, Jun 15, 2026 at 12:56:26PM -0700, Yosry Ahmed wrote:
> > On Sun, Jun 14, 2026 at 7:39 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Sun, Jun 14, 2026 at 4:20 AM YoungJun Park <youngjun.park@lge.com> wrote:
> > > >
> > > > ...
> > > > > * Integration with swap.tier by Youngjun (see [12]). For now, I'm
> > > > >   leaning towards opting out the vswap device from swap.tier entirely, and
> > > > >   treat it as a special device. Integrating it with swap.tiers will
> > > > >   benefit the cases where you want some cgroups to skip vswap for fast
> > > > >   swap devices (pmem), whereas other should go through zswap first. But
> > > > >   most other use cases, either the overhead of vswap will be acceptable
> > > > >   (or not the bottleneck), or we can just disable CONFIG_VSWAP entirely :)
> > > > >
> > > > >   Youngjun, may I ask for your thoughts on this?
> > > >
> > > > Hi Nhat,
> > > >
> > > > Tier 1: VSWAP, Tier 2: ZSWAP ...
> > > >
> > > > I don't see any problem applying the desired functionality with the
> > > > currently proposed mechanism and interface. With this, a user would be
> > > > assigned the default Virtual -> RAM swap tier, and the overall picture
> > > > becomes one where swap tiers are composed according to the priority
> > > > setting.
> > >
> > > It's more - is there a strong argument to let vswap be a tier (which
> > > is not supported by just turning of vswap altogether).
> > >
> > > Because right now I'm not exposing vswap device to userspace in any
> > > manner, pretty much. It's abstract and transparent, and minimizes
> > > complexity (no vswap and swap.tier interaction) and surfaces for
> > > issues.
> >
> > I definitely think vswap should *not* be a tier. First of all, a vswap
> > entry can be backed by zswap or an actual swap device, which would be
> > two different tiers. How does that work?
> >
> > I also think vswap should not be exposed to userspace in any way, at
> > least not now. I still think we should aim to just make the
> > redirection layer always on and eliminate "vswap devices".

Yeah I will just expose a pair of usage/failure for diagnostics purposes :)

>
> After following the answers and giving it some thought, I agree that
> vswap should be kept user-transparent. If there is a strict need to
> disable it, relying on CONFIG_VSWAP to remove it entirely seems like
> the right approach.
>
> If a strong use case for user interaction emerges in the future, we can
> revisit the design and figure out how to handle it at that time.

Yeah the only argument for adding vswap to swap tier is if we want it
to virtualize swap on a per-cgroup basis, assuming:

1. There's a setup where some cgroups benefit from vswap and some
don't, in the same deployment or host (so you can't just use
CONFIG_VSWAP).

2. We can't decide it with some heuristics purely based on kernel's
knowledge (so for e.g, if a cgroup enables zswap, then vswap probably
makes more sense than not, etc. etc.).

Maybe I'm missing something, and if so please let me know. But
otherwise I'll stick with transparent vswap for the next version.

With Youngjun's new interface, if we made a mistake here and
per-cgroup vswapping turned out to be necessary, fixing it is fairly
cheap. We don't need to add any new knob - just need to expose it to
memory.swap.tier somehow, and we're done. That can be done as
follow-up :)

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Brendan Jackman @ 2026-06-16 11:57 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Gregory Price, David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>

On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
>>> > > 
>>> > > I understand this question in two ways:
>>> > > 
>>> > >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>>> > 
>>> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
>>> > me reply to that one below.
>>> > 
>>> ... snip ...
>>> > 
>>> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
>>> > context might be better. I think there was also talk about how the memalloc_*
>>> > interface might be a better way forward. Maybe we would start giving the
>>> > allocator more context ("we are allocating a folio").
>>> > 
>>> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>>> >
>>> 
>>> I will still probably send the next RFC version tomorrow or friday,
>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>> 
>>> Also, I made a new `anondax` driver which enables userland testing
>>> of this functionality without any specialty hardware.
>>> 
>> 
>> (apologies for the length of this email: this will all be covered in
>> the coming cover letter, but I just wanted to share a bit of a preview)
>> 
>> ===
>> 
>> Just another small update - I am planning to post the RFC today once i
>> get some mild cleanup done.  It will be based on the dax atomic hotplug
>> 
>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>> 
>> But a couple specific details regarding the memalloc pieces that i've
>> learned the past couple of days playing with it.
>> 
>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>>    on the private node, even if it happens within a memalloc_private
>>    context.  Since memalloc_folio may be useful in contexts outside of
>>    private nodes, I kept this as a separate flag.
>> 
>>    If we think there will *never* be additional users of memalloc_folio,
>>    then we could fold _folio into _private to save the flag for now and
>>    add it back when we actually need it.
>> 
>> 2) memalloc_private is needed to unlock private nodes, but in the
>>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>> 
>>    This is *highly* restrictive.  I found when playing with mbind that
>>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>>    implies a bug). 
>> 
>>    That leads me to #3
>
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.

Minor side-note: couldn't we just define it such that the allocator
ignores the context when not in_task() (and warn if you try to enter the
context while not currently in_task())?

(Don't think this would change the conclusion very much, e.g. doesn't
help with the nesting issues. Mostly curious in case I'm missing a
detail here).

> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Shakeel Butt @ 2026-06-16 11:44 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Yosry Ahmed, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <ajCgzNIPLhjTRSXR@yjaykim-PowerEdge-T330>

On Tue, Jun 16, 2026 at 10:03:08AM +0900, YoungJun Park wrote:
> On Mon, Jun 15, 2026 at 12:55:09PM -0700, Yosry Ahmed wrote:
> > > In that case, the internal logic could stay roughly the same rather
> > > than counting via a page counter. Something like:
> > >
> > > 1. Change the interface shell: tier.*.max — allow only 0 ~ max.
> > 
> > What about a single interface as I suggested to remain consistent with
> > memory tiering?
> 
> Hello Yosry!
> 
> I agree. As I was implementing the interface for seeing feasibility
> , I reconsidered it. Since swap tiers can be added or removed at runtime, 
> having static memory."tier_name".max files seems unnatural.
> 
> A single interface like `swap.tiers.max` would be better. We can use a
> flat-keyed format (similar to io.weight. same as you suggested)
> 
> echo ["tier_name"] ["0 or max"] > swap.tiers.max
> 
> I am now leaning towards this is a better direction than what I initially
> suggested (memory.swap.tiers and memory.swap.tiers.effective).
> 
> Considering other reviews and Shakeel's reply, I will update my swap tier
> patch accordingly.

I like Yosry's proposal. Let's go with that.

(I am travelling, so will be slow to respond)

^ permalink raw reply

* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: luca abeni @ 2026-06-16  7:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yuri Andriaccio, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Johannes Weiner,
	Michal Koutný, cgroups, linux-kernel, Yuri Andriaccio
In-Reply-To: <bcedd13fc22cb7ba590791a4c1387ecc@kernel.org>

Hi,

On Mon, 15 Jun 2026 10:38:54 -1000
Tejun Heo <tj@kernel.org> wrote:
[...]
> 2. root's cpu.rt.max: sched_rt_runtime_us already caps total DL/RT
>    bandwidth and rt-cgroups admit against the same pool, so what does
>    reserving the cgroup share separately at root add? It's also a
> writable control on root, which we otherwise keep off root.

Well, /proc/sys/kernel/sched_rt_{runtime,period}_us control the total
fraction of CPU time that can be reserved for dl and rt tasks; here, we
need a way to reserve a part of this fraction for rt cgropus only. We
need this because "regular" dl entities (the ones serving dl tasks) are
"global" (can migrate between different CPU cores), whereas dl entities
serving rt cgroups are pinned to their own CPU cores.

Combining the admission tests for these two different kinds of dl
entities is not simple, so we reserve a fixed fraction of CPU time for
rt cgroups, and the remaining part of
sched_rt_runtime_us/sched_rt_period_us is left to dl tasks.

If there is a better interface we can use to achieve this goal, we are
happy to switch to it.

			Thanks,
				Luca

^ permalink raw reply

* Re: [PATCH v3 08/15] mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
From: Harry Yoo @ 2026-06-16  7:36 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-8-ce1146d140fb@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 1650 bytes --]



On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> Convert the whole following call stack to pass either slab_alloc_context
> (thus including alloc_flags) or just alloc_flags as necessary:
> 
> slab_post_alloc_hook()
>   alloc_tagging_slab_alloc_hook()
>     __alloc_tagging_slab_alloc_hook()
>       prepare_slab_obj_exts_hook()
>         alloc_slab_obj_exts()
>   memcg_slab_post_alloc_hook()
>     __memcg_slab_post_alloc_hook()
>       alloc_slab_obj_exts()
> 
> Converting all these at once avoids unnecessary churn and is mostly
> mechanical.
> 
> This ultimately allows to decide if spinning is allowed using
> alloc_flags in alloc_slab_obj_exts(), as well as slab_post_alloc_hook().
> Aside from alloc_from_pcs_bulk() (to be handled next) there is nothing
> else in slab itself relying on gfpflags_allow_spinning() which can
> be false even if not called from kmalloc_nolock().
> 
> A followup change will also use the alloc_flags availability in the call
> stack above to remove the __GFP_NO_OBJ_EXT flag.
> 
> For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab"
> parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.
> 
> To further reduce the number of parameters of slab_post_alloc_hook(),
> also make 'struct list_lru *lru' (which is NULL for most callers) a new
> field of slab_alloc_context.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-9-7190909db118@kernel.org
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/15] mm/slab: pass alloc_flags to new slab allocation
From: Harry Yoo @ 2026-06-16  7:07 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-7-ce1146d140fb@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 679 bytes --]



On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> Add the alloc_flags parameter to allocate_slab() and new_slab()
> so it can be used to determine if spinning is allowed, independently
> from gfp flags.
> 
> refill_objects() passes SLAB_ALLOC_DEFAULT because it can only be
> reached from contexts that allow spinning.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-8-7190909db118@kernel.org
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: David Hildenbrand (Arm) @ 2026-06-16  6:59 UTC (permalink / raw)
  To: Waiman Long, Gregory Price
  Cc: Farhad Alemi, Andrew Morton, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <c1495b1b-9dee-4cd5-ac8e-eeb7a2d968ed@redhat.com>

On 6/16/26 05:43, Waiman Long wrote:
> On 6/15/26 10:26 PM, Waiman Long wrote:
>>
>> On 6/15/26 5:38 AM, Gregory Price wrote:
>>> All interactions between mempolicy and cpuset are horrible and
>>> confusing.  Much like Lorenzo's anon_vma work, I have to keep
>>> notes on how this whole thing doesn't just spew SIGBUS constantly.
>>>
>>> The short answer is: mempolicy is advisory and cpuset is strictly
>>> followed - in a dispute cpuset wins... except for file backed memory,
>>> then everyon loses and nothing is consistent.
>>
>> That is what I believe why mpol_rebind_mm() a bit differently from the others
>> and it is historically done this way a long time ago before cgroup v2.
>>
>> For cgroup v1, mems_allowed can't be empty or you can't put any task into the
>> cpuset. Also effective_mems is the same as mems_allowed. cgroup v2 is quite
>> different in how it handles memory nodes and CPUs. Users can isn't forced to
>> set mems_allowed and cpus_allowed as effective_mems and effective_cpus will
>> inherit parent version if mems_allowed and cpus_allowed are not set. IOW,
>> effective_mems will never be empty. Yes, it is a bug with the introduction of
>> cpuset v2 that we should have replaced mems_allowed by effective_mems at that
>> time. With v2, effective_mems should contain only online nodes. The only
>> exception is during the short transition period when a memory node hotunplug
>> operation is in progress when a write to cpuset.mems is happening at the same
>> time. With v1, it is theoretically possible that none of the nodes in
>> mems_allowed is online.
>>
>> The reason why I am suggesting to use cs->effective_mems to keep the old
>> cgroup v1 behavior. If the consensus is to use the output of
>> guarantee_online_mems() for mpol_rebind_mm(), I will not be against that but
>> it will be a slight change in user-visible behavior. 
> 
> BTW, I still prefer the v2 patch. If it is decided we should use the
> guarantee_online_mems() value instead, it will have to be a separate patch with
> changes in the relevant documentation like Documentation/admin-guide/cgroup-v1/
> cpuset.rst.

newmems is "obviously" correct, so I really don't see why we should add
something that needs half a page of text to explain why it is fine -- if newmems
just does the trick?

Please enlighten me.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v2 15/16] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Hao Ge @ 2026-06-16  6:47 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Suren Baghdasaryan, Hao Li
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	kasan-dev, linux-mm, linux-kernel, cgroups
In-Reply-To: <e17e628e-0633-4c5e-a9f9-ea68a4ca09df@kernel.org>

Hi Vlastimil and Suren

On 2026/6/15 19:11, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 07:38, Suren Baghdasaryan wrote:
>> On Fri, Jun 12, 2026 at 4:30 AM Hao Li <hao.li@linux.dev> wrote:
>>> On Fri, Jun 12, 2026 at 12:17:45PM +0200, Vlastimil Babka (SUSE) wrote:
>>>> On 6/12/26 08:54, Hao Li wrote:
>>>>> On Wed, Jun 10, 2026 at 05:40:17PM +0200, Vlastimil Babka (SUSE) wrote:
>>>>>> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
>>>>>> gfp flags are a scarce resource, unlike slab's alloc_flags.
>>>>>>
>>>>>> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
>>>>>> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
>>>>>> family function should not recurse into another kmalloc*() for the
>>>>>> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
>>>>>>
>>>>>> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
>>>>>> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
>>>>>> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
>>>>>> added. This will also pass through SLAB_ALLOC_TRYLOCK so we don't need
>>>>>> to special case kmalloc_nolock() anymore.
>>>>>>
>>>>>> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
>>>>>> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
>>>>>> the incoming gfp flags (only augmented with __GFP_ZERO), because if
>>>>>> alloc_flags contain SLAB_ALLOC_TRYLOCK, the incoming gfp flags have to
>>>>>> be also compatible with it.
>>>>>>
>>>>>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>>>>>> ---
>>>>>>   mm/slab.h |  1 +
>>>>>>   mm/slub.c | 13 +++++--------
>>>>>>   2 files changed, 6 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/slab.h b/mm/slab.h
>>>>>> index 45bfcfb35a9c..509f330654b8 100644
>>>>>> --- a/mm/slab.h
>>>>>> +++ b/mm/slab.h
>>>>>> @@ -21,6 +21,7 @@
>>>>>>   #define SLAB_ALLOC_DEFAULT        0x00 /* no flags */
>>>>>>   #define SLAB_ALLOC_TRYLOCK        0x01 /* a kmalloc_nolock() allocation */
>>>>>>   #define SLAB_ALLOC_NEW_SLAB       0x02 /* a flag for alloc_slab_obj_exts() */
>>>>>> +#define SLAB_ALLOC_NO_RECURSE     0x04 /* prevent kmalloc() recursion */
>>>>>>
>>>>>>   static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
>>>>>>   {
>>>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>>>> index cbb38bd01e46..7dfbd0251aa2 100644
>>>>>> --- a/mm/slub.c
>>>>>> +++ b/mm/slub.c
>>>>>> @@ -2167,15 +2167,12 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>>>>>>
>>>>>>     gfp &= ~OBJCGS_CLEAR_MASK;
>>>>>>     /* Prevent recursive extension vector allocation */
>>>>>> -  gfp |= __GFP_NO_OBJ_EXT;
>>>>>> +  alloc_flags |= SLAB_ALLOC_NO_RECURSE;
>>>>>>
>>>>>>     sz = obj_exts_alloc_size(s, slab, gfp);
>>>>>>
>>>>> For the original calls to kmalloc_nolock and kmalloc_node, I notice a difference:
>>>>>
>>>>>> -  if (unlikely(!allow_spin))
>>>>>> -          vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
>>>>>> -                               slab_nid(slab));
>>>>> kmalloc_nolock completely discarded `gfp` flags.
>>>>>
>>>>>> -  else
>>>>>> -          vec = kmalloc_node(sz, gfp | __GFP_ZERO, slab_nid(slab));
>>>>> while kmalloc_node preserved and passed it along.
>>>>>
>>>>>> +  /* This will use kmalloc_nolock() if alloc_flags say so */
>>>>>> +  vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));
>>>>> Now both paths are merged into kmalloc_flags, the gfp flags are
>>>>> unconditionally carried through. It seems this might carry some unwanted flags.
>>>>>
>>>>> I traced the call path and found that ___slab_alloc sets the __GFP_THISNODE
>>>>> for trynode_flags. If this flag propagates all the way into
>>>>> kmalloc_flags->...->__kmalloc_nolock_noprof, it will trigger the
>>>>> VM_WARN_ON_ONCE warning. Maybe we need to strip the original gfp if
>>>>> `!allow_spin`.
>>>> Thanks. This should do the job in a more generic way I hope?
>>>>
>>> Yeah, this is more elegant.
>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index f9b8dc56bb57..0bf53f70c9be 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -2047,12 +2047,15 @@ static inline void dec_slabs_node(struct kmem_cache *s, int node,
>>>>   #endif /* CONFIG_SLUB_DEBUG */
>>>>
>>>>   /*
>>>> - * The allocated objcg pointers array is not accounted directly.
>>>> + * The allocated objcg pointers array or sheaf is not accounted directly.
>>>>    * Moreover, it should not come from DMA buffer and is not readily
>>>> - * reclaimable. So those GFP bits should be masked off.
>>>> + * reclaimable. Node restriction for the parent allocation also should
>>>> + * not apply to the slab's internal objects.
>>>> + * So those GFP bits should be masked off.
>>>>    */
>>>>   #define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | \
>>>> -                               __GFP_ACCOUNT | __GFP_NOFAIL)
>>>> +                               __GFP_ACCOUNT | __GFP_NOFAIL |
>>>> +                               __GFP_THISNODE )
>>> Good idea! Both code and comments make sense to me.
>> Makes sense. I see
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
>> already implementing this and also keeping __GFP_NO_OBJ_EXT and
>> SLAB_ALLOC_NO_RECURSE both used. That version looks good to me, so
>> I'll wait for v3.
> OK.
>
>> At the end of this series, we end up with no users of __GFP_NO_OBJ_EXT
>> but we still keep it defined. I'm guessing you leave it because of the
>> new patch [1] which aliases __GFP_NO_OBJ_EXT? I will have to make that
> Yeah.
>
>> mechanism work without a GFP flag, possibly using a similar approach.
>> CC'ing Hao Ge to be in the loop of these changes. I'll work with him
>> on aliminating that __GFP_NO_OBJ_EXT alias.

Glad to work with you on this.

I'm still figuring out a proper solution.

Once I make some progress, I'll start a separate mail

thread for this to avoid disturbing too many people.

Thanks

Best Regards

Hao

> Good, then we can remove the flag completely.
>
>> [1] https://lore.kernel.org/all/20260604024008.46592-1-hao.ge@linux.dev/
>>
>>>>   #ifdef CONFIG_SLAB_OBJ_EXT
>>>>
>>>>
>>> --
>>> Thanks,
>>> Hao

^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-16  3:44 UTC (permalink / raw)
  To: Farhad Alemi, Andrew Morton
  Cc: Farhad Alemi, David Hildenbrand, Gregory Price, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <CA+0ovCgfHJHv5d1mzapWWvF-LhjppzDX8NPPLvCPZxPKg8RiYw@mail.gmail.com>


On 6/14/26 9:25 AM, Farhad Alemi wrote:
> Creating a child cpuset where cpuset.mems is never set leads to a div/0
> when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a
> CPU hotplug event.
>
> Reproduction steps:
>   1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
>   2) Move the task into the child cpuset
>   3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
>   4) unplug and hotplug a cpu
>        echo 0 > /sys/devices/system/cpu/cpu1/online
>        echo 1 > /sys/devices/system/cpu/cpu1/online
>   5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
>      call to __nodes_fold()
>
> The cpuset code passes (cs->mems_allowed) which is not guaranteed to have
> nodes to the rebind routine.  Use cs->effective_mems instead, which is
> guaranteed to have a non-empty nodemask.
>
> Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/
> Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
> Suggested-by: Gregory Price <gourry@gourry.net>
> Suggested-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Cc: stable@vger.kernel.org
> ---
> v2: rebind to cs->effective_mems instead of newmems (Waiman Long);
>      condense the changelog.
>
>   kernel/cgroup/cpuset.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>
>   		migrate = is_memory_migrate(cs);
>
> -		mpol_rebind_mm(mm, &cs->mems_allowed);
> +		mpol_rebind_mm(mm, &cs->effective_mems);
>   		if (migrate)
>   			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
>   		else
Acked-by: Waiman Long <longman@redhat.com>


^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-16  3:43 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Farhad Alemi, Andrew Morton, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <70f486ce-5ef6-4d72-8cc3-7086f4eea930@redhat.com>

On 6/15/26 10:26 PM, Waiman Long wrote:
>
> On 6/15/26 5:38 AM, Gregory Price wrote:
>> On Mon, Jun 15, 2026 at 10:08:51AM +0200, David Hildenbrand (Arm) wrote:
>>> On 6/14/26 15:25, Farhad Alemi wrote:
>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>> --- a/kernel/cgroup/cpuset.c
>>>> +++ b/kernel/cgroup/cpuset.c
>>>> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct 
>>>> cpuset *cs)
>>>>
>>>>           migrate = is_memory_migrate(cs);
>>>>
>>>> -        mpol_rebind_mm(mm, &cs->mems_allowed);
>>>> +        mpol_rebind_mm(mm, &cs->effective_mems);
>>> God this is confusing.
>>>
>> All interactions between mempolicy and cpuset are horrible and
>> confusing.  Much like Lorenzo's anon_vma work, I have to keep
>> notes on how this whole thing doesn't just spew SIGBUS constantly.
>>
>> The short answer is: mempolicy is advisory and cpuset is strictly
>> followed - in a dispute cpuset wins... except for file backed memory,
>> then everyon loses and nothing is consistent.
>
> That is what I believe why mpol_rebind_mm() a bit differently from the 
> others and it is historically done this way a long time ago before 
> cgroup v2.
>
> For cgroup v1, mems_allowed can't be empty or you can't put any task 
> into the cpuset. Also effective_mems is the same as mems_allowed. 
> cgroup v2 is quite different in how it handles memory nodes and CPUs. 
> Users can isn't forced to set mems_allowed and cpus_allowed as 
> effective_mems and effective_cpus will inherit parent version if 
> mems_allowed and cpus_allowed are not set. IOW, effective_mems will 
> never be empty. Yes, it is a bug with the introduction of cpuset v2 
> that we should have replaced mems_allowed by effective_mems at that 
> time. With v2, effective_mems should contain only online nodes. The 
> only exception is during the short transition period when a memory 
> node hotunplug operation is in progress when a write to cpuset.mems is 
> happening at the same time. With v1, it is theoretically possible that 
> none of the nodes in mems_allowed is online.
>
> The reason why I am suggesting to use cs->effective_mems to keep the 
> old cgroup v1 behavior. If the consensus is to use the output of 
> guarantee_online_mems() for mpol_rebind_mm(), I will not be against 
> that but it will be a slight change in user-visible behavior. 

BTW, I still prefer the v2 patch. If it is decided we should use the 
guarantee_online_mems() value instead, it will have to be a separate 
patch with changes in the relevant documentation like 
Documentation/admin-guide/cgroup-v1/cpuset.rst.

Cheers,
Longman


^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: YoungJun Park @ 2026-06-16  3:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy, youngjun.park
In-Reply-To: <ajCgzNIPLhjTRSXR@yjaykim-PowerEdge-T330>

...
> - "zswap tier only": Only zswap is allowed. Fallback to other swap is
>   blocked.
> - "zswap writeback disabled": zswap is allowed, but if zswap_store()
>   fails, pages can still fall back to other swap devices.

Upon double-checking the code, my previous clarification was wrong.
You are right. Sorry for the confusion. "zswap tier only" is indeed
equivalent to "zswap writeback disabled".
(I'm not sure why I read the code that way...)

As I initially thought, it might be possible to replace the zswap writeback
control with the tiering mechanism.

If we need to keep the existing interface, we can integrate or share the
underlying logic (though the specific details need more thought anyway).

It can be summarized as follows:

- "zswap tier only" + "zswap writeback disable" -> meaningless (noop)
- "zswap tier only" + "zswap writeback enable" -> meaningless (no writabck backend exist)
- "zswap tier with other tiers" + "zswap writeback disable" -> uses only zswap
  (can be replaced by "zswap tier only". This code could be intergrated, modified or something.)
- "zswap tier with other tiers" + "zswap writeback enable" -> works as is

As mentioned in the previous email, the zswap tier on/off control comes as a
bonus (though, as you pointed out, we may need to discuss if it's actually
needed).

BR,
Youngjun

^ permalink raw reply

* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-16  2:26 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Farhad Alemi, Andrew Morton, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <ai_IHvyptWPcTD0y@gourry-fedora-PF4VCD3F>

On 6/15/26 5:38 AM, Gregory Price wrote:
> On Mon, Jun 15, 2026 at 10:08:51AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/14/26 15:25, Farhad Alemi wrote:
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>>>
>>>   		migrate = is_memory_migrate(cs);
>>>
>>> -		mpol_rebind_mm(mm, &cs->mems_allowed);
>>> +		mpol_rebind_mm(mm, &cs->effective_mems);
>> God this is confusing.
>>
> All interactions between mempolicy and cpuset are horrible and
> confusing.  Much like Lorenzo's anon_vma work, I have to keep
> notes on how this whole thing doesn't just spew SIGBUS constantly.
>
> The short answer is: mempolicy is advisory and cpuset is strictly
> followed - in a dispute cpuset wins... except for file backed memory,
> then everyon loses and nothing is consistent.

That is what I believe why mpol_rebind_mm() a bit differently from the 
others and it is historically done this way a long time ago before 
cgroup v2.

For cgroup v1, mems_allowed can't be empty or you can't put any task 
into the cpuset. Also effective_mems is the same as mems_allowed. cgroup 
v2 is quite different in how it handles memory nodes and CPUs. Users can 
isn't forced to set mems_allowed and cpus_allowed as effective_mems and 
effective_cpus will inherit parent version if mems_allowed and 
cpus_allowed are not set. IOW, effective_mems will never be empty. Yes, 
it is a bug with the introduction of cpuset v2 that we should have 
replaced mems_allowed by effective_mems at that time. With v2, 
effective_mems should contain only online nodes. The only exception is 
during the short transition period when a memory node hotunplug 
operation is in progress when a write to cpuset.mems is happening at the 
same time. With v1, it is theoretically possible that none of the nodes 
in mems_allowed is online.

The reason why I am suggesting to use cs->effective_mems to keep the old 
cgroup v1 behavior. If the consensus is to use the output of 
guarantee_online_mems() for mpol_rebind_mm(), I will not be against that 
but it will be a slight change in user-visible behavior.

Cheers, Longman

>> Naturally I wonder: Why are we not using "task->mems_allowed" (maybe cs vs. tsk
>> was the original bug?), which is effectively just newmems?
>>
> Short answer: task->mems_allowed is protected by the task lock and we
> don't hold the task lock for a foreign task (not-current) over mm
> operations.
>
> Long answer: Reasons and "Stop looking at the spaghetti, it's going to
> break"
>
> ~Gregory
>

^ permalink raw reply

* Re: [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition)
From: YoungJun Park @ 2026-06-16  1:29 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	chengming.zhou, ljs, liam, vbabka, rppt, surenb, qi.zheng,
	axelrasmussen, yuanchu, weixugc, riel, gourry, haowenchao22,
	kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAO9r8zPj5EH8Mbpc6N+d1u2eEgoV33f+4s=v-84gaobAodPtUw@mail.gmail.com>

On Mon, Jun 15, 2026 at 12:56:26PM -0700, Yosry Ahmed wrote:
> On Sun, Jun 14, 2026 at 7:39 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Sun, Jun 14, 2026 at 4:20 AM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > ...
> > > > * Integration with swap.tier by Youngjun (see [12]). For now, I'm
> > > >   leaning towards opting out the vswap device from swap.tier entirely, and
> > > >   treat it as a special device. Integrating it with swap.tiers will
> > > >   benefit the cases where you want some cgroups to skip vswap for fast
> > > >   swap devices (pmem), whereas other should go through zswap first. But
> > > >   most other use cases, either the overhead of vswap will be acceptable
> > > >   (or not the bottleneck), or we can just disable CONFIG_VSWAP entirely :)
> > > >
> > > >   Youngjun, may I ask for your thoughts on this?
> > >
> > > Hi Nhat,
> > >
> > > Tier 1: VSWAP, Tier 2: ZSWAP ...
> > >
> > > I don't see any problem applying the desired functionality with the
> > > currently proposed mechanism and interface. With this, a user would be
> > > assigned the default Virtual -> RAM swap tier, and the overall picture
> > > becomes one where swap tiers are composed according to the priority
> > > setting.
> >
> > It's more - is there a strong argument to let vswap be a tier (which
> > is not supported by just turning of vswap altogether).
> >
> > Because right now I'm not exposing vswap device to userspace in any
> > manner, pretty much. It's abstract and transparent, and minimizes
> > complexity (no vswap and swap.tier interaction) and surfaces for
> > issues.
> 
> I definitely think vswap should *not* be a tier. First of all, a vswap
> entry can be backed by zswap or an actual swap device, which would be
> two different tiers. How does that work?
> 
> I also think vswap should not be exposed to userspace in any way, at
> least not now. I still think we should aim to just make the
> redirection layer always on and eliminate "vswap devices".

After following the answers and giving it some thought, I agree that
vswap should be kept user-transparent. If there is a strict need to
disable it, relying on CONFIG_VSWAP to remove it entirely seems like
the right approach.

If a strong use case for user interaction emerges in the future, we can
revisit the design and figure out how to handle it at that time.

Thanks,
Youngjun Park

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox